Extracting Relations from Italian Wikipedia using Unsupervised Information Extraction

Pierluigi Cassotti, Lucia Siciliani, Pierpaolo Basile, Marco de Gemmis and Pasquale Lops
Department of Computer Science, University of Bari Aldo Moro, Via E. Orabona, 4 - 70125 Bari, Italy

Abstract
In this paper, we describe WikiOIE, a framework for extracting relations from Wikipedia. The framework is based on UDPipe and the Universal Dependencies project for text processing, and it allows the information extraction (IE) approach used to automatically extract triples (subject, predicate, object) to be easily customized. In this work, we propose two unsupervised IE methods to extract triples from the Italian version of Wikipedia: the former is based only on PoS-tag patterns, while the latter also uses syntactic dependencies. The output of the extraction process is provided in JSON format and is used to build a simple web interface for searching and browsing the extracted triples. A preliminary evaluation conducted on a dataset sample shows promising results, although the approach is completely unsupervised.

Keywords
Open Information Extraction, Natural Language Processing, Information Retrieval, Wikipedia, Universal Dependencies

1. Introduction

Open Information Extraction (Open IE) aims to extract relations from the huge amount of text available on the Web. Relations take the form of relational tuples denoted as (arg1; rel; arg2). Given a relation, arg1 and arg2 can be nouns or phrases, while rel is a phrase denoting the semantic relation between them. The importance of Open IE is further enhanced by its role in several NLP applications such as Question Answering, Knowledge Graph Acquisition, Knowledge Graph Completion, and Text Summarization.

Given the nature of the task, approaches for Open IE are deeply intertwined with the language of the corpora to be analyzed. Due to the availability of English corpora, most state-of-the-art systems are specific to that language. For the Italian language, there is the approach proposed by Guarasci et al. [1], which is founded on verbal behavior patterns built upon Lexicon-Grammar features. The authors also produced a dataset built on the itWaC corpus, which is not currently available. WikiOIE proposes a framework in which Open IE approaches can be easily developed, with a focus on the Italian language (the framework can be extended to other languages, since it is based on Universal Dependencies), with the aim of encouraging researchers to develop approaches for under-represented languages.
The paper is structured as follows: Section 2 discusses related work, Section 3 provides details about WikiOIE and its methodology, Section 4 describes the dataset developed from the Italian version of Wikipedia, and a preliminary evaluation is reported in Section 5.

2. Related Work

Initially, the IE task focused on extracting, from small corpora, relations pre-specified by the user. Even though several works, like DIPRE [2], Snowball [3], and [4], tried to reduce the amount of annotated data necessary in the training phase, these systems still required substantial human intervention to obtain relevant results, and changing domain was difficult.

The shift to Open IE systems was proposed by [5], along with their system called TextRunner. In this case, the goal is to apply IE algorithms to extract relations from a large collection of documents that can cover any topic. For this reason, limiting the extraction process to a set of relations defined a priori is infeasible. TextRunner is composed of three main modules. The first one is a learner that exploits a parser to label training data as trustworthy or not and then uses the extracted information to train a Naive Bayes classifier. Next, the extractor uses PoS-tag features to obtain a set of candidate tuples from the corpus, which are then sent to the classifier, and only those labeled as trustworthy are kept. Finally, a module called assessor assigns a probability score to the tuples extracted at the previous step, based on the number of occurrences of each tuple in the corpus.

The learning-based approach used in TextRunner has also been adopted by several other systems, like WOE [6], OLLIE [7], and ReNoun [8]. In particular, WOE exploits Wikipedia-based bootstrapping: the system extracts the sentences matching the attribute-value pairs available within the info-boxes of Wikipedia articles. This data is then used to build two versions of the system: the first based on PoS-tags, regular expressions, and other shallow features of the sentence, the second based on features of dependency-parse trees; the latter obtains better results than the former but is slower. OLLIE is capable of identifying not only verb-based relations but also those involving nouns and adjectives. The approach first builds a training set by selecting the sentences that contain one of the 110,000 seed tuples taken from ReVerb. Then, using a dependency parser, the training set is used to learn a set of extraction pattern templates. OLLIE also tries to take into account the context of the sentence in the extraction process, extending the tuples with additional fields if necessary and assigning a confidence score to each tuple. Finally, ReNoun, as the name suggests, is a system that mainly focuses on extracting noun-based relations. It exploits a set of manually created lexical patterns used as seed facts to learn a more extensive set of patterns, based on the dependency parse of the sentences in the corpus and distant supervision. These patterns are then used to generate the final set of extractions, each one having an associated score that represents its correctness.

Another common approach to the Open IE task is represented by rule-based systems like ReVerb. ReVerb takes as input a sentence PoS-tagged and NP-chunked using OpenNLP.
The sentence is analyzed to check whether it complies with several syntactic and lexical constraints, and its relations are extracted. Given each relation phrase, the system then derives its arguments by looking for the nearest noun phrases to the left and to the right of the relation. Using a set of rules allows this kind of system to obtain relations of better quality, but the variety of the extracted relations is more limited than with other approaches.

Finally, there are approaches that exploit the presence of specific clauses within the text. This is the case of ClausIE [9], which uses the information derived from the dependency parse of the input sentence to determine the set of clauses that compose it. ClausIE then generates a proposition for each selected clause, determining which part represents the subject, the relation, and the arguments. Another system that uses this kind of approach is Stanford Open IE [10], where the dependency tree of the sentence is traversed to identify the subtrees that represent independent clauses. These clauses are then reduced, through natural logic inference, to shorter entailed sentences that can be easily transformed into triples.

3. Methodology

In this section, we describe our information extraction system called WikiOIE (the code is available on GitHub: https://github.com/pippokill/WikiOIE). The main WikiOIE features are:

• a completely unsupervised approach for extracting triples from Wikipedia dumps;
• text processing based on Universal Dependencies (https://universaldependencies.org/), which eases the integration of other languages in the future;
• a customizable pipeline supporting new processing and extraction algorithms;
• an integrated indexing and search engine for browsing the extracted facts.

The overall architecture of WikiOIE is presented in Figure 1. The input of the pipeline is the textual format of the Wikipedia dump obtained by the WikiExtractor tool [11], which is the only external component of our framework. It is possible to use other tools for extracting text from a Wikipedia dump, but our import module reads the simple XML format provided by WikiExtractor (https://github.com/attardi/wikiextractor/wiki/File-Format). The text extracted from the dump is processed by the UDPipe tool [12]. We use version 1 of UDPipe with version 2.5 of the ISDT-Italian model. UDPipe can be trained on Universal Dependencies data for over 100 languages, which potentially allows our system to be used on Wikipedia dumps in several languages. WikiOIE directly calls the REST API provided by UDPipe; in this way, it is easy to change the endpoint and the model/language.

Figure 1: The WikiOIE architecture of the information extraction process.

Another advantage of using Universal Dependencies is the common tag-set shared by all languages: PoS-tags (https://universaldependencies.org/u/pos/) and syntactic dependencies (https://universaldependencies.org/u/dep/) are annotated with a shared set of tags.

The Wikipedia dump is read line by line. Each line contains a fragment (passage) of text that is processed by UDPipe. The output of this process is a set of sentences, each annotated with syntactic dependencies. Each sentence is transformed into a dependency graph that is the input of the Wiki Extractor module. This module extracts facts from the sentence in the form of triples (subject, predicate, object) and assigns a score that takes into account the number of nouns involved in the subject/object.
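As an illustration of the annotation step, the following sketch (ours, not the actual WikiOIE code) sends one text passage to a UDPipe REST endpoint and turns the returned CoNLL-U annotation into simple token records. The endpoint URL and model name are assumptions for the publicly hosted UDPipe service and may differ from the ones actually used by WikiOIE.

    # Minimal sketch, assuming the public UDPipe REST service and an
    # illustrative Italian-ISDT model name; not the WikiOIE implementation.
    import requests

    UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"

    def annotate(text, model="italian-isdt-ud-2.5-191206"):
        params = {"tokenizer": "", "tagger": "", "parser": "",
                  "model": model, "data": text}
        resp = requests.post(UDPIPE_URL, data=params)
        resp.raise_for_status()
        conllu = resp.json()["result"]  # CoNLL-U output as plain text
        sentences, tokens = [], []
        for line in conllu.splitlines():
            if not line.strip() or line.startswith("#"):
                if tokens:                       # blank line closes a sentence
                    sentences.append(tokens)
                    tokens = []
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multi-word and empty nodes
                continue
            tokens.append({"id": int(cols[0]), "form": cols[1], "lemma": cols[2],
                           "upos": cols[3], "head": int(cols[6]), "deprel": cols[7]})
        if tokens:
            sentences.append(tokens)
        return sentences

The token records produced here (form, lemma, universal PoS-tag, head, dependency relation) are the only information the extraction sketches in the next section rely on.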
Triples are stored in a JSON format that reports not only the extracted triple but also the original text of the sentence, the output provided by UDPipe, the reference to the Wikipedia page, the title of the Wikipedia page, and the positions of the subject, predicate, and object within the sentence. All this information is useful for indexing and searching, and can be exploited by further post-processing steps during the indexing process. The details about both the JSON format and the indexing are provided in Section 4.

3.1. The Information Extraction process

The information extraction process is sketched in Figure 2. The input text is annotated by UDPipe, which provides the CoNLL-U format (https://universaldependencies.org/format.html) of each sentence occurring in the text. Each token in the sentence is denoted by an index (first column) corresponding to its position in the sentence (starting from 1). As depicted in Figure 2, the other columns are the features extracted by UDPipe, such as the token, the lemma, the universal PoS-tag, the head of the current word, and the universal dependency relation to the head (root if the head is equal to 0). Figure 2 also reports the dependency graph of the sentence, which is used by the Wiki Extractor module for extracting triples.

Figure 2: An example of UDPipe processing.

The extraction process that we implement in this first version of the pipeline is an entirely unsupervised approach based on both PoS-tags and the graph structure. The first step consists of identifying sequences of PoS-tags that match verbs, as reported in Table 1.

Table 1: Patterns of valid predicates.
PoS-tag pattern       | Example
AUX VERB ADP          | ... è nato nel ...
AUX VERB              | ... è nato ...
AUX=(essere, to be)   | ... è ...
VERB ADP              | ... nacque nel ...
VERB                  | ... acquisì ...

In Table 1, the first column reports the PoS-tag patterns, while the second one reports an example of pattern usage. The sentence shown in Figure 2 matches the last pattern (VERB, fondò). When the information extraction algorithm finds a valid predicate pattern, it checks for a candidate subject and object for the predicate. A valid subject/object candidate must match the following constraints:

1. it is a sequence of tokens belonging to the following PoS-tags: noun, adjective, number, determiner, adposition, proper noun;
2. the sequence can contain only one determiner and/or one adposition.

The candidate subject must precede the verb, while the candidate object must follow the predicate pattern. For the sentence in Figure 2, the candidate subject is Nakamura, while the candidate object is il quartier generale di il Kyokushin Karate (note that UDPipe splits the articulated preposition "del" into "di:ADP" and "il:DET"). After the identification of the candidate subject and object, we apply two strategies:

simple: the algorithm accepts the subject, predicate, and object as a valid triple and assigns a score. Two distinct scores are calculated for the subject and the object. For each noun occurring in the subject/object the score is incremented by 1, while for each proper noun it is incremented by 2. Finally, the score is multiplied by 1/l, where l is the number of tokens occurring in the subject/object. The idea is that a subject/object containing only nouns or proper nouns is more relevant. The final triple score is the sum of the scores assigned to the subject and the object.
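To make the simple strategy concrete, the sketch below is our own illustration of it (not the actual WikiOIE implementation): for brevity it handles only the bare VERB pattern from Table 1, collects candidate spans under constraints 1-2, and applies the scoring described above. Token records follow the format of the annotation sketch in Section 3, and all function names are ours.

    # Illustrative sketch of the "simple" strategy; names are ours, not WikiOIE's.
    # `sent` is a list of token dicts with "upos" fields.

    ALLOWED = {"NOUN", "ADJ", "NUM", "DET", "ADP", "PROPN"}

    def span_score(tokens):
        # +1 per noun, +2 per proper noun, multiplied by 1/l (l = span length).
        score = sum(1 if t["upos"] == "NOUN" else 2 if t["upos"] == "PROPN" else 0
                    for t in tokens)
        return score / len(tokens) if tokens else 0.0

    def collect_span(sent, start, step):
        # Walk left (step=-1) or right (step=+1) collecting allowed PoS-tags,
        # stopping at a second determiner or adposition (constraint 2).
        span, det, adp = [], 0, 0
        i = start
        while 0 <= i < len(sent) and sent[i]["upos"] in ALLOWED:
            det += sent[i]["upos"] == "DET"
            adp += sent[i]["upos"] == "ADP"
            if det > 1 or adp > 1:
                break
            span.append(sent[i])
            i += step
        return span[::-1] if step == -1 else span

    def extract_simple(sent):
        triples = []
        for i, tok in enumerate(sent):
            if tok["upos"] != "VERB":   # only the bare VERB pattern of Table 1
                continue
            subj = collect_span(sent, i - 1, -1)   # subject precedes the verb
            obj = collect_span(sent, i + 1, +1)    # object follows the verb
            if subj and obj:
                triples.append({"subject": subj, "predicate": [tok], "object": obj,
                                "score": span_score(subj) + span_score(obj)})
        return triples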
simpledep: in this case, the triple is accepted only if both the subject and the object have a syntactic relation with the verb. In particular, one of the tokens belonging to the subject/object must have a dependency relation with a token of the verb pattern. Considering the sentence in Figure 2, the subject token "Nakamura" is connected with the verb "fondò", while the object token "quartier" is linked with the same verb, which makes the triple valid. The triple score is computed with the same strategy adopted for the simple approach. The idea behind this approach is to validate the triple by verifying the existence of a syntactic relation between the verb and the subject/object.

4. Dataset

We provide three datasets in JSON format, as shown in Listing 1 (a sample of the dataset is available at https://github.com/pippokill/WikiOIE/tree/master/sample; we will upload the full dataset to Zenodo). The first dataset is the output of the UDPipe process. It contains all the text passages extracted by WikiExtractor and processed by UDPipe. In particular, the JSON reports: the id of the Wikipedia page, the title of the Wikipedia article, the text, and the UDPipe output in CoNLL-U format. In this dataset, the triples array is empty, since triples have not yet been extracted. This dataset contains 23,602,140 text passages.

{"id":"3859480",
 "title":"Anthony Joshua",
 "text":"Lo spettacolare e drammatico...",
 "conll":"1 Lo lo DET...",
 "triples":[
  {"subject":{"span":"drammatico incontro","start":3,"end":5,"score":1.0},
   "predicate":{"span":"vide","start":5,"end":6,"score":1.0},
   "object":{"span":"Joshua","start":6,"end":7,"score":3.0},
   "score":4.0},
  {"subject":{"span":"Joshua","start":6,"end":7,"score":3.0},
   "predicate":{"span":"subire","start":7,"end":8,"score":1.0},
   "object":{"span":"il primo knockdown in carriera","start":8,"end":13,"score":0.6},
   "score":3.6}
 ]}

Listing 1: JSON structure of the dataset.

This dataset can be used by the WikiOIE tool for extracting triples using several strategies. In this version, we include the approaches described in Section 3: simple and simpledep. The system can be easily extended by implementing other strategies. The other two datasets that we provide contain the triples extracted by simple and simpledep, respectively. In these datasets the conll value is empty, while the triples field stores the JSON array of extracted triples. Table 2 reports the number of triples extracted and the number of distinct subjects, predicates, and objects for each dataset.

Table 2: Dataset statistics.
Dataset   | #triples  | #dist. subj | #dist. pred | #dist. obj
simple    | 5,739,830 | 2,184,493   | 387,706     | 2,981,188
simpledep | 3,562,803 | 1,298,481   | 269,551     | 2,030,742

We observe that the number of triples extracted by simpledep is lower than the number extracted by simple, since simpledep removes triples in which the subject or the object is not linked to the verb. Moreover, WikiOIE provides two useful tools: one for indexing the triples dataset using Apache Lucene (https://lucene.apache.org/) and one for exporting only the triples in TSV format. During the indexing or the export, it is possible to apply a further post-processing or filtering step; for example, it is possible to remove triples whose predicate occurs fewer times than a specified threshold. This feature is important for removing non-relevant triples. Table 3 shows the ten most frequent predicates extracted by simpledep.

Table 3: Most frequent predicates extracted by simpledep.
Predicate       | Occurrences
ha              | 55,443
divenne         | 24,361
ebbe            | 23,363
raccoglie       | 19,650
fu prodotto da  | 18,520
fa              | 16,664
aveva           | 16,078
presenta        | 14,445
comprende       | 13,629
ha battuto in   | 11,599
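The frequency-based filter applied at export time can be reproduced with a few lines over the JSON dataset. The sketch below is illustrative only (it is not the Lucene-based tool shipped with WikiOIE); it assumes one JSON object per line with the structure of Listing 1, and the file names and threshold are placeholders.

    # Sketch of a frequency-based export: keep only triples whose predicate
    # occurs at least `min_freq` times, then write them as TSV.
    import json
    from collections import Counter

    def export_tsv(in_path, out_path, min_freq=20):
        # First pass: count predicate surface forms.
        counts = Counter()
        with open(in_path, encoding="utf-8") as f:
            for line in f:
                page = json.loads(line)
                counts.update(t["predicate"]["span"] for t in page["triples"])
        # Second pass: write subject / predicate / object / score rows.
        with open(in_path, encoding="utf-8") as f, \
             open(out_path, "w", encoding="utf-8") as out:
            for line in f:
                page = json.loads(line)
                for t in page["triples"]:
                    if counts[t["predicate"]["span"]] >= min_freq:
                        out.write("\t".join([t["subject"]["span"],
                                             t["predicate"]["span"],
                                             t["object"]["span"],
                                             str(t["score"])]) + "\n")

    # Example (hypothetical file names):
    # export_tsv("wikioie_simpledep.json", "triples.tsv", min_freq=20)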
5. Evaluation

For the evaluation, we randomly select a subset of 200 triples from each of the simple and simpledep datasets. The triples are selected among those whose predicate occurs at least 20 times. Then, each triple is annotated by two experts as relevant (valid) or not relevant. We use Cohen's kappa coefficient (K) to measure the pairwise agreement between the two experts. K is a more robust measure than a simple percent agreement calculation, since it takes into account the agreement occurring by chance. Higher values of K correspond to higher inter-rater reliability.

The Open IE task lacks a formal definition of triple relevance. A first attempt to define triple relevance is reported in [13]. The core aspects of this definition are assertiveness, minimalism, and completeness: the extracted triples should be asserted by the original sentence, they should be compact and self-contained, and all the relations asserted in the input text should be extracted. In our evaluation, we decide to give less weight to minimalism and to focus more on extraction completeness. After the annotation, we compute the ratio of relevant triples for each dataset and expert. Cohen's kappa coefficient for each dataset is provided in the last column of Table 4.

Table 4: Results of the annotation process.
Dataset   | #valid (exp 1) | ratio (exp 1) | #valid (exp 2) | ratio (exp 2) | Kappa C.
simple    | 96             | 0.48          | 110            | 0.55          | 0.42
simpledep | 115            | 0.64          | 161            | 0.81          | 0.24

Results in Table 4 show that the simpledep approach provides better results, since it can filter out triples in which the subject or the object does not have a syntactic link with the verb. However, Cohen's kappa coefficient shows a low agreement between the experts, and in the case of the simpledep method the agreement is even lower. A possible solution would be to refine the annotation by introducing a scale of relevance (e.g., from 1 to 5). Moreover, the annotation results may be affected by the small set of triples considered. We plan to extend the annotation to a wider range of extracted triples and to employ more annotators.

Further, we perform a detailed analysis of the triples considered non-relevant by both annotators, classifying errors into four types:

1. Generic subject: the triple subject cannot be linked to a real-world entity;
2. Missing subject: the extracted subject is not a real-world entity;
3. Incomplete object: the extracted object carries incomplete information;
4. Generic relation: the whole triple cannot be self-explained.

In Table 5, we report examples for each type of error.

Table 5: Instances of error types.
Error type        | Subject                  | Predicate      | Object
Generic subject   | Quattro di queste specie | beneficiano di | una protezione speciale
Incomplete object | Questa voce              | incorpora      | informazioni
Generic relation  | il montaggio             | è ridotto a    | il minimo
Missing subject   | In quel momento          | arriva         | Pitch
Missing subject   | L'anno seguente          | passò a        | lo Start

A frequent error is the extraction of generic subjects, incomplete objects, or entirely generic triples. These triples cannot be self-contained, since they require contextual information (e.g., the source Wikipedia page or the source sentence). For instance, generic triples often involve demonstrative adjectives whose referents are not captured in the extracted triple. Another common error is missing subjects. Analyzing this type of error, we find that most of these cases occur when complements of time or place are recognized as the triple subject, e.g. In quel momento - arriva - Pitch. To overcome this issue, it is possible to recognize and filter complements of time and place when they appear in the subject.
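One possible way to implement such a filter is sketched below. It is our own illustration, not a feature of the current WikiOIE release, and it rests on the assumption that temporal and locative complements such as "In quel momento" are typically attached to the verb as obl obliques in the Universal Dependencies annotation, whereas genuine subjects bear an nsubj relation. Token records follow the format of the annotation sketch in Section 3.

    # Possible post-hoc filter (our sketch): reject a candidate subject when
    # none of its tokens is an nsubj of the predicate head, or when it is
    # attached as an oblique (obl), the usual UD attachment of time/place
    # complements.

    def is_plausible_subject(subject_tokens, predicate_tokens):
        pred_ids = {t["id"] for t in predicate_tokens}
        has_nsubj = any(t["deprel"].startswith("nsubj") and t["head"] in pred_ids
                        for t in subject_tokens)
        is_oblique = any(t["deprel"].startswith("obl") and t["head"] in pred_ids
                         for t in subject_tokens)
        return has_nsubj and not is_oblique

Under these assumptions, a triple whose extracted subject is "In quel momento" would be discarded, since that span contains no nsubj dependent of the verb "arriva".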
6. WikiOIE search engine

To facilitate the search and browsing of the extracted triples, we developed a web interface based on the index built by WikiOIE. In particular, using the Lucene search engine it is possible to search for triples containing specific tokens in selected triple fields (e.g., subject, predicate, object, or a combination of them). Figure 3 shows the result of a query for all triples containing the word 'Amiga' in the subject. For each triple, the user interface reports the score provided by the search engine and the score assigned to the triple.

Figure 3: Search engine results.

Figure 4: Triple details.

For each triple, it is possible to visualize details by clicking on the View button. The details page (Figure 4) shows the triple, the sentence from which it was extracted, and the title of the Wikipedia page. Moreover, all the other triples extracted from the same page are also visualized.

7. Conclusions and Future Work

In this paper, we propose WikiOIE, a framework for open information extraction on Wikipedia dumps. The tool exploits UDPipe for processing and annotating the text and can be extended with several information extraction approaches: new extraction algorithms can be added by implementing a simple software interface. In this work, we propose two simple unsupervised methods for the Italian version of Wikipedia, both relying on heuristics over PoS-tags and syntactic dependencies. We apply these methods to the Italian dump of Wikipedia, extracting millions of triples. A small subset of the extracted triples was evaluated by two experts, obtaining promising results. However, the Kappa coefficient shows a low agreement between annotators. To improve the annotators' agreement and get a more reliable overview of the system performance, we plan to extend the evaluation to a larger-scale study. Moreover, as future work, we plan to improve the extraction accuracy by implementing supervised approaches; in particular, the idea is to collect users' feedback through the web interface in order to build a training set.

References

[1] R. Guarasci, E. Damiano, A. Minutolo, M. Esposito, G. De Pietro, Lexicon-grammar based open information extraction from natural language sentences in Italian, Expert Systems with Applications 143 (2020) 112954.
[2] S. Brin, Extracting patterns and relations from the world wide web, in: International Workshop on the World Wide Web and Databases, Springer, 1998, pp. 172–183.
[3] E. Agichtein, L. Gravano, Snowball: Extracting relations from large plain-text collections, in: Proceedings of the Fifth ACM Conference on Digital Libraries, 2000, pp. 85–94.
[4] E. Riloff, R. Jones, et al., Learning dictionaries for information extraction by multi-level bootstrapping, in: AAAI/IAAI, 1999, pp. 474–479.
[5] O. Etzioni, M. Banko, S. Soderland, D. S. Weld, Open information extraction from the web, Communications of the ACM 51 (2008) 68–74.
[6] F. Wu, D. S. Weld, Open information extraction using Wikipedia, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 118–127.
[7] M. Schmitz, S. Soderland, R. Bart, O. Etzioni, et al., Open language learning for information extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 523–534.
[8] M. Yahya, S. Whang, R. Gupta, A. Halevy, ReNoun: Fact extraction for nominal attributes, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 325–335.
[9] L. Del Corro, R. Gemulla, ClausIE: clause-based open information extraction, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 355–366.
[10] G. Angeli, M. J. J. Premkumar, C. D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 344–354.
[11] G. Attardi, WikiExtractor, https://github.com/attardi/wikiextractor, 2015.
[12] M. Straka, J. Straková, Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe, in: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 88–99. URL: http://www.aclweb.org/anthology/K/K17/K17-3009.pdf.
[13] G. Stanovsky, I. Dagan, Creating a large benchmark for open information extraction, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2300–2305.