Open Information Extraction on German Wikipedia Texts

Christian Klose, Zhou Gui and Andreas Harth
Friedrich-Alexander-Universität Erlangen-Nürnberg, Chair of Technical Information Systems, Lange Gasse 20, 90403 Nürnberg, Germany

Abstract
Knowledge Graphs are becoming a fundamental building block for semantic search and voice assistants. This paper deals with automated Knowledge Graph Construction from unstructured data. The focus is predominantly on Open Information Extraction (Open IE), an unsupervised learning approach that attempts to extract triples from plain text independent of their domain; it is thus the first step towards automated Knowledge Graph Construction. Previous work mainly applied Open IE to English texts. In this paper, the focus is on German texts. Due to the lack of German Open Information Extraction datasets, a dataset is created on the basis of Wikipedia. Two Open Information Extraction systems for German are introduced, and their performance is evaluated.

Keywords
Open Information Extraction, Natural Language Processing, Knowledge Graph Construction

1. Introduction

In his vision of the Semantic Web [1], Tim Berners-Lee described a change from the Web of documents for and by people to a Web of information. According to his vision, information on the Web should be manipulable not only by humans, but also by machines. Most documents on the World Wide Web consist to a large extent of text and are still difficult for machines to process today. For this reason, the World Wide Web Consortium (W3C) has developed a universal language, the Resource Description Framework (RDF), which makes information on the Web accessible to machines. Information in RDF can be serialized in multiple formats. One common format is Turtle, a text representation of an RDF graph that allows RDF triples to be stored in a compact and human-readable form. A large collection of RDF graphs in a specific domain can constitute a Knowledge Graph (KG).
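To illustrate, a single resource with two properties could be serialized in Turtle as follows; the ex: namespace and the property names are invented for this sketch, not taken from any particular vocabulary:

```turtle
@prefix ex: <http://example.org/> .

# Two triples sharing the subject ex:Nuernberg; the semicolon
# repeats the subject for the next predicate-object pair.
ex:Nuernberg ex:population "518370" ;
             ex:locatedIn  ex:Germany .
```

The semicolon notation is what makes Turtle compact: the subject is written once and shared by several predicate-object pairs.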
Virtual assistants in particular can make use of the facts, events and abstract concepts stored in Knowledge Graphs to bring insights to people during semantic search or question answering. In order to build Knowledge Graphs, knowledge can be extracted in the form of triples from documents that are available in natural language. The transformation of text into a machine-readable form is therefore a core task for building Knowledge Graphs. It can be broadly summarized as the goal of Machine Reading [2]. In the field of AI, Machine Reading is a long-standing goal and is discussed in the research community under the term Information Extraction (IE). IE includes downstream tasks such as Named Entity Recognition (NER), Relation Extraction (RE) and Entity Linking (EL). In recent years, an unsupervised approach to Relation Extraction, namely Open Information Extraction, has shown promising results and is therefore the main subject of this paper.

Text2KG 2022: International Workshop on Knowledge Graph Generation from Text, co-located with ESWC 2022, May 05-30-2022, Hersonissos, Crete, Greece. christian.klose@fau.de (C. Klose); zhou.gui@fau.de (Z. Gui); andreas.harth@fau.de (A. Harth). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

The paper is structured as follows: Section 2 outlines previous work. Section 3 describes the scientific approach applied in our research. Section 4 presents and discusses the results. Finally, Section 5 draws a conclusion.

2. Related Work

Open Information Extraction is about extracting all possible triples from text without knowing a priori the relations or entities occurring in it.
The first Open IE system ever created is TextRunner [3], a learning-based system developed by a group of researchers at the University of Washington. In the years that followed, other systems were introduced, each attempting to improve on the state-of-the-art results by overcoming identified weaknesses and flaws in earlier systems. In addition, types other than learning-based systems emerged, namely rule-based systems, clause-based systems and systems making use of inter-proposition relationships [4]. Rule-based systems depend entirely on hand-crafted rules or patterns. Systems that follow this approach are, for example, KrakeN [5] and Exemplar [6]. In order to improve the precision of the systems described above, the idea of breaking down complex sentences into smaller components (clauses) came up. Two clause-based systems in particular are worth mentioning: ClausIE [7] and Stanford Open IE [8]. The system types mentioned so far have one common weakness: none of them uses context, and therefore correct extractions cannot be guaranteed. Inter-proposition-based systems try to bridge this gap. Systems in this category are, for example, RelNoun [9], OpenIE4 [10], NestIE [11] and MinIE [12]. Among the first to apply neural networks to Open IE were [13] with RnnOIE, who formulated the problem as a sequence labeling task. Recently proposed models that follow a sequence labeling approach are SenseOIE [14], SpanOIE [15] and iRankOIE [16]. A downside of this approach, discovered by [17], is that sequence labeling models are not able to change the sentence structure or to use new auxiliary words in an extraction. [18] used a different neural approach called sequence generation to develop CopyAttention and overcome that downside. Furthermore, [19, 20, 17] describe end-to-end approaches using seq2seq models based on the encoder-decoder framework, alleviating both these downsides and the need for hand-crafted patterns.

3. Research Method

3.1.
Dataset Creation

One of the main challenges within Open IE is to verify the quality of the extractions made by a system. A solution to this is a dataset with high-quality training and test data in which triples are mapped to sentences. Currently, two approaches to automatically generating training data are considered particularly useful among researchers: first, the infobox-matching approach [21], where Wikipedia infobox values are linked to sentences in the corpus, and second, the distantly supervised approach [22], where existing knowledge bases are used to heuristically map triples to sentences. For the creation of the German Wikipedia dataset, the infobox-matching approach is used. In total, five steps were executed to create a clean dataset: 1) finding and downloading a Wikipedia dataset, 2) preprocessing and cleaning the text, 3) matching all infobox triples to the corresponding page text, 4) matching the triples on sentence level and 5) filtering out noisy training examples. After the last step, 6,453 triples mapped to 5,372 sentences containing 1,324 relation types were derived. Furthermore, the average number of words used in each part was calculated: on average, the subject has 2.0 words, the predicate 1.0 and the object 1.5. Last but not least, the average length of a sentence was computed and amounts to 22.8 words. The dataset and the code used to create it are published and freely accessible.2

3.2. Open IE Systems

In total, two systems were implemented and used for our research. The first system, turCy, is implemented as a spaCy3 pipeline component to leverage spaCy's POS tagger and Dependency Parser (DP). The second system uses an encoder-decoder seq2seq neural model that we call NeuralGerOIE.

3.2.1. TurCy

Research by [5] and [23] implies that with a decent number of POS and DP patterns, a large variety of triples can be extracted, independently of any other constraints.
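The core idea of such pattern-based extraction can be sketched in a few lines of Python. The toy Node class below stands in for spaCy's Token objects (which carry the real POS tags and TIGER dependency labels for German); the pattern layout and the function names are invented for this illustration and do not reflect turCy's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy stand-in for a spaCy token: surface form, POS tag,
    dependency label and child nodes in the parse tree."""
    text: str
    pos: str
    dep: str
    children: list = field(default_factory=list)

def match(node, pattern, found):
    """Recursive sub-tree search: walk the tree from the root and
    record every node whose (POS, dependency) pair satisfies a
    sub-pattern, keyed by the triple part it belongs to."""
    for part, pos, dep in pattern:
        if node.pos == pos and node.dep == dep:
            found.setdefault(part, []).append(node.text)
    for child in node.children:
        match(child, pattern, found)
    return found

# Hand-built dependency tree for "Im Jahr 2019 zählte Nürnberg
# 518370 Bewohner." (simplified; a real parse has more nodes).
obj = Node("Bewohner", "NOUN", "oa", [Node("518370", "NUM", "nk")])
root = Node("zählte", "VERB", "ROOT",
            [Node("Nürnberg", "PROPN", "sb"), obj])

# One pattern = one sub-pattern per triple part (subject, predicate, object).
pattern = [("subj", "PROPN", "sb"),
           ("pred", "VERB", "ROOT"),
           ("obj", "NOUN", "oa")]

parts = match(root, pattern, {})
triple = (parts["subj"][0], parts["pred"][0], parts["obj"][0])
# triple == ("Nürnberg", "zählte", "Bewohner")
```

A match here requires at least one token per triple part, mirroring the matching condition described for the Triple Extractor below; the real system additionally checks the edges between nodes and assembles multi-token parts.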
TurCy follows a similar approach. In fact, it is a pattern-learning system for binary extractions that is assembled from two essential components: the Pattern Builder and the Triple Extractor. A pattern consists of nodes that represent the POS tags of a sentence; the relations between the nodes reflect the dependency parse tree. A pattern with respect to a sentence can be used to represent exactly one triple. The pattern itself consists of subpatterns. Each subpattern represents a node in a tree and maps a word with left and right child nodes. For the sentence "Im Jahr 2019 zählte Nürnberg 518370 Bewohner." the POS and dependency tree is shown in Figure 1. The Triple Extractor of turCy is, at its core, a recursive sub-tree search using the pattern list. The algorithm starts at the root node, traverses each path up to the leaves of the tree and checks whether a node and its edge match a sub-pattern of a respective pattern. If all sub-patterns match, the triple is assembled and stored with respect to the sentence. Notice that a match implies that at least one token of each of the three parts (subject, predicate, object) of a triple was found during the recursive sub-tree search in the sentence. However, the true nature of the algorithm is more complicated. If the interested reader wants to fully understand the working mechanism, we recommend diving into the code. Therefore, and also to ensure full transparency of our research, turCy has been packaged as a Python library and is released under an open-source license.4

2 https://github.com/ChrisDelClea/WikiGerman4OIE
3 Library for advanced natural language processing: https://spacy.io
4 https://github.com/ChrisDelClea/turCy

Figure 1: POS-DP tree or pattern representation of a simple sentence. The colored nodes are the parts of the triple in the pattern, with orange being the subject(s), purple being the predicate(s), and blue being the object(s). All other nodes are colored green.

3.2.2.
NeuralGerOIE5

The latest state-of-the-art Open IE systems use seq2seq neural networks. Therefore, the second system developed follows the sequence generation approach and was created using the Simple Transformers6 library. For training the model, the WikiGerman4OIE dataset (Section 3.1) was utilized. In addition, a pre-trained BART model for German was used.7 It is important to mention that the model was trained to output multiple extractions within the scope of one subject. We fine-tuned the model for 10 epochs using a batch size of 8. The maximum length of the input sentence was set to 300, i.e. all words thereafter were truncated. A difference between the seq2seq models discussed earlier and the approach described here is the type of separator used. While [18] used start and end tags for each part of a triple (e.g. <arg1> Deep Learning </arg1> <rel> is </rel> <arg2> a subfield of Machine Learning </arg2>), we found that a single token before each part, along with a final end token, was sufficient. In addition, the model struggled to output the same separator token multiple times within a sequence; therefore, we added a number to each separator token. Lastly, we noticed that the names of the separator tokens affected the quality of the outputs. We proceeded with a triple input in which each part (Deep Learning / is / a subfield of Machine Learning) is preceded by its numbered separator token and the sequence is closed by an end token. In total, the fine-tuning took about 2 hours 25 minutes to complete on an Nvidia Tesla V100 32GB GPU. The results are discussed in the subsequent section.

5 https://github.com/ChrisDelClea/NeuralGermanOIE
6 https://simpletransformers.ai/
7 https://huggingface.co/Shahm/bart-german

4. Results

Each of the Open IE systems is evaluated against a gold dataset. High-quality annotations are needed in order to obtain accurate insights into the quality of the extractions. Therefore, a subset of the WikiGerman4OIE dataset (Section 3.1) was annotated by two of the authors. In total, the gold dataset consists of 47 sentences and 175 triples.
On average, a sentence contains 3.8 triples. This subset is the basis for the evaluation process. Regarding the quantitative metrics, precision, recall and F1-score are computed using the token-based evaluation method introduced by [24], slightly adjusted to fit the systems' outputs. The evaluation process is as follows: First, the full WikiGerman4OIE dataset for turCy-large and the gold dataset for turCy-small were used to create the patterns with the Pattern Builder. The results were two pattern lists with sizes of 6,453 and 175 patterns, respectively, corresponding to the number of triples in each dataset. Second, the 47 sentences were fed as the only input into the Triple Extractor and the NeuralGerOIE prediction function. Lastly, precision, recall and F1-score were calculated.

Table 1: System accuracy analysis using default metrics.

System        | # Patterns | # Extractions | # Matches | Prec. | Recall | F1
turCy-large   | 6,453      | 29            | 21/175    | 0.71  | 0.11   | 0.19
turCy-small   | 175        | 114           | 94/175    | 0.83  | 0.52   | 0.64
NeuralGerOIE  | -          | 52            | 40/175    | 0.72  | 0.30   | 0.43

In general, we found that turCy-small achieves a better F1-score than NeuralGerOIE due to its ability to extract many triples. At the same time, we noticed that NeuralGerOIE obtained a very high precision for sentences annotated with a single triple. A comparison between turCy-small and turCy-large (the only difference being the number of patterns and their dataset of origin) indicates that the quality of the automatically generated dataset is lower than initially expected. The reason for this assumption is that while turCy-large contains a high number of patterns built from the automatically generated dataset, only 29 triples were extracted. TurCy-small, on the other hand, yielded significantly more extractions with a lower number of patterns created from the gold dataset. In fact, this result is counter-intuitive, as one would expect the number of extractions to be linearly correlated with the number of patterns.
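The token-level scoring can be sketched as follows. This is a simplified version of the metric from [24]: the example triples and the plain containment matching are invented for illustration, and the real metric additionally aligns predicted triples with gold triples and weighs the parts, which is omitted here:

```python
def token_scores(pred, gold):
    """Simplified token-level precision/recall for one (predicted, gold)
    triple pair: precision = matched tokens / predicted tokens,
    recall = matched tokens / gold tokens."""
    pred_tokens = [t for part in pred for t in part.split()]
    gold_tokens = [t for part in gold for t in part.split()]
    matched = sum(1 for t in pred_tokens if t in gold_tokens)
    prec = matched / len(pred_tokens) if pred_tokens else 0.0
    rec = matched / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

pred = ("Nürnberg", "zählte", "518370 Bewohner")
gold = ("Nürnberg", "zählte", "518370 Bewohner im Jahr 2019")
prec, rec, f1 = token_scores(pred, gold)
# prec == 1.0 (all 4 predicted tokens appear in the gold triple),
# rec == 4/7 (the gold triple has 7 tokens, 4 of them matched)
```

This illustrates why an incomplete but correct extraction is penalized in recall rather than in precision, which matches the pattern seen in Table 1 for turCy-large.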
Moreover, when comparing turCy and NeuralGerOIE, we found that while turCy can only output words from the text in its extractions, the neural model can learn a direct mapping between the words used in the text and the corresponding words in the gold dataset. In addition to the quality analysis, we examined the run-time, as Open IE systems are intended to process large amounts of text data at a rapid pace. In order to make a judgment about the run-time, two metrics are taken into consideration: first, the number of sentences processed per second (# sent./sec.), and second, the number of triples yielded per second (# triples/sec.). Furthermore, the ratio of these two metrics with respect to the number of stored patterns is of interest. Table 2 shows that turCy-small is the best-performing system, followed by NeuralGerOIE. As expected, the number of patterns has a major impact on the run-time, as we can see with turCy-large, leading to the conclusion that one of the main objectives of rule-based Open IE systems is to keep the number of patterns as small as possible, but as large as necessary to maximize the number of extractions.

Table 2: Run-time comparison of the Open IE systems.

System        | # Patterns | # sent./sec. | # triples/sec.
turCy-large   | 6,453      | 0.47         | 0.71
turCy-small   | 175        | 1.12         | 2.7
NeuralGerOIE  | -          | 1.14         | 1.14

5. Conclusion & Outlook

The contribution of this paper is twofold. First, a dataset for Open Information Extraction based on German Wikipedia texts was created and published. Second, two different approaches for Open IE were implemented and evaluated. Several interesting research directions for future work can be recommended. We firmly believe that there is still potential for improvement in terms of dataset quality and quantity. For instance, in order to improve the quality, a crowd-sourcing platform such as Amazon Mechanical Turk could be leveraged. In addition, the distantly supervised approach for automated training data generation can be explored.
In doing so, it would also help to determine what impact the applied approach for automated training data generation has on the quality of the extractions made by the Open IE system. Finally, the two Open IE systems can be further optimized to yield better results. For example, for the rule-based system turCy, tree-pruning approaches can be explored to reduce the overall number of patterns and thereby improve extraction speed. Regarding NeuralGerOIE, multi-subject extractions, different architectures and the utilization of more recent language models might be considered.

6. Acknowledgments

This research paper was created within the scope of the project Software Campus 2.0 (FAU), grant number 01IS17045. The project was funded by the German government, whom we kindly thank for their sponsorship.

References

[1] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Scientific American (2001) 34–43.
[2] O. Etzioni, M. Banko, M. Cafarella, Machine reading, 2007, pp. 1–5.
[3] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, IJCAI International Joint Conference on Artificial Intelligence (2007) 2670–2676.
[4] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, arXiv preprint arXiv:1806.05599 (2018).
[5] A. Akbik, A. Löser, KrakeN: N-ary facts in open information extraction, in: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), 2012, pp. 52–56. URL: https://www.aclweb.org/anthology/W12-3010.
[6] F. Mesquita, J. Schmidek, D. Barbosa, Effectiveness and efficiency of open relation extraction (2013) 447–457. URL: https://www.aclweb.org/anthology/D13-1043.
[7] L. Del Corro, R. Gemulla, ClausIE: clause-based open information extraction, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 355–366.
[8] G. Angeli, M. J. J. Premkumar, C. D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 344–354.
[9] H. Pal, et al., Demonyms and compound relational nouns in nominal open IE, in: Proceedings of the 5th Workshop on Automated Knowledge Base Construction, 2016, pp. 35–39.
[10] Mausam, Open information extraction systems and downstream applications, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016, pp. 4074–4077.
[11] N. Bhutani, H. Jagadish, D. Radev, Nested propositions in open information extraction, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 55–64.
[12] K. Gashteovski, R. Gemulla, L. Del Corro, MinIE: minimizing facts in open information extraction, Association for Computational Linguistics, 2017.
[13] G. Stanovsky, J. Michael, L. Zettlemoyer, I. Dagan, Supervised open information extraction, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 885–895.
[14] A. Roy, Y. Park, T. Lee, S. Pan, Supervising unsupervised open information extraction models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 728–737.
[15] J. Zhan, H. Zhao, Span model for open information extraction on accurate corpus, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 9523–9530.
[16] Z. Jiang, P. Yin, G. Neubig, Improving open information extraction via iterative rank-aware learning, arXiv preprint arXiv:1905.13413 (2019).
[17] K. Kolluru, S. Aggarwal, V. Rathore, S. Chakrabarti, et al., IMoJIE: Iterative memory-based joint open information extraction, arXiv preprint arXiv:2005.08178 (2020).
[18] L. Cui, F. Wei, M. Zhou, Neural open information extraction, arXiv preprint arXiv:1805.04270 (2018).
[19] P.-L. H. Cabot, R. Navigli, REBEL: Relation extraction by end-to-end language generation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2370–2381.
[20] K. Kolluru, V. Adlakha, S. Aggarwal, S. Chakrabarti, et al., OpenIE6: Iterative grid labeling and coordination analysis for open information extraction, arXiv preprint arXiv:2010.03147 (2020).
[21] F. Wu, D. S. Weld, Open information extraction using Wikipedia, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 118–127.
[22] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011. URL: https://www.aclweb.org/anthology/P09-1113.
[23] Mausam, M. Schmitz, S. Soderland, R. Bart, O. Etzioni, Open language learning for information extraction, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 523–534. URL: https://www.aclweb.org/anthology/D12-1048.
[24] W. Lechelle, F. Gotti, P. Langlais, WiRe57: A fine-grained benchmark for open information extraction (2019) 6–15.