Open Information Extraction on German Wikipedia Texts

Open Information Extraction on German Wikipedia Texts ChristianKlose christian.klose@fau.de Chair of Technical Information Systems Friedrich-Alexander-Universität Erlangen-Nürnberg

Lange Gasse 20 90403 Nürnberg Germany

ZhouGui zhou.gui@fau.de Chair of Technical Information Systems Friedrich-Alexander-Universität Erlangen-Nürnberg

Lange Gasse 20 90403 Nürnberg Germany

AndreasHarth andreas.harth@fau.de Chair of Technical Information Systems Friedrich-Alexander-Universität Erlangen-Nürnberg

Lange Gasse 20 90403 Nürnberg Germany

Open Information Extraction on German Wikipedia Texts 1613-0073 BDE4B2D95209E95C2CEACF1193A25DD3 GROBID - A machine learning software for extracting information from scholarly documents Open Information Extraction Natural Language Processing Knowledge Graph Construction

Knowledge Graphs are becoming a fundamental building block for semantic search and voice assistants. This paper deals with the automated Knowledge Graph Construction from unstructured data. Predominantly, the focus is on Open Information Extraction (Open IE), an unsupervised learning approach that attempts to extract triples from plain text independent of their domain. Hence, it is the first step towards automated Knowledge Graph Construction. Previous work mainly applied Open IE to English texts. In this paper, the focus is on German texts. Due to the lack of German Open Information Extraction datasets, a dataset on the basis of Wikipedia is created. Two Open Information Extraction Systems for German are introduced. Finally, the performance of the systems are evaluated.

Introduction

In his vision of the Semantic Web [1], Tim Berners-Lee described a change from the Web of documents for and by people to a Web of information. According to his vision, information on the Web should not only be manipulable by humans, but also by machines. Most documents in the World Wide Web consist to a large extent of text and are still difficult for machines to process today. For this reason, the W3C 1 has developed a universal language Resource Description Framework (RDF), which makes information for machines on the Web accessible. Information in RDF can be serialized in multiple formats. One common format is Turtle. Turtle is a text representation of an RDF Graph which allows to store RDF triples in a compact and human readable form. A large collection of RDF Graphs in a specific domain can construct a Knowledge Graph (KG). Virtual assistants in particular can make use of facts, events and abstract concepts stored in Knowledge Graphs to bring insights to people during semantic search or question answering.

In order to build Knowledge Graphs, knowledge can be extracted in the form of triples from documents that are available in natural language. The transformation from text into a machine readable form is, therefore, a core task for building Knowledge Graphs. It can be broadly summarized as the goal of Machine Reading [2]. In the field of AI, Machine Reading is a long standing goal and is discussed in the research community under the term Information Extraction (IE). IE includes downstream tasks such as Named Entity Recognition (NER), Relation Extraction (RE) or Entity Linking (EL). In recent years, an unsupervised approach to Relation Extraction, namely Open Information Extraction, has shown promising results and is therefore, the main subjective of this paper. The paper is structured as follows: In Section 2 the previous work is outlined. Thereafter, in Section 3, the scientific approach applied for our research is described. In Section 4 the results are presented and discussed. Finally, in Section 5 a conclusion is drawn.

Related Work

Open Information Extraction is about extracting all possible triples from text, without knowing the relations or entities occurring in it a priori. The first Open IE system ever created is Text Runner [3], a learning-based system developed by a group of researchers at the University of Washington. In the years to follow, other systems were introduced, each attempting to improve on the results of the state-of-the-art by overcoming identified weaknesses and flaws in the systems. In addition, other types than learning-based system emerged, namely rulebased, clause-based and systems making use of inter-proposition relationships [4]. Rule-based systems entirely depend on hand-crafted rules or patterns. Systems that make use of this approach are, for example, KrakeN [5] or Exemplar [6]. In order to improve the precision of the systems described above, the idea of breaking down complex sentences into smaller components (clauses) came up. Two Clause-based Systems in particular are worth mentioning: ClausIE [7] and Stanford Open IE [8]. The system types mentioned so far have one common weakness. None of them is using the context and, therefore, a correct extraction cannot be guaranteed. Inter-Proposition-based Systems are trying to bridge this gap. Systems of this category, for example, are RelNoun [9], OpenIE4 [10], NestIE [11] and MinIE [12]. One of the first to apply neural networks to Open IE were [13] with RnnOIE. The scientists formulated the problem as a sequence labeling task. Recently proposed models that follow a sequence labeling approach are SenseOIE [14], SpanOIE [15] and iRankOIE [16]. A downside of this discovered by [17] is, however, that sequence labeling models are not able to change the sentence structure or use new auxiliary words in the extraction. [18] used a different neural approach called sequence generation to develop CopyAttention and overcome that downside. Furthermore, [19,20,17] describe end-to-end approaches using seq2seq models based on the encoder-decoder framework alleviating the downsides and the need for hand-crafted patterns.

Research Method

Dataset Creation

One of the main challenges within Open IE is to verify the quality of extractions made by the system. A solution to this is to have a dataset, that is to find qualitative training and testing data where triples are mapped to sentences. Currently, two approaches to automatically generate training data are considered particularly useful among researchers. First, the infobox-matching approach [21] where Wikipedia infobox values are linked to sentences in the corpus and second, the distantly supervised approach [22] where existing knowledge bases are used to heuristically map triples to sentences. For the creation of the German Wikipedia dataset, the infobox-matching approach is used. In total, 5 steps were executed to create a clean dataset including 1) finding and downloading a Wikipedia dataset 2) prepossessing and cleaning the text 3) matching all infobox triples to the correspond page text, 4) matching the triples on a sentence level and 5) filtering out noisy training examples. After the last step, a number of 6, 453 triples mapped to 5, 372 sentences containing 1, 324 relation types was derived. Furthermore, the average number of words used in each part was calculated. On average, the subject has 2.0, the predicate 1.0 and the object 1.5 words. Last but not least, the average length of a sentence was computed and amounts to 22.8 words. The dataset and the code used to create the dataset are published and freely accessible.2

Open IE Systems

In total, two systems were implemented and used for our research. The first system is turCy and we implemented turCy as a spaCy3 pipeline component to leverage the POS Tagger and Dependency Parser (DP). The second system uses an encoder-decoder seq2seq neural model that we call NeuralGerOIE.

TurCy

Research by [5] and [23] implies that with a decent amount of POS and DP patterns, a large variety of triples can be extracted -independently of any other constraints. TurCy is following a similar approach. In fact, it is a pattern learning system for binary extractions that is assembled of two essential components. The first is the Pattern Builder, the second is the Triple Extractor. A pattern consists of nodes that represent the POS-tags of a sentence. The relations between the nodes reflect the dependencies parse tree. A pattern with respect to a sentence can be used to represent exactly one triple. The pattern itself consists of subpatterns. Each subpattern represents a node in a tree and maps a word with left and right child nodes. For the sentence: "Im Jahr 2019 zählte Nürnberg 518370 Bewohner." the POS and dependency tree is shown in Figure 1.

The Triple Extractor of turCy is -at its core -a recursive sub-tree search using the Pattern List. The algorithm starts at the root node, traverses each path up to the leaf of the tree and checks whether a node and its edge match a sub-pattern of a respective pattern. If all sub-patterns match, the triple is assembled and stored with respect to a sentence. Notice, a match implies that at last one token of all three parts (subject, predicate, object) of a triple were found during the recursive sub-tree search in the sentence. However, the true nature of the algorithm is more complicated. If the interested reader wants to fully understand the working mechanism, we recommend diving into the code. Therefore, and also to ensure full transparency of our research, turCy has been packaged as a python library and is released under an open-source license.4 5 The latest state-of-the-art Open IE systems use seq2seq neural networks. Therefore, the second system developed follows the sequence generation approach. and was created using the Simple Transformers6 library. For training of the model, the WikiGerman4OIE dataset (3.1) was utilized. In addition, a pre-trained BART model for German was used. 7 It is important to mention that the model was trained to output multiple extractions within the scope of one subject.

NeuralGerOIE

We fine-tuned the model for 10 epochs using a batch size of 8. The maximum length of the input sentence was set to 300, i.e. all words thereafter were truncated. A difference between the seq2seq models discussed earlier and the approach described here, is the type of separator used. While [18] used start and end tags for each part of a triple (<arg1> Deep Learning </arg1><rel> is a subfield of </rel><arg2> Machine Learning </arg2>), we found that one token before each part along with a final end token was sufficient. In addition, the model struggled to output the same separator token multiple times within a sequence. Therefore, we added a number to each separator token. Lastly, we noticed that the names of the separator tokens affected the quality of the outputs. We proceeded with the following triple input: <sub> Deep Learning <rel0> is a subfield of <obj0> Machine Learning <end>. In total, the fine-tuning took about 2 hours 25 minutes on a Nvidia Tesla V100 32GB GPU to complete. The results are discussed in the subsequent section.

Results

Each of the Open IE systems is evaluated against a gold dataset. The reason for high-quality annotations originates from the need to obtain more accurate insights regarding the quality of the extractions. Therefore, a subset of the WikiGerman4OIE dataset (3.1) was annotated by two of the authors. In total, the gold dataset consists of 47 sentences and 175 triples. On average, a sentence contains 3.8 triples. This subset is the basis for the evaluation process. Regarding the quantitative metrics, precision, recall, and 𝐹 1 -score is computed using the token-based evaluation method introduced by [24] in a slightly adjusted manner to fit the systems outputs. The evaluation process is as follows: First, the full WikiGermanOIE dataset for turCy-large and the gold dataset for turCy-small was used to create the patterns with the Pattern Builder. The result were two pattern lists with sizes of 6,453 and 175 patterns, respectively, corresponding to the number of triples in each dataset. Second, the 47 sentences were fed as the only input into the Triple Extractor and the NeuralGerOIE prediction function. Lastly, precision, recall and 𝐹 1 -score were calculated. In general, we found that turCy-small achieves a better 𝐹 1 -score as NeuralGerOIE due to its ability to extract many triples. At the same time, we noticed that the NeuralGerOIE obtained a very high precision for sentences annotated with a single triple. A comparison between turCy-small and turCy-large (the only difference is in the number of patterns and their dataset of origin) indicates that, the quality of the automatically generated dataset is lower than initially expected. The reason for this assumption is that while trucy-large contains a high number of patterns build from the automatically generated dataset, only 29 triples were extracted. TurCy-small, on the other hand, yielded significantly more extractions with a lower number of patterns created from the gold dataset. In fact, the result is very counter-intuitive, as one would expect the number of extractions to be linearly correlated with the number of patterns. Moreover, when comparing turCy and NeuralGerOIE, we found that while turCy can only output words from the text in the extractions, the neural model can learn a direct representation between the words used in the text and the corresponding words in the gold dataset.

In addition to the quality analysis, we examined the run-time as Open IE systems are intended to process large amounts of text data at rapid pace. In order to make a judgment about the run-time, basically two metrics are taken into consideration. First, is the number sentences processed per second (# sent./sec.). Second, is the number of triples yielded per second (# triples/sec.). Furthermore, the ratio of these two metrics with respect to the number of stored patterns is of interest. Table 2 shows that turCy-small is the best performing system, followed by Neural-GerOIE. As expected, the number of patterns has a major impact on the run-time -as we can see with turCy-large -leading to the conclusion that one of the main objectives of rule-based Open IE systems is to keep the number of patterns as small as possible, but as large as necessary to maximize the number of extractions.

Conclusion & Outlook

In this paper the contribution is twofold. First, a dataset for Open Information Extraction based on German Wikipedia texts were created and published. Second, two different approaches for Open IE were implemented and evaluated. Several interesting research directions for future works can be recommended. We firmly believe that there is still potential for improvement in terms of dataset quality and quantity. For instance, in order to improve the quality, a crowd-sourcing platform such as Amazon Mturk could be leveraged. In addition, the distantly supervised approach for automated training data generation can be explored. In doing so, it would also help to determine what impact the applied approach for automated training data generation has on the quality of extractions made by the Open IE system. Finally, the two Open IE systems can be further optimized to yield better results. For example, for the rule-based system turCy, tree pruning approaches can be explored to reduce the overall number of patterns and therefore, improve the extraction speed. Regarding NeuralGerOIE multi-subject extractions, different architectures and the utilization of more recent language models might be considered.

Acknowledgments

Figure 1 :1Figure 1: POS-DP-Tree or pattern representation of a simple sentence. The colored nodes are the parts of the triple in the pattern, with orange being the subject(s), purple being the predicate(s), and blue being the object(s). All other nodes are colored in green.

Table 11System's accuracy analysis using default metrics.System# Patterns # Extractions # Matches Prec. Recall F1turCy-large6,4532921/1750.710.110.19turCy-small17511494/1750.830.520.64NeuralGerOIE-5240/1750.720.300.43

Table 22Run-time comparison between RE and Open IE.# System# Patterns # sent. /sec. # triples /sec.turCy-large6,4530.470.71turCy-small1751.122.7neuralGerOIE-1.141.14

https://github.com/ChrisDelClea/WikiGerman4OIE Library for advanced natural language processing: https://spacy.io https://github.com/ChrisDelClea/turCy https://github.com/ChrisDelClea/NeuralGermanOIE https://simpletransformers.ai/ https://huggingface.co/Shahm/bart-german

Acknowledgments

This research paper was created within the scope of the project: Software Campus 2.0 (FAU) Grant number 01IS17045. The project was funded by the German government, therefore, we would kindly thank them for their sponsorship.

The semantic web TBerners-Lee JHendler OLassila Scientific American 2001 Machine reading OEtzioni MBanko MCafarella 2007 Open information extraction from the web MBanko MCafarella SSoderland MBroadhead OEtzioni IJCAI International Joint Conference on Artificial Intelligence 2007 CNiklaus MCetto AFreitas SHandschuh arXiv:1806.05599 A survey on open information extraction 2018 arXiv preprint KrakeN: N-ary facts in open information extraction AAkbik ALöser Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX) the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX) 2012 FMesquita JSchmidek DBarbosa Effectiveness and efficiency of open relation extraction 2013 Clausie: clause-based open information extraction LDel Corro RGemulla Proceedings of the 22nd international conference on World Wide Web the 22nd international conference on World Wide Web 2013 Leveraging linguistic structure for open domain information extraction GAngeli MJ JPremkumar CDManning Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing Long Papers the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 2015 1 Demonyms and compound relational nouns in nominal open ie HPal Proceedings of the 5th Workshop on Automated Knowledge Base Construction the 5th Workshop on Automated Knowledge Base Construction 2016 Open information extraction systems and downstream applications MMausam Proceedings of the twenty-fifth international joint conference on artificial intelligence the twenty-fifth international joint conference on artificial intelligence 2016 Nested propositions in open information extraction NBhutani HJagadish DRadev Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing the 2016 Conference on Empirical Methods in Natural Language Processing 2016 Minie: minimizing facts in open information extraction KGashteovski RGemulla LDCorro Association for Computational Linguistics 2017 Supervised open information extraction GStanovsky JMichael LZettlemoyer IDagan Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Long Papers the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2018 1 Supervising unsupervised open information extraction models ARoy YPark TLee SPan Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 2019 Span model for open information extraction on accurate corpus JZhan HZhao Proceedings of the AAAI Conference on Artificial Intelligence the AAAI Conference on Artificial Intelligence 2020 34 ZJiang PYin GNeubig arXiv:1905.13413 Improving open information extraction via iterative rank-aware learning 2019 arXiv preprint KKolluru SAggarwal VRathore SChakrabarti arXiv:2005.08178 Imojie: Iterative memory-based joint open information extraction 2020 arXiv preprint LCui FWei MZhou arXiv:1805.04270 Neural open information extraction 2018 arXiv preprint Rebel: Relation extraction by end-to-end language generation P.-LHCabot RNavigli Findings of the Association for Computational Linguistics: EMNLP 2021 2021 KKolluru VAdlakha SAggarwal SChakrabarti arXiv:2010.03147 Openie6: Iterative grid labeling and coordination analysis for open information extraction 2020 arXiv preprint Open information extraction using wikipedia FWu DSWeld Proceedings of the 48th annual meeting of the association for computational linguistics the 48th annual meeting of the association for computational linguistics 2010 Distant supervision for relation extraction without labeled data MMintz SBills RSnow DJurafsky Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP 2009 Open language learning for information extraction MMausam SSchmitz RSoderland OBart Etzioni Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2012 Wire57: A fine-grained benchmark for open information extraction WLechelle FGotti PLanglais 2019