SRDF: Korean Open Information Extraction using Singleton Property

Sangha Nam, Younggyun Hahm, Sejin Nam, and Key-Sun Choi
Semantic Web Research Center, KAIST, Korea
{nam.sangha, hahmyg, namsejin, kschoi}@kaist.ac.kr

Abstract. In this paper, we propose a new Korean Open Information Extraction system called SRDF. SRDF is designed to extract reified triples from Korean natural language text using the singleton property together with natural language processing techniques such as part-of-speech tagging and chunking. Through reification, it can extract multiple triples from a single sentence.

1 Introduction

Traditional Information Extraction (IE) has relied heavily on human intervention in the form of hand-crafted rules and hand-tagged training data. In recent years, Open IE based on self-supervised learning has been proposed to overcome this limitation, making it possible to process massive text corpora with little human effort. TextRunner [1], WOE [2] and ReVerb [3] are representative Open IE systems that perform well at automatically extracting structured information from unstructured natural language text. However, these systems cannot guarantee the same level of performance on languages other than English. For that reason, Open IE for other languages, Chinese for instance, is being actively researched [4]. In addition, these systems fall short of representing multiple relationships between argument(s) and relation(s) within a sentence, since they are designed to focus primarily, or rather restrictively, on binary extractions.
In other words, recent Open IE systems can extract only one triple per sentence, with a single argument and a single relation, whereas many statements, especially those describing an event, include more than one argument, such as time and location, and/or two or more relations. This remains one of the principal open challenges in the study of Open IE.

In the following sections, we introduce SRDF, the new Korean Open IE system, in greater detail. Because Korean differs from languages such as English and Chinese in its grammatical structure, its system of postpositions, and its word spacing, we have developed a new Open IE system designed specifically for the characteristics of Korean. At the same time, we have sought to build a system that can extract multiple relationships between argument(s) and relation(s) within a sentence by using the singleton property, a recent method of reification [5]. We take the singleton property approach to extracting reified triples from Korean natural language text in order to minimize the number of triples and to keep the results of our system compatible with well-known knowledge bases such as DBpedia and YAGO.

2 Korean Open Information Extraction using SRDF

SRDF receives a Korean text corpus as input and returns a set of extracted triples expressed in singleton property form. The system operates in three steps: "preprocessing", "argument and relation detection", and "triple generation", as described below.

2.1 Preprocessing & Argument and Relation Detection

When a Korean sentence is given as input, the SRDF system first performs part-of-speech (POS) tagging and chunking as preprocessing.
The POS-tagged and chunked Korean sentence is then passed on to the argument and relation detection stage. This stage detects the argument(s) and relation(s) in the given sentence and, as in other Open IE systems based on self-supervised learning, is divided into three smaller steps.

- Labeling: The preprocessed Korean sentence is automatically labeled based on three factors: "the POS tag patterns", "the position of words in the sentence", and "the postposition(s) within the sentence".
- Learning: An argument detection model and a relation detection model are learned using decision trees. The former uses "lemma", "POS tag", "length of the sentence", "start position of argument", "end position of argument", "next lemma" and "next POS tag" as features; the latter uses "lemma", "POS tag" and "postposition".
- Extracting: Once a Korean sentence is received as input, the relation detection model classifies whether each word in the sentence is a relation or not, while the argument detection model classifies whether the word is a subject or an object. The models then return all the classification results needed for the next step of triple generation, including "the postposition(s) of detected argument(s)" and "the position of detected argument(s) and relation(s) in the sentence".

2.2 Triple Generation

Our team has studied not only "how to extract information from Korean sentences" but also "how to generate triples that represent multiple relationships between argument(s) and relation(s) within a sentence". For example, the sentence "Barack Obama was awarded the Nobel Prize in 2009." contains multiple relationships between one relation and two arguments, so ideally two triples should be generated, like , .
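To make the target representation concrete, the following is a minimal sketch, not the SRDF implementation. It follows the construction described in the numbered steps later in this section: the relation is tagged with a provenance number, the accusative object forms the main triple, and each remaining object yields a reified triple whose subject is the tagged relation and whose relation is the object's postposition. The names `Triple` and `generate_triples`, the `award#1` string notation, and the romanized postpositions `EUL` and `E` as plain strings are assumptions made for illustration. The anonymous form used for non-accusative sentences is not covered.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def generate_triples(
    subject: str,
    relation: str,
    objects: List[Tuple[str, str]],  # (object text, romanized postposition)
    provenance_id: int,
    accusative: str = "EUL",  # assumed string label for the accusative postposition
) -> List[Triple]:
    """Build one main triple plus one reified triple per remaining object."""
    # Make the relation unique by attaching a provenance number (assumed notation).
    singleton = f"{relation}#{provenance_id}"

    main = [text for text, post in objects if post == accusative]
    if len(main) != 1:
        # The non-accusative (anonymous) form is outside the scope of this sketch.
        raise ValueError("sketch expects exactly one accusative object")

    triples = [Triple(subject, singleton, main[0])]
    for text, post in objects:
        if post != accusative:
            # Reified triple: tagged relation as subject,
            # postposition as relation, remaining object as object.
            triples.append(Triple(singleton, post, text))
    return triples


# Objects listed in an assumed Korean word order: time expression, then direct object.
example = generate_triples(
    subject="Barack Obama",
    relation="award",
    objects=[("2009-year", "E"), ("the Nobel Prize", "EUL")],
    provenance_id=1,
)
for triple in example:
    print(triple)
```

In this sketch the time expression is kept as a second triple attached to `award#1` instead of being dropped, which is the behavior a purely binary extractor lacks.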
However, when we perform information extraction on the Barack Obama sentence above using, for instance, the ReVerb [3] program, only one triple is returned, and the fact that Barack Obama was awarded the prize "in 2009" is lost. To address this problem, we have adopted an approach that grafts the concept of the singleton property onto Open IE.

Fig. 1. Example of Korean Open Information Extraction using SRDF

As shown in Fig. 1 (blue = subject, orange = object, purple = relation), the method of generating triples using SRDF is as follows:

1. Identify the association between argument(s) and relation(s) based on the position of words in the sentence. Korean sentences have a different structure from English sentences. Whereas an English sentence typically has Subject-Relation-Object word order, Subject-Object-Relation order is more common in Korean. The SRDF system can therefore infer that the object(s) in a Korean sentence are associated with the relation(s) located to their right, and vice versa. In the example, the objects "the Nobel Prize" and "2009-year" are associated with the relation "award" on their right-hand side, as shown by the blue curved arrows in Fig. 1.
2. Identify whether the postposition attached to the object(s) of the sentence is accusative. In Korean, almost every object has a postposition attached, and the postposition is a very important factor in understanding the syntax of the sentence. Among the various postpositions, the accusative postposition "EUL" specifically indicates that the relation of the sentence is a transitive verb. When the postposition attached to the object is accusative, the SRDF system generally generates a triple of the following form , in which the relation "award" is attached with a provenance #1.
In other cases, where the relation of the sentence is not a transitive verb and the postposition attached to the object is not accusative, the SRDF system generates triples in the anonymous form of . This method has the advantage of allowing sentences with no object to be represented as triples.
3. Generate reified triples using the remaining objects. The main triple has been made in the previous step, and at this stage "2009-year-E" should be reified. When generating a reified triple, the SRDF system places the relation of the main triple as the subject of the reified triple, the postposition as the relation, and the remaining object as the object, as for instance.

3 Experiment

The performance of the SRDF system was evaluated on a test set of 100 Korean sentences randomly sampled from the web. Two human evaluators assessed the results on two criteria: Detection (how precisely the SRDF system detected the argument(s) and relation(s) in the given sentence) and Triple Generation (how accurately the reified triples were generated from the detected argument(s) and relation(s)). The results and error statistics are presented in Table 1 below. As shown in Table 1, the SRDF system performs well at both detecting argument(s) and relation(s) and generating triples, although the F1-score of triple generation is 0.18 lower than that of detection (0.65 versus 0.83). Having examined the failed sentences, we found that most errors occur during detection, followed by POS tagging, and that the fewest errors are made during reification.

Table 1. Performance Evaluation and Error Statistics of SRDF

Performance
Criteria            Precision   Recall   F1-score
Detection           0.81        0.86     0.83
Triple Generation   0.66        0.65     0.65

Error Statistics
POS    Detection   Reification
0.15   0.74        0.11

4 Conclusion

In this paper, we have demonstrated the feasibility of extracting structured information from Korean natural language text without any human intervention. We have also proposed a novel method that combines Open IE with the singleton property technique to represent multiple relationships between argument(s) and relation(s) within a sentence. The project is ongoing, and we expect future work to expand its scope. Results from the next phases of the project will be made publicly available at http://143.248.135.216:8080/SRDFREST/index.htm.

Acknowledgement. This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-15-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

References

1. Etzioni, O., et al.: Open Information Extraction from the Web. Communications of the ACM 51(12), 68-74 (2008)
2. Wu, F., Weld, D. S.: Open Information Extraction using Wikipedia. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010)
3. Etzioni, O., et al.: Open Information Extraction: The Second Generation. IJCAI (2011)
4. Tseng, Y. H., et al.: Chinese Open Relation Extraction for Knowledge Acquisition. EACL (2014)
5. Nguyen, V., Bodenreider, O., Sheth, A.: Don't Like RDF Reification? Making Statements about Statements using Singleton Property. In: Proceedings of the 23rd International Conference on World Wide Web (2014)