<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SRDF: Korean Open Information Extraction using Singleton Property</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sangha Nam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Younggyun Hahm</string-name>
          <email>hahmyg@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sejin Nam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Key-Sun Choi</string-name>
          <email>kschoi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Web Research Center</institution>
          ,
          <addr-line>KAIST</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we propose a new Korean Open Information Extraction system so-called SRDF. The SRDF system has been designed to effectively extract reified triples from Korean natural language texts based on the use of singleton property and other natural language processing techniques such as part-ofspeech tagging and chunking. The SRDF system is the Open Information Extraction system that enables extracting a multiple number of triples from a single sentence via reification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Traditional Information Extraction (IE) thus far has been relying heavily on human
intervention of hand-crafted rules and hand-tagged training data. In recent years, on the
other hand, Open IE based on self-supervised learning has become more strongly
suggested to overcome such a limitation, and it is now possible to process massive text
corpora without having to require much human effort. TextRunner [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], WOE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
ReVerb [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are some of the most representative examples of Open IE systems that offer
excellent performance in automatically extracting structured information from
unstructured natural language texts. Unfortunately, however, these systems cannot guarantee
the same level of performance on languages other than English. For that reason, the
Chinese Open IE system as an instance is currently being actively researched [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In addition to this, these systems also fall short of representing multiple relationships
between argument(s) and relation(s) within a sentence, since they are designed to focus
primarily, or rather restrictedly, on binary extractions. In other words, the recent Open
IE systems can extract only one triple with a single argument and relation respectively
per sentence, whereas many of the statements, especially those describing an event, are
generally inclusive of more than one argument such as time and location, and/or two or
more relations. This indeed has been one of the most principal challenges remained to
be addressed in the study of Open IE.
      </p>
      <p>
        Throughout the following sections of this paper, we introduce SRDF, the new Korean
Open IE system, in much greater details. Since the Korean language, in a variety of
respects, has uniquely different grammatical structures and the system of postposition
and word spacing compared to other languages like English and Chinese in particular,
our team has been devoted to develop a new Open IE system specially designed to meet
the characteristics of Korean. We, at the same time, have also strived to build a system
through which multiple relationships between argument(s) and relation(s) within a
sentence can be extracted by using singleton property – the new method of reification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Taking the singleton property approach to extracting reified triples from Korean natural
language texts is to minimize the number of triples, and to further allow the results of our
system to be compatible with well-known knowledge bases such as DBpedia and YAGO.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Korean Open Information Extraction using SRDF</title>
      <p>SRDF simply receives as input a Korean text corpus and returns an extracted set of
triples expressed in the form of singleton property. The system of SRDF operates
through three steps of procedure in total that are “preprocessing”, “argument and relation
detection”, and “triple generation” as described below.
2.1</p>
      <sec id="sec-2-1">
        <title>Preprocessing &amp; Argument and Relation Detection</title>
        <p>When a Korean sentence is given as input, the SRDF system performs part-of-speech
(POS) tagging and chunking first, as preprocessing.</p>
        <p>The POS-tagged and chunked Korean sentence is then passed on to the next stage of
argument and relation detection. This stage literally is to detect argument(s) and
relation(s) from the given Korean sentence, and is further divided into three smaller steps
similar to other Open IE systems based on self-supervised learning.
─ Labeling: At this stage, the preprocessed Korean sentence gets automatically labeled
based on three important factors that are “the POS tag patterns”, “the position of
words in sentence”, and “the postposition(s) within the sentence”.
─ Learning: An argument detection model and a relation detection model are learned
here using decision tree. The former model uses “lemma”, “POS tag”, “length of the
sentence”, “start position of argument”, “end position of argument”, “next lemma”
and “next POS tag” as features, and the features that the latter uses include “lemma”,
“POS tag” and “postposition”.
─ Extracting: Once a Korean sentence is received as input, the relation detection model
in the SRDF system classifies whether a certain word in the given sentence is a
relation or not, while the argument detection model classifies whether the word is a
subject or an object. After that, they return all the classification results that are
necessary for the next step of triple generation, including “the postposition(s) of
detected argument(s)” and “the position of detected argument(s) and relation(s) in
sentence”.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Triple Generation</title>
        <p>
          Our team has studied not only “how to extract information from Korean sentences” but
“how to generate triples for representing multiple relationships between argument(s)
and relation(s) within a sentence” as well. For example, the sentence ‘‘Barack
Obama was awarded the Nobel Prize in 2009.’’ is including multiple
relationships shared between one relation and two arguments, thus it is ideal that two
triples should be generated like &lt;Barack Obama, award, the Nobel
Prize&gt;, &lt;Barack Obama, award, in 2009&gt;. However, when we perform
information extraction on the above-mentioned sentence using the ReVerb [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
program for instance, only one triple &lt;Barack Obama, award, the Nobel
Prize&gt; is returned excluding the fact that Barack Obama was awarded “in 2009”. In
order to address this problem, we have adopted an approach that grafts the concept of
singleton property onto Open IE.
        </p>
        <p>As shown in Fig. 1 (blue = subject, orange = object, purple = relation), the method
of generating triples using SRDF is as follows:
1. Identify the association between argument(s) and relation(s) based on the
position of words in sentence. – Korean sentences have a different structure from
English. Whereas an English sentence typically has a Subject-Relation-Object word
order, the word order of Subject-Object-Relation is more common in Korean. In this
light, the SRDF system can infer that the object(s) in Korean sentences are
associated with the relation(s) located on the right side of them, and vice versa. In
effect, the objects “the Nobel Prize” and “2009-year” are associated with
the relation “award” on their right-hand side, as shown by the blue curved arrows in
Fig. 1.
2. Identify whether the postposition(s) attached to the object(s) of the sentence is
accusative. – In Korean, it is also common that almost every object is attached with a
postposition, and the postposition is considered a very important factor when
understanding syntax of the sentence. Among various postpositions, accusative
postposition “EUL” specifically indicate that the relation of the sentence is a transitive
verb. When the postposition attached to the object of the sentence is accusative, the
SRDF system generates a triple with the following form in general &lt;Barack
Obama, award#1, the Nobel Prize&gt;, in which the relation “award” is
attached with a provenance #1. In other cases where the relation of the sentence is
not a transitive verb and the postposition attached to the object is not accusative,
triples are made by the SRDF system in the anonymous form of &lt;subject,
relation#provenance, ANOMYOUMS&gt;. This method has an advantage of
enabling representation of sentences with no object in the form of triple.
3. Generate reified triples using remained objects. – The main triple &lt;Barack
Obama, award#1, the Nobel Prize&gt; has been made in the previous step
and, at this stage, “2009-year-E” should be reified. When generating a reified
triple, the SRDF system situates the relation of the main triple as the subject of the
reified triple, places the postposition as the relation, and lets the object be the object
as &lt;award#1, E#1, 2009-year&gt; for instance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>The performance of SRDF system has been evaluated by application to 100 Korean
sentences randomly sampled from the web as a testing data set. The evaluation results have
been assessed by two human evaluators based on the two criteria of Detection – how
precisely the SRDF system has detected the argument(s) and relation(s) from the given
sentence – and Triple Generation – how accurately the reified-triple has been generated
from the detected argument(s) and relation(s) –. The results and error statistics are
presented in Table 1 below. As shown in Table 1, the SRDF system is of an excellent
capability of both detecting argument(s) and relation(s) and generating triples, where the
performance of triple generation is relatively 18% lower. Having thoroughly examined the
failed sentences, we found out that most errors occur in the course of detection followed
by POS-tagging, and the least errors are made during the process of reification.</p>
      <sec id="sec-3-1">
        <title>Criteria</title>
      </sec>
      <sec id="sec-3-2">
        <title>Detection Triple Generation</title>
        <p>4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have demonstrated the feasibility of extracting structured information
from Korean natural language texts without any human intervention. We have also proposed
a novel method of combining Open IE with the singleton property technique in
representation of multiple relationships between argument(s) and relation(s) within a
sentence. Our project is still ongoing in active progress, and it is with great expectation for
our forthcoming researches to more technically expand the scope of our project. All the
expected accomplishments of the next phases of our project work will be made publicly
available through the website at http://143.248.135.216:8080/SRDFREST/index.htm.
Acknowledgement. This work was supported by Institute for Information &amp; communications
Technology Promotion(IITP) grant funded by the Korea government(MSIP) (No.
R0101-150054, WiseKB: Big data based self-evolving knowledge base and reasoning platform)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , et al.:
          <article-title>Open Information Extraction from the Web</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>51</volume>
          (
          <issue>12</issue>
          ),
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          :
          <article-title>Open Information Extraction using Wikipedia. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , et al.:
          <article-title>Open Information Extraction: The Second Generation</article-title>
          .
          <source>IJCAI</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tseng</surname>
            ,
            <given-names>Y. H.</given-names>
          </string-name>
          , et al.:
          <article-title>Chinese Open Relation Extraction for Knowledge Acquisition</article-title>
          .
          <source>EACL</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olivier</surname>
            <given-names>B.</given-names>
          </string-name>
          , and Amit S.:
          <string-name>
            <surname>Don't Like</surname>
          </string-name>
          RDF Reification?
          <article-title>Making Statements about Statements using Singleton Property</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web (</source>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>