<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Distant Supervision for Relation Extraction Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kijong Han</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sangha Nam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>YoungGyun Hahm</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiseong Kim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiho Kim</string-name>
          <email>hogajihog@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jin-Dong Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Key-Sun Choi</string-name>
          <email>kschoi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Center for Life Science</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea Advanced Institute of Science and Technology</institution>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning techniques have been applied to the relation extraction task and have demonstrated remarkable performance. However, the results of these approaches are difficult to interpret and are sometimes counter-intuitive. In this paper, we analyze the ontological and linguistic features of a relation extraction dataset and the pros and cons of existing methods for each feature type. This analysis could help in designing an improved method for relation extraction by providing more insight into the dataset and models.</p>
      </abstract>
      <kwd-group>
        <kwd>information extraction</kwd>
        <kwd>relation extraction</kwd>
        <kwd>dataset analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Relation Extraction (RE) is the task of extracting semantic triples, each consisting of an entity pair
and the relation between them, from unstructured natural language text.
Supervised learning approaches to RE require a large amount of labelled
training data, which takes considerable human effort. To address this problem,
the distant supervision (DS) method [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is now widely used. Despite its
usefulness, DS for RE has a drawback: because the labelled data are generated
automatically, some instances are wrongly labelled, which introduces noise.
      </p>
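      <p>As a concrete illustration of how distant supervision generates (possibly noisy) labels, the following sketch pairs knowledge-base triples with sentences by simple entity co-occurrence. The function and data are hypothetical, not the actual pipeline used to build the dataset in this paper.</p>

```python
# Hypothetical sketch of distant-supervision labeling: any sentence that
# mentions both entities of a KB triple is labeled with that relation.
# This is exactly the source of noise discussed above: co-occurrence
# does not guarantee that the sentence expresses the relation.

def distant_label(sentence, kb_triples):
    """Return (subj, relation, obj) labels for every KB triple whose
    entity pair co-occurs in the sentence."""
    labels = []
    for subj, rel, obj in kb_triples:
        if subj in sentence and obj in sentence:
            labels.append((subj, rel, obj))
    return labels

kb = [("Seoul", "capitalOf", "South Korea"),
      ("Lionel Messi", "team", "FC Barcelona")]
sent = "Seoul is the capital of South Korea."
print(distant_label(sent, kb))  # [('Seoul', 'capitalOf', 'South Korea')]
```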
      <p>
        Statistical machine learning and deep learning have been applied to mitigate
this noise, and have demonstrated remarkable performance
improvements [
        <xref ref-type="bibr" rid="ref4 ref7">4,7</xref>
        ]. However, the results of these approaches are difficult to interpret
and are sometimes counter-intuitive.
      </p>
      <p>
        Thus, to interpret the results of existing RE methods, we analyze the ontological
and linguistic features of an RE dataset and the pros and cons of existing methods
for each feature type. The analysis provides more insight into the dataset
and the characteristics of existing RE methods, and so could help in designing an
improved RE method. A convolutional neural network (CNN)-based method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and a
Markov logic network (MLN)-based method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are selected for analysis. Our
implementations are available at http://github.com/machinereading/re-cnn for
CNN, and http://github.com/machinereading/re-mln for MLN.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        We were inspired by a study that constructed and analyzed a dataset for the
recognizing textual entailment (RTE) task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. That study categorized an RTE dataset
according to linguistic phenomena, and analyzed it by applying previous RTE
methods. We conducted an analogous analysis suited to the RE task.
      </p>
      <p>
        We selected two existing RE methods for analysis. One is a CNN-based
method [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the other an MLN-based method [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We select CNN as a
representative of deep learning-based methods, and MLN as a representative of
methods that utilize logic rules and hand-crafted features. MLN is a model that
combines a Markov random field with weighted logic rules [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It represents
information as first-order logic predicates and formulas, e.g.,
HasFea(Di, write) ⇒ Label(Di, author): if data instance Di has the feature
word 'write', then the relation label of Di is author. Each formula has a weight
representing its confidence, and the weights are trained statistically from the
dataset. Using these weights, the model can calculate the probability that a
ground predicate is true. Because the model's logic rules carry weights, this
information is very useful for analyzing the dataset.
      </p>
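      <p>The intuition behind weighted formulas can be sketched in a few lines: each (feature, label) formula carries a learned weight, and a label's probability grows with the total weight of the formulas it satisfies. The weights and labels below are invented for illustration; they are not the formulas learned in our experiments.</p>

```python
import math

# Minimal sketch of MLN-style scoring: sum the weights of satisfied
# (feature, label) formulas per label, then normalize with a softmax
# to get the probability that each Label(Di, rel) predicate holds.
FORMULA_WEIGHTS = {
    ("write", "author"): 2.0,      # HasFea(Di, write) => Label(Di, author)
    ("capital", "capitalOf"): 2.5,
    ("write", "capitalOf"): -0.5,  # negative weight: evidence against
}
LABELS = ["author", "capitalOf"]

def label_probs(features):
    scores = [sum(FORMULA_WEIGHTS.get((f, lab), 0.0) for f in features)
              for lab in LABELS]
    z = sum(math.exp(s) for s in scores)
    return {lab: math.exp(s) / z for lab, s in zip(LABELS, scores)}

probs = label_probs({"write"})
print(max(probs, key=probs.get))  # 'author'
```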
    </sec>
    <sec id="sec-3">
      <title>Analysis Setup</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>
          We used the Korean DS-for-RE dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which was constructed from
Korean Wikipedia (2017.07) sentences and K-Box triples. K-Box is a knowledge
base extended from the Korean DBpedia. We randomly sampled from this dataset.
A total of 13,489 DS training data instances, 4,096 gold test data instances, and
the top 30 most frequent relations were used for this study. The gold test data
were constructed by removing wrongly labelled data from the DS output, a process
carried out by 14 part-time students hired by our research team. In this paper,
we refer to one sentence with a designated entity pair as 'a data' or 'a data instance'.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Method</title>
        <p>
          First, we analyzed the overall performance of the two methods. Second, we
selected and classified four features by analyzing the data manually, and we
investigated how important each feature type was for predicting the relation of a
data instance. These features are also used as MLN features. The features are as follows:
(1) Entity type: fine-grained entity type defined by K-Box, originating
from DBpedia ontology classes (e.g., the entity type of Lionel Messi
is Athlete). (2) Entity modifier: a modifier is a sentence component
modifying another component. For example, in the sentence 'John who is the author
of ...', 'author' is the clausal modifier of the entity 'John'. (3) Lemmas in a
dependency path: lemmas in the dependency path between an entity pair are an
important feature; many previous studies have also leveraged it [
          <xref ref-type="bibr" rid="ref1 ref4">1,4</xref>
          ]. (4)
Context lemmas: context lemmas are the lemmas of words that are neither
dependencies nor modifiers in the sentence.
Overall Performance. The best F1-score was 0.616 (CNN) and 0.611 (MLN),
and the accuracy was 0.584 (CNN) and 0.594 (MLN). The F1-score measures
how accurately the method extracts triples, as calculated in other
studies [
          <xref ref-type="bibr" rid="ref1 ref4">1,4</xref>
          ]. Accuracy was measured considering only the best prediction for each
data instance. Both models showed similar overall performance.
        </p>
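        <p>For clarity, triple-level F1 can be computed as precision over the predicted triples and recall over the gold triples; the following sketch uses invented triples, not our evaluation data.</p>

```python
# Sketch of triple-level F1: a predicted triple counts as correct only
# if it exactly matches a gold (subject, relation, object) triple.

def triple_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [("Seoul", "capitalOf", "South Korea"),
        ("Messi", "team", "PSG")]
gold = [("Seoul", "capitalOf", "South Korea"),
        ("Messi", "team", "FC Barcelona")]
print(round(triple_f1(pred, gold), 3))  # 0.5
```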
        <p>Importance of each Feature Type. We investigated the importance of
each feature type by measuring precision as a function of the formula weights in MLN.
In the MLN method, each prediction of a relation label for a data instance has a list of
weighted formulas affecting the prediction, as described in Section 2. The higher
the weight, the more important the feature in the formula, based on the training
data statistics. Each graph in Figure 1 is drawn considering only the weights of
a specific feature type in the calculation. On the X axis, the X% point represents
the portion of the data whose weight for that feature type lies in the top (X-10%, X%].
The Y axis represents the precision for that portion of the data, measured
by considering only the best prediction for each data instance. In all
graphs in Figure 1, the MLN curve shows lower performance than CNN in the
low-weight range and higher performance than CNN in the high-weight range;
the MLN curve correlates with weight more strongly than the CNN
curve. Thus, all four features are meaningful to some degree. The MLN curves
for entity type (a), entity modifier (b), and dependency lemmas (c) show
close to 1.0 precision on the top 0-20% highest-weight data. This means
that these three features are crucial for specific RE sentences.</p>
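        <p>The decile analysis above can be sketched as follows: sort predictions by their feature weight, split them into 10% bins, and measure precision within each bin. The data below is synthetic (correctness is simulated to correlate with weight), purely to illustrate the binning procedure.</p>

```python
import random

# Sketch of precision-per-weight-decile: given (weight, is_correct)
# pairs, sort by weight descending, cut into 10% bins, and compute the
# fraction correct in each bin.

def precision_by_weight_decile(preds, n_bins=10):
    """preds: list of (weight, is_correct) pairs."""
    ordered = sorted(preds, key=lambda p: p[0], reverse=True)
    size = max(1, len(ordered) // n_bins)
    bins = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    return [sum(c for _, c in b) / len(b) for b in bins]

random.seed(0)
# synthetic data: higher-weight predictions are correct more often
data = [(w, random.random() < w) for w in
        [random.random() for _ in range(1000)]]
curve = precision_by_weight_decile(data)
print(curve[0] > curve[-1])  # high-weight bins are more precise
```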
        <p>Simple N of N Pattern. We found a data pattern that is very intuitive for
determining a relation, but on which CNN does not work well. The pattern simply
consists of entity1, which is part of an 'N of N' phrase, and entity2, modified
by that 'N of N' phrase. Examples are shown in Figure 2. There are a total of 71
data instances of this pattern. In this pattern, the clue word (e.g., 'capital' in
Figure 2) is strong evidence for inferring the relation. MLN also utilizes this clue
word as strong evidence, because the word acts as both an entity-modifier and a
dependency-lemma feature. Thus, MLN shows higher performance on this
pattern than its overall performance, as shown in Figure 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        We analyzed the ontological and linguistic features of an RE dataset, as well as
the pros and cons of existing methods for each feature type. We expect that
the insights into the RE dataset gained in this study could help in designing an
improved RE method. For example, an important feature (e.g., entity
type) could be used as an additional discrete feature vector for the input, or a
high-precision rule derived from a pattern (e.g., the simple N-of-N pattern in
Section 4) could be combined into the neural network architecture by utilizing a model such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Acknowledgement. This work was supported by the Institute for Information &amp;
communications Technology Promotion (IITP) grant funded by the Korea government (MSIT)
(2013-0-00109, WiseKB: Big data based self-evolving knowledge base and reasoning
platform).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Global distant supervision for relation extraction</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Harnessing deep neural networks with logic rules</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kaneko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bekki</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Building Japanese textual entailment specialized data sets for inference of basic sentence relations</article-title>
          .
          <source>In: ACL</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mintz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bills</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snow</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Distant supervision for relation extraction without labeled data</article-title>
          .
          <source>In: ACL-IJCNLP</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>E.k.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          :
          <article-title>Distant supervision for relation extraction with multi-sense word embedding</article-title>
          .
          <source>In: GWC workshop on Wordnets and Word Embeddings</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine learning 62</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Relation classification via convolutional deep neural network</article-title>
          .
          <source>In: COLING</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>