<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building Large Scale Relation KB from Text</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>APEX Data &amp; Knowledge Management Lab, Shanghai Jiao Tong University</institution>
          <addr-line>800 Dongchuan Rd., Shanghai 200240</addr-line>
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, more and more structured data in the form of RDF triples have been published and integrated into Linked Open Data (LOD). While the current LOD contains hundreds of data sources with billions of triples, it has a small number of distinct relations compared with its large number of entities. On the other hand, Web pages are growing rapidly, resulting in a much larger amount of textual content to be exploited. With the popularity and wide adoption of open information extraction technology, extracting entities and the relations among them from text at Web scale has become possible. In this paper, we present an approach to extract the subject individuals and their object counterparts for relations from text, and to determine the most appropriate domain and range as well as the most confident dependency path patterns for a given relation based on the EM algorithm. As a preliminary result, we built a knowledge base for relations extracted from Chinese encyclopedias. The experimental results show the effectiveness of our approach in extracting relations with reasonable domain, range and path pattern restrictions as well as high-quality triples.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>Expectation Maximization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, many knowledge bases, such as DBpedia[1], YAGO[6] and Zhishi.me[5],
have been published on the Web in the form of linked data, and they are very useful for
both human reading and machine consumption. Compared with the number of distinct
entities in these knowledge bases, there are only a few distinct relations. Furthermore,
these knowledge bases only extract data from structured or semi-structured data sources,
without considering the implicit knowledge in unstructured text, which exists in a huge and
increasing amount on the Web. On the other hand, open information extraction, such as
Machine Reading[4] and Never-Ending Language Learning[2], focuses on extracting
entities and their relations from text at Web scale.</p>
      <p>In this paper, we are motivated to build a knowledge base of relations extracted
from text by leveraging open information extraction techniques. Moreover, for each
relation, we extract not only subject-object examples but also high-level restrictions
such as domains and ranges from text. Both kinds of information are quite useful for
describing relations, and they can be used for further natural language processing training or as
high-quality ontologies for additional extraction. To extract such information, we
propose a novel algorithm based on Expectation Maximization (EM), and we build an experimental
relation knowledge base to show the effectiveness of our algorithm.</p>
    </sec>
    <sec id="sec-2">
      <title>Building the Relation Knowledge Base</title>
      <p>As shown in Figure 1(a), the process of building the relation knowledge base has three main
steps: text annotation, candidate extraction and iterative score estimation.</p>
      <p>[Figure 1: (a) Architecture Overview; (b) Text Annotation / Candidate Extraction]</p>
      <p>The text annotation step includes tokenization (word segmentation for Chinese),
POS tagging and dependency parsing on top of the raw input text. The annotated text is
used as the input for the candidate extraction step.</p>
      <p>Following the concept of dependency grammar, we can directly extract candidate
examples from the annotated text. We identify the relations in the sentences according
to POS tags, and for each relation, we enumerate the nouns in its subtrees as subject or
object candidates. We also extract candidate restrictions in the candidate extraction
step. For each candidate example, we tag the head nouns of the subjects and objects to
form the candidate domain and range of the relations. We also extract candidate path
patterns from the subject-relation-object paths on the dependency trees. Figure 1(b) shows
two candidates, including examples and restrictions, extracted from an annotated
Chinese sentence meaning "the Municipal Museum is located in the Museum Square".</p>
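      <p>The candidate extraction step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Token structure, the POS tags ("NN", "VV") and the dependency labels ("nsubj", "dobj", "pobj") are simplified assumptions, and the toy parse stands in for the annotated Chinese sentence of Figure 1(b).</p>
      <preformat>
```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str  # surface form
    pos: str   # POS tag, e.g. "NN" (noun), "VV" (verb)
    head: int  # index of the head token (-1 for the root)
    dep: str   # dependency label to the head

def subtree(tokens, root_idx):
    """Indices of all tokens dominated by tokens[root_idx], inclusive."""
    result = {root_idx}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if tok.head in result and i not in result:
                result.add(i)
                changed = True
    return result

def extract_candidates(tokens):
    """Treat each verb as a candidate relation and enumerate nouns in its
    subject/object subtrees as candidate subject-object examples."""
    candidates = []
    for r, tok in enumerate(tokens):
        if tok.pos != "VV":
            continue
        subj_roots = [i for i, t in enumerate(tokens) if t.head == r and t.dep == "nsubj"]
        obj_roots = [i for i, t in enumerate(tokens) if t.head == r and t.dep in ("dobj", "pobj")]
        subjects = [i for h in subj_roots for i in subtree(tokens, h) if tokens[i].pos == "NN"]
        objects = [i for h in obj_roots for i in subtree(tokens, h) if tokens[i].pos == "NN"]
        for s in subjects:
            for o in objects:
                candidates.append((tokens[s].text, tok.text, tokens[o].text))
    return candidates

# Toy parse standing in for "the Municipal Museum is located in the Museum Square"
sent = [
    Token("Municipal_Museum", "NN", 1, "nsubj"),
    Token("located_in", "VV", -1, "root"),
    Token("Museum_Square", "NN", 1, "dobj"),
]
print(extract_candidates(sent))  # [('Municipal_Museum', 'located_in', 'Museum_Square')]
```
      </preformat>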
      <p>In the score estimation step, we group the candidate examples and restrictions by
relation and process each relation independently. For each relation v, we group the
candidates from one sentence into one candidate group. Suppose there are n candidate
groups for v and m<sub>i</sub> candidates in group c<sub>i</sub>. Each candidate c<sub>ij</sub> in a group contains the
extracted candidate subject s<sub>ij</sub> with tags T(s<sub>ij</sub>), the object o<sub>ij</sub> with tags T(o<sub>ij</sub>), and the path
pattern p<sub>ij</sub>. Figure 1(b) can be seen as a two-candidate group for the relation LocatedIn.</p>
      <p>We need a score function f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v) to judge whether the example (s<sub>ij</sub>, o<sub>ij</sub>) is
a credible subject-object pair for v. It is reasonable that the credibility of an example
is positively correlated with the credibility of its tags and path pattern, so we use Equation
(1) to estimate the score. Here N<sub>w</sub>(t) is the number of words having tag t. We also
introduce three functions f<sub>D</sub>(t<sub>d</sub> | v), f<sub>R</sub>(t<sub>r</sub> | v) and f<sub>P</sub>(p | v) to indicate the credibility of the
tag t<sub>d</sub> in the domain, the tag t<sub>r</sub> in the range, and the path pattern p for v.</p>
      <p>f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v) = f<sub>P</sub>(p<sub>ij</sub> | v) (∑<sub>t<sub>d</sub> ∈ T(s<sub>ij</sub>)</sub> f<sub>D</sub>(t<sub>d</sub> | v) / N<sub>w</sub>(t<sub>d</sub>)) (∑<sub>t<sub>r</sub> ∈ T(o<sub>ij</sub>)</sub> f<sub>R</sub>(t<sub>r</sub> | v) / N<sub>w</sub>(t<sub>r</sub>))   (1)</p>
      <p>Intuitively, f<sub>D</sub>(t<sub>d</sub> | v) should be the actual count of t<sub>d</sub> over the maximum possible
count. We assume that each candidate group is worth one count (which implies the
maximum possible count is n) and allocate it among the candidates in a group in proportion to
f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v). f<sub>R</sub> and f<sub>P</sub> are computed in almost the same way as f<sub>D</sub>. Details are
given in Equation (2).</p>
      <p>f<sub>L</sub>(t | v) = (1/n) ∑<sub>cond</sub> f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v) / ∑<sub>k=1</sub><sup>m<sub>i</sub></sup> f<sub>E</sub>(s<sub>ik</sub>, o<sub>ik</sub> | v),   (2)
where (L, t, cond) ∈ {(D, t<sub>d</sub>, t<sub>d</sub> ∈ T(s<sub>ij</sub>)), (R, t<sub>r</sub>, t<sub>r</sub> ∈ T(o<sub>ij</sub>)), (P, p, p = p<sub>ij</sub>)}.</p>
      <p>According to Equations (1) and (2), the scores of the examples and the restrictions
are interdependent, so it is natural to design an EM[3] algorithm to estimate them
simultaneously:
1. Initialize f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v) = 1/m<sub>i</sub>.
2. Update f<sub>D</sub>, f<sub>R</sub>, f<sub>P</sub> and f<sub>E</sub> iteratively until convergence:
(a) E-step: for every tag and path pattern, update f<sub>D</sub>(t<sub>d</sub> | v), f<sub>R</sub>(t<sub>r</sub> | v) and f<sub>P</sub>(p | v) using Equation (2).
(b) M-step: for every example, update f<sub>E</sub>(s<sub>ij</sub>, o<sub>ij</sub> | v) using Equation (1).</p>
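      <p>The EM procedure above can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the authors' code: it handles a single relation v, takes candidates as (subject tags, object tags, path pattern) tuples, assumes the tag counts N<sub>w</sub>(t) are supplied externally, and runs a fixed number of iterations instead of testing convergence; em_scores and the toy inputs are illustrative names.</p>
      <preformat>
```python
def em_scores(groups, n_w, iterations=20):
    """groups[i] is the list of candidates in group c_i; each candidate is
    (subj_tags, obj_tags, path). n_w maps a tag t to N_w(t).
    Returns (f_E, f_D, f_R, f_P)."""
    n = len(groups)
    # Step 1: initialize f_E(s_ij, o_ij | v) = 1 / m_i
    f_e = [[1.0 / len(g)] * len(g) for g in groups]
    f_d, f_r, f_p = {}, {}, {}
    for _ in range(iterations):
        # E-step, Equation (2): each group is worth one count, allocated
        # among its candidates in proportion to f_E, then averaged over n.
        f_d, f_r, f_p = {}, {}, {}
        for i, group in enumerate(groups):
            total = sum(f_e[i]) or 1.0
            for j, (s_tags, o_tags, path) in enumerate(group):
                share = f_e[i][j] / total / n
                for t in s_tags:
                    f_d[t] = f_d.get(t, 0.0) + share
                for t in o_tags:
                    f_r[t] = f_r.get(t, 0.0) + share
                f_p[path] = f_p.get(path, 0.0) + share
        # M-step, Equation (1): score each example from the credibility of
        # its path pattern, domain tags and range tags.
        for i, group in enumerate(groups):
            for j, (s_tags, o_tags, path) in enumerate(group):
                dom = sum(f_d[t] / n_w[t] for t in s_tags)
                rng = sum(f_r[t] / n_w[t] for t in o_tags)
                f_e[i][j] = f_p[path] * dom * rng
    return f_e, f_d, f_r, f_p

# Toy data: two sentence groups for a hypothetical relation LocatedIn.
groups = [
    [(["City"], ["Square"], "nsubj-dobj"),
     (["Person"], ["Square"], "nsubj-prep-pobj")],
    [(["City"], ["Square"], "nsubj-dobj")],
]
n_w = {"City": 10, "Person": 50, "Square": 10}
f_e, f_d, f_r, f_p = em_scores(groups, n_w)
# The candidate supported across both groups ends up with the higher score.
print(f_e[0][0], f_e[0][1])
```
      </preformat>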
      <p>Finally, we can build the relation knowledge base from the examples (subject-object
pairs) and the restrictions (tags for the domain and range, as well as path patterns) for each
relation according to the scores of these candidates.</p>
    </sec>
    <sec id="sec-results">
      <title>Preliminary Results on Chinese Encyclopedias and Zhishi.me</title>
      <p>We built an experimental relation knowledge base from the abstract text of all the entries
in three online Chinese encyclopedias: Wikipedia in Chinese, Baidu Baike and Hudong
Baike. There are 2,517,826 pieces of text and 20,637,524 sentences in total.</p>
      <p>In addition to the text, we use the property zhishi:category from Zhishi.me, a Chinese
Linked Open Data source, which extracts the categories of the entities from the
above three online encyclopedias. We used these categories to tag the extracted entities. There are
a total of 985 tags and 3,109,448 words that have at least one of these tags.</p>
      <p>Our experimental relation knowledge base contains 7,097 relations extracted from the text in
total. On average, each relation has 107.46 tags in its domain, 103.13 tags in its range, 9.68 path
patterns and 276.03 subject-object pairs.</p>
      <p>We sampled 40 relations for human evaluation. For each relation, the domain, the
range and the examples can be seen as three lists ranked by the scores of the
elements in them. We use average precision (the mean of the precision values
over the top k elements, computed each time a relevant/correct element is retrieved) as the metric.
The mean average precision is 0.788 for the domain (top 5), 0.865 for the range (top 5),
and 0.657 for the examples (top 15). The average precision distribution over the sampled
relations is shown in Figure 2.</p>
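      <p>The average-precision metric described above can be computed as follows; average_precision and the example relevance list are illustrative, not taken from the evaluation data.</p>
      <preformat>
```python
def average_precision(relevant, k=None):
    """Average precision for a ranked list. `relevant` holds booleans,
    True where the element at that rank is relevant/correct. Precision is
    taken at each rank where a correct element is retrieved (over the
    top-k elements if k is given), then averaged."""
    if k is not None:
        relevant = relevant[:k]
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# e.g. a ranked domain list whose elements at ranks 1, 2 and 4 are correct:
print(average_precision([True, True, False, True, False], k=5))  # ≈ 0.917
```
      </preformat>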
      <p>We have found that errors are mainly caused by the following reasons: (1) some words
have wrong tags; (2) some relation phrases are not actually relations in the sentence
(they may just be modifiers of nouns); (3) some extracted noun phrases are incomplete or linked
to wrong entities. These kinds of errors indicate how to improve the quality of our
relation knowledge base in the future (e.g. through better tag cleaning, relation identification,
and entity linking and disambiguation).</p>
      <p>We also provide a simple web site where users can browse our experimental relation
knowledge base at http://relationkb.apexlab.org. There are two services
on the web site: a lookup service and a SPARQL query service.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we leverage open information extraction and propose an EM algorithm
to build a knowledge base which contains the examples and restrictions of
relations extracted from text. We also give some preliminary results in Chinese showing the effectiveness
of our algorithm. In the future, we plan to use more advanced methods to identify
relations and better entity linking algorithms to improve the quality of the relation
knowledge base. In addition, it would be better if we added some structure over the
relations, such as relation clusters or relation hierarchies. Finally, we are also planning
to use our relation knowledge base to populate more linked data from text and
to update the relation knowledge base itself while populating.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: a nucleus for a web of open data</article-title>
          .
          <source>In: ISWC/ASWC</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betteridge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kisiel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hruschka Jr.</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          :
          <article-title>Toward an architecture for never-ending language learning</article-title>
          .
          <source>In: AAAI</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dempster</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laird</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B (Methodological)</source>
          (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Machine reading</article-title>
          .
          <source>In: AAAI Spring Symposium</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Zhishi.me: weaving Chinese linking open data</article-title>
          .
          <source>In: ISWC</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO: a core of semantic knowledge</article-title>
          .
          <source>In: WWW</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>