<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wikipedia Mining for Triple Extraction Enhanced by Co-reference Resolution</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>The Center for Knowledge Structuring, The University of Tokyo</institution>
          ,
          <addr-line>7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikipedia has become a huge database covering a wide range of human knowledge, which makes it a promising corpus for knowledge extraction. A considerable number of studies on Wikipedia mining have confirmed its value as a corpus. Wikipedia's notable characteristics are not limited to its scale; they also include its dense link structure, URIs for word sense disambiguation, well-structured Infoboxes, and the category tree. Previous research in this area has widely used the category tree to extract semantic relations among concepts in Wikipedia. In this paper, we extract triples (Subject, Predicate, Object) from Wikipedia article texts, another promising resource for knowledge extraction. We propose a practical method that integrates link structure mining and parsing to improve extraction accuracy. The proposed method has two technical novelties: two parsing strategies and a co-reference resolution method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Although the importance of ontology construction is widely recognized and
a considerable number of Semantic Web implementations based on standardized
formats (such as RDF and OWL) are being built and published on the WWW,
mappings between ontologies are still lacking because of the distributed nature
of the environment. Since it is difficult to map local ontologies to one another
one by one, a global ontology that mediates between local ontologies seems to be
a promising solution. However, previous methods for constructing huge-scale
ontologies faced practical difficulties, since the human resources needed to
manage a huge-scale global ontology were not available.</p>
      <p>
        Meanwhile, Wikipedia, a collaborative wiki-based encyclopedia, has become a
phenomenon among Internet users. According to a comparison published in Nature, Wikipedia
is about as accurate in covering scientific topics as the Encyclopedia Britannica.
It covers concepts from various fields such as arts, geography, history, science,
sports, and games. Wikipedia contains more than 2 million articles (Oct. 2007,
English Wikipedia) and is growing day by day, while the largest
paper-based encyclopedia, Britannica, contains only 65,000 articles. As a corpus for
knowledge extraction, Wikipedia’s notable characteristics are not limited to
its scale; they also include its dense link structure, URL-based sense
disambiguation, brief link texts (a.k.a. anchor texts) and well-structured sentences. The
value of these characteristics for extracting accurate knowledge from
Wikipedia has been confirmed by a number of previous studies on Wikipedia
mining [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7 ref8">1–8</xref>
        ]. In addition, we previously proposed a scalable link structure mining method to
extract a huge-scale association thesaurus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In that
research, we developed a huge-scale association thesaurus that returns
a list of related terms for any given term. Further, a number of detailed
experiments showed that the association thesaurus achieves notable
accuracy. However, association thesaurus construction is just the first step
toward our more ambitious goal: huge-scale Web ontology construction from
Wikipedia.
      </p>
      <p>
        Semantic Wikipedia [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is an impressive solution for developing a huge scale
ontology on Wikipedia. Semantic Wikipedia is an extension of Wikipedia which
allows editors to define semantic relations among concepts manually. Another
major approach is to use Wikipedia’s category tree as an ontology [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. These
researchers proved that Wikipedia’s categories are promising resources for
ontology construction by showing significant results.
      </p>
      <p>In contrast to these approaches, we propose a fully automated, consistent
approach for semantic relation extraction by mining Wikipedia articles. Since a
Wikipedia article is a set of definitive sentences, the article text is yet another
valuable resource for ontology construction. However, co-reference resolution is
one of the serious technical issues for this aim, since many abbreviations,
pronouns and different expressions are used to refer to an entity in a Wikipedia
article. Therefore, we propose a co-reference resolution method based on synonym
information and a further improvement based on important sentence detection.</p>
      <p>The rest of this paper is organized as follows. In section 2, we review a
number of studies on Wikipedia mining for knowledge extraction in order
to make our stance clear. In section 3, we describe our proposed integration
method based on parsing and link structure mining. We describe the results of
our experiments in section 4. Finally, we draw a conclusion in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <sec id="sec-2-1">
        <title>Relation Acquisition from Text Corpora</title>
        <p>
          In the statistical NLP research area, a significant number of studies on
relation acquisition from large-scale text corpora have been conducted. For instance,
Hearst [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] pointed out that lexico-syntactic
patterns (mainly for the is-a relation) can be extracted from large-scale corpora.
Berland and Charniak [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed similar methods for part-whole
relations. Kim and Baldwin [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] focused on nominal relations in compound nouns.
        </p>
        <p>These studies target ordinary text corpora or Web corpora. To apply
their methods to Wikipedia, we need to take its characteristics into account
and adapt the methods, since Wikipedia has various unique
characteristics compared with other corpora.</p>
        <p>
          <bold>Wikipedia Mining.</bold>
“Wikipedia mining” is a new research area that has recently attracted attention. Semantic
relatedness measurement has already been studied well [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
          ].
WikiRelate [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is one of the pioneers in this research area. The algorithm finds the
shortest path between the categories to which two concepts belong in the category
tree. As a relatedness measure for two given concepts, it works well. However,
it is impractical for extracting all related terms for all concepts, because we would have to
search all combinations of category pairs for all concept pairs (2 million × 2
million). Therefore, in our previous research, we proposed pfibf (Path Frequency -
Inversed Backward Link Frequency)<sup>1</sup>, a scalable association thesaurus
construction method that measures relatedness among concepts in Wikipedia. The basic
strategy of pfibf is quite simple. The relatedness between two articles <italic>v</italic><sub>i</sub> and <italic>v</italic><sub>j</sub>
is assumed to be strongly affected by the following two factors:
          <list list-type="bullet">
            <list-item><p>the number of paths from article <italic>v</italic><sub>i</sub> to <italic>v</italic><sub>j</sub>,</p></list-item>
            <list-item><p>the length of each path from article <italic>v</italic><sub>i</sub> to <italic>v</italic><sub>j</sub>.</p></list-item>
          </list>
        </p>
        <p>The relatedness is strong if there are many paths (that is, many shared intermediate
articles) between two articles. In addition, the relatedness is affected by the path
length: if two articles are placed close together in the link graph
of the Web site, their relatedness is estimated to be higher than that of articles
farther apart. Therefore, using all paths from <italic>v</italic><sub>i</sub> to <italic>v</italic><sub>j</sub>, given as <italic>T</italic> = {<italic>t</italic><sub>1</sub>, <italic>t</italic><sub>2</sub>, ..., <italic>t</italic><sub>n</sub>}, the
relatedness pf (Path Frequency) between them and the resulting pfibf score are defined as follows:</p>
        <disp-formula><label>(1)</label><tex-math><![CDATA[\mathit{pf}(v_i, v_j) = \sum_{k=1}^{n} \frac{1}{d(|t_k|)}]]></tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math><![CDATA[\mathit{pfibf}(v_i, v_j) = \mathit{pf}(v_i, v_j) \cdot \log \frac{N}{\mathit{bf}(v_j)}]]></tex-math></disp-formula>
        <p>
          Here, <italic>d</italic>() denotes a function whose value increases with the length of path
<italic>t</italic><sub>k</sub>, <italic>N</italic> denotes the total number of articles, and bf(<italic>v</italic><sub>j</sub>) denotes the number of
backward links of the page <italic>v</italic><sub>j</sub>. Wikipedia Thesaurus [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]<sup>2</sup> is an association
thesaurus search engine backed by pfibf. It provides over 243 million
relations for 3.8 million concepts in Wikipedia.
        </p>
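        <p>Equations (1) and (2) can be illustrated with a minimal sketch. It is not the scalable implementation behind Wikipedia Thesaurus: the toy link graph, the hop limit max_len, and the choice d(l) = l are assumptions made only for this example, since the text states only that d() increases with path length.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict, List

Graph = Dict[str, List[str]]  # article -> articles it links to (forward links)


def simple_paths(graph: Graph, src: str, dst: str, max_len: int) -> List[List[str]]:
    """Enumerate loop-free link paths from src to dst with at most max_len hops."""
    found: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        hops = len(path) - 1
        if hops > max_len:
            return
        if node == dst and hops > 0:
            found.append(path)
            return
        for nxt in graph.get(node, []):
            if nxt not in path:
                walk(nxt, path + [nxt])

    walk(src, [src])
    return found


def d(length: int) -> float:
    """Assumed penalty: simply the hop count (any increasing function would do)."""
    return float(length)


def pf(graph: Graph, vi: str, vj: str, max_len: int = 2) -> float:
    """Equation (1): sum of 1/d(|t_k|) over all paths t_k from vi to vj."""
    return sum(1.0 / d(len(t) - 1) for t in simple_paths(graph, vi, vj, max_len))


def bf(graph: Graph, v: str) -> int:
    """Number of backward links of v (articles that link to v)."""
    return sum(1 for targets in graph.values() if v in targets)


def pfibf(graph: Graph, vi: str, vj: str, max_len: int = 2) -> float:
    """Equation (2): pf weighted by log(N / bf(vj)); N is the number of articles."""
    backward = bf(graph, vj)
    if backward == 0:
        return 0.0
    return pf(graph, vi, vj, max_len) * math.log(len(graph) / backward)


# Hypothetical toy graph; every article must appear as a key so that N is correct.
toy = {
    "Google": ["Search engine", "PageRank"],
    "PageRank": ["Search engine"],
    "Search engine": [],
    "Tokyo": ["Google"],
}
print(pf(toy, "Google", "Search engine"))
print(pfibf(toy, "Google", "Search engine"))
```
]]></preformat>
        <p>In this toy graph, “Google” reaches “Search engine” through one direct link and one two-hop path, so both the number of paths and their lengths contribute to the score, while the log factor discounts targets that are linked from many pages.</p>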
      </sec>
      <sec id="sec-2-2">
        <title>Wikipedia and Web Ontology</title>
        <p>
          Semantic Wikipedia [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is one of the pioneers that pointed out the effectiveness of
Wikipedia-style editing for building a huge ontology covering a wide range of topics.
Semantic Wikipedia is an extension of Wikipedia which allows editors to define
relations among concepts manually. The contribution of Semantic Wikipedia is
that it showed a new direction for achieving the vision of the Semantic Web. While
Semantic Wikipedia is a promising approach to huge-scale Web ontology
construction, it requires human effort. Therefore, we try to develop a
completely automated method without any additional human effort, since Wikipedia articles
already include rich semantic relations.
        </p>
        <p>
          Another interesting approach is to use Wikipedia’s category tree as an
ontology [
          <xref ref-type="bibr" rid="ref12 ref7">7, 12</xref>
          ]. Many previous studies on Wikipedia mining were based on
category tree analysis, since Wikipedia categories are a
promising resource for ontology construction. For instance, DBpedia [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] uses
several types of information in Wikipedia, such as Infoboxes, article texts and categories,
to extract structured knowledge and provide Web APIs.
        </p>
        <p><sup>1</sup> The method name was formerly lfibf and was later changed to pfibf.</p>
        <p><sup>2</sup> http://wikipedia-lab.org:8080/WikipediaThesaurusV2</p>
        <p>In this research, in contrast to these approaches, we developed a fully automated,
consistent approach for semantic relation extraction by mining Wikipedia
article texts. Wikipedia article texts are a promising resource for extracting semantic
relations, but only a small number of studies have been conducted in this area.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Characteristics of Wikipedia</title>
        <p>As a Web corpus for knowledge extraction, one of the most notable characteristics
of Wikipedia is the use of URLs for word sense disambiguation. In Wikipedia, almost
every page (article) corresponds to exactly one concept and has its own URL.
For example, the concept apple as a fruit has a Web page and its
own URL. The computer company Apple also has its own URL, so these
concepts are semantically separated. This means that it is possible to analyze
term relations while avoiding problems caused by ambiguous terms or context.</p>
        <p>
          Hyperlinks do not just provide a jump function between pages; they carry more
valuable information than one might expect. There are two types of links: “forward links”
and “backward links”. A “forward link” is an outgoing hyperlink from a Web
page, and an incoming link to a Web page is called a “backward link”. Research on
Web structure mining, such as Google’s PageRank [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and Kleinberg’s HITS
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], emphasizes the importance of backward links for extracting objective
and trustworthy data. “Link texts” also contain valuable information.
        </p>
        <p>Link texts in Wikipedia are brief, clear and simple compared
with those of ordinary Web sites. Among the authors of Wikipedia, it is
common practice to use the title of an article as the link text, but authors can also
give other link texts to an article. This leads to another
important characteristic, the “variety of link texts,” which can be used to extract
valuable information. Notably, link texts contain hardly any
superfluous words in most cases. Since no link text data is available
in the Wikipedia dump data, we customized the Wiki parser engine of Wikipedia
to extract the link text data.</p>
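        <p>The relationship between forward links, backward links and link texts can be sketched as follows. This is only an illustration: the data layout (a mapping from a source article to pairs of target article and link text) and the sample entries are assumptions, not the format produced by our customized Wiki parser.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (other article, link text)


def backward_index(forward: Dict[str, List[Link]]) -> Dict[str, List[Link]]:
    """Invert forward links: for each target, list (source article, link text)."""
    backward: Dict[str, List[Link]] = defaultdict(list)
    for source, links in forward.items():
        for target, text in links:
            backward[target].append((source, text))
    return dict(backward)


# Hypothetical forward links: source article -> [(target article, link text)].
forward = {
    "Macintosh": [("Apple Computer", "Apple")],
    "IPod": [("Apple Computer", "Apple Computer, Inc"), ("Music", "music")],
    "Fruit": [("Apple", "apple")],
}
index = backward_index(forward)
print(index["Apple Computer"])
```
]]></preformat>
        <p>Because the fruit and the company are separate articles with separate URLs, their backward links (and the link texts attached to them) never mix, which is what makes the link texts usable for later analysis.</p>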
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposed method</title>
      <p>In order to extract semantic relations from Wikipedia, we propose a method
that analyzes both the Wikipedia article texts and link structure. Basically, the
proposed method extracts semantic relations by parsing texts and analyzing the
structure tree generated by a parser. However, parsing all sentences in an article
is not efficient since an article contains both valuable sentences and non-valuable
sentences. We assume that it is possible to improve accuracy and scalability by
analyzing only important sentences on the page. Furthermore, we use synonyms
to enhance co-reference resolution. In a Wikipedia article, usually a number of
abbreviations, pronouns and different expressions are used to point to an entity,
thus co-reference resolution is one of the technical issues in order to make the
parsing process accurate.</p>
      <p>
        Figure 1 shows the overall flow of the proposed method. The method consists
of three main phases: parsing, link (structure) analysis, and integration. First,
for a given Wikipedia article, the method extracts a list of related terms for the
article using pfibf [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At the same time, it obtains synonyms by analyzing
the link texts of the backward links of the article. Second, the method analyzes the
article text to extract explicit semantic relations among concepts by parsing the
sentences. Finally, in the integration phase, three steps for triple extraction are
conducted: 1) analyzing the structure tree generated by the parser, 2)
filtering important semantic information using parsing strategies, and 3) resolving
co-references by using synonyms. The main steps of the proposed method are
described below.
      </p>
      <sec id="sec-3-1">
        <title>Synonym Extraction</title>
        <p>We first describe how the synonyms used for co-reference resolution are extracted
from anchor texts. Synonyms are different expressions that share one meaning.
Since the backward links of a Web page exhibit a “variety of backward link texts,” this
variety can be used to extract synonyms of a concept (article). For instance,
the computer company “Apple” is sometimes referred to as “Apple”, but it is
also written as “Apple Computer, Inc,” “Apple Computers,” etc.
Table 1 shows a number of randomly chosen examples of synonym terms.</p>
        <p>The article “Apple Computer” has 1,191 backward links with the link text
“Apple Macintosh” and 301 backward links with the link text “Mac.” This shows
that both terms are typical synonyms for the concept “Apple Computer.”
Our statistics revealed that backward link text analysis can extract high-quality
synonyms when a threshold is specified to filter out noisy data such as ’international
standard’ and ’ISOs’ for ISO.</p>
        <p>Synonyms are helpful for detecting whether two sentences
describe the same subject. In other words, this information is needed for
co-reference resolution. For example, there is an article about “United Kingdom”
in Wikipedia and it contains “UK” many times. However, if the machine does
not know that “UK” is a synonym of “United Kingdom,” it cannot extract
many relations on the topic. Therefore, we use the extracted synonyms in the
following steps to improve the coverage.</p>
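        <p>For a given article <italic>a</italic> and a synonym candidate <italic>s</italic>, we define a simple scoring
function syn(<italic>a</italic>, <italic>s</italic>) as follows:</p>
        <disp-formula><label>(3)</label><tex-math><![CDATA[\mathit{syn}(a, s) = \frac{\log \mathit{num\_bk}(a, s)}{\log \mathit{num\_bk}(a, *)}]]></tex-math></disp-formula>
        <p>syn(<italic>a</italic>, <italic>s</italic>) measures the popularity of a label for the concept as the
ratio between the link texts and the total backward links, both on a log scale. num_bk(<italic>a</italic>, <italic>s</italic>) is the
number of backward links of <italic>a</italic> with link text <italic>s</italic>, and num_bk(<italic>a</italic>, ∗) is the total number
of backward links of <italic>a</italic>. We determined a threshold on syn(<italic>a</italic>, <italic>s</italic>) to filter out irrelevant
synonyms using 200 training examples evaluated manually.</p>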
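        <p>The scoring function in Equation (3) can be sketched in a few lines. The link-text counts and the threshold value 0.4 below are hypothetical and chosen only to make the example readable; they are not the data or the threshold used in our experiments.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def syn_scores(link_texts: List[str]) -> Dict[str, float]:
    """Equation (3): log(num_bk(a, s)) / log(num_bk(a, *)) for each link text s.

    link_texts holds one entry per backward link of article a.
    """
    total = len(link_texts)
    if total < 2:  # log(1) == 0 would make the ratio undefined
        return {}
    counts = Counter(link_texts)
    return {text: math.log(n) / math.log(total) for text, n in counts.items()}


def synonyms(link_texts: List[str], threshold: float) -> List[str]:
    """Keep link texts whose score reaches the (assumed) threshold."""
    scores = syn_scores(link_texts)
    return sorted(text for text, score in scores.items() if score >= threshold)


# Hypothetical backward link texts of one article (10 backward links in total).
texts = ["Apple"] * 6 + ["Apple Computer, Inc"] * 3 + ["fruit company"]
print(synonyms(texts, threshold=0.4))
```
]]></preformat>
        <p>A link text that occurs only once always scores 0, so any positive threshold removes such one-off labels, which matches the intent of filtering noisy link texts.</p>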
      </sec>
      <sec id="sec-3-2">
        <title>Preprocessing</title>
        <p>Since the structure and syntax of Wiki text differ greatly from natural
language, we need to adapt the parsing process to the special syntax,
which includes HTML tags, to achieve better accuracy. Special Wiki
command tags, such as triple quotation marks, brackets for hyperlinks, and tables, prevent
correct parsing. However, this kind of markup is also helpful
for analyzing the content, since it contains hyperlinks and information that helps to
group words into semantic chunks. Therefore, we built our own preprocessor
to achieve better accuracy. The preprocessor first trims the Wikipedia
article to remove unnecessary information such as HTML tags and special Wiki
commands. It also removes table tags, because the contents of tables are usually
not sentences. However, it does not remove link tags (“[[...]]”), because links in
Wikipedia are explicit relations to other pages and we use the link information
in the following steps. Finally, phrases in quotations and link tags are tagged as
nouns to help the following parsing step.</p>
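        <p>A rough sketch of such a preprocessor is shown below. The regular expressions, the handling of piped links (keeping the link target), and the use of underscores to chunk multi-word phrases into a single token are assumptions for illustration; the actual preprocessor handles more Wiki commands than this.</p>
        <preformat><![CDATA[
```python
import re


def _chunk(phrase: str) -> str:
    """Join a multi-word phrase into one token so a parser can treat it as a noun."""
    return phrase.strip().replace(" ", "_")


def preprocess(wikitext: str) -> str:
    # Remove Wiki tables ({| ... |}); their contents are usually not sentences.
    text = re.sub(r"\{\|.*?\|\}", " ", wikitext, flags=re.DOTALL)
    # Remove HTML tags but keep the text between them.
    text = re.sub(r"<[^>]+>", "", text)
    # Triple quotation marks: drop the markup, keep the phrase as one chunk.
    text = re.sub(r"'''(.+?)'''", lambda m: _chunk(m.group(1)), text)
    # Drop any remaining double-quote (italic) markup.
    text = text.replace("''", "")
    # Keep link tags; for piped links [[target|text]] keep the target (assumption).
    text = re.sub(
        r"\[\[(.+?)\]\]",
        lambda m: "[[" + _chunk(m.group(1).split("|")[0]) + "]]",
        text,
    )
    return re.sub(r"\s+", " ", text).strip()


raw = "'''Lutz D. Schmadel''' is a [[Germany|German]] <b>[[astronomer]]</b>.\n{|\n| cell\n|}"
print(preprocess(raw))
```
]]></preformat>
        <p>Keeping the double brackets in the output lets the later steps distinguish hyperlinked terms from plain literals.</p>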
        <p>
          <bold>Parsing and Structure Tree Analysis.</bold>
Preprocessing provides
partially tagged sentences. In this step, the method parses the sentences to obtain
a structure tree and analyzes the structure tree to extract semantic relations. To
parse sentences, we adopted a lexicalized probabilistic parsing method based on
the factored product model. We used the Stanford parser [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] for this purpose. It
can parse a sentence accurately if the sentence is trimmed, chunked and tagged
correctly by preprocessing. The main POS (part-of-speech) tags used in
this step are listed in Table 2 (Right).
        </p>
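        <table-wrap>
          <label>Table 2 (Left)</label>
          <caption><p>Statistics of Wikipedia articles.</p></caption>
          <table>
            <thead>
              <tr><th>Statistic</th><th>Count</th></tr>
            </thead>
            <tbody>
              <tr><td># of concept pages (exc. redirect and category pages)</td><td>1,580,397</td></tr>
              <tr><td># of pages having more than 100 backward links: Pa</td><td>65,391</td></tr>
              <tr><td># of pages (in Pa) that begin with an is-a definition sentence: Pb</td><td>56,438</td></tr>
              <tr><td># of pages (in Pa) whose 1st sentence has links: Pc</td><td>62,642</td></tr>
              <tr><td># of Pb ∩ Pc</td><td>56,411</td></tr>
            </tbody>
          </table>
        </table-wrap>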
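        <table-wrap>
          <label>Table 2 (Right)</label>
          <caption><p>POS Tags.</p></caption>
          <table>
            <thead>
              <tr><th>Tag</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>NN</td><td>Singular or mass noun</td></tr>
              <tr><td>NNS</td><td>Plural noun</td></tr>
              <tr><td>NNP</td><td>Singular proper noun</td></tr>
              <tr><td>NNPS</td><td>Plural proper noun</td></tr>
              <tr><td>NP</td><td>Noun phrase</td></tr>
              <tr><td>VB</td><td>Base form verb</td></tr>
              <tr><td>VBD</td><td>Past tense</td></tr>
              <tr><td>VBZ</td><td>3rd person singular</td></tr>
              <tr><td>VBP</td><td>Non 3rd person singular present</td></tr>
              <tr><td>VP</td><td>Verb phrase</td></tr>
              <tr><td>JJ</td><td>Adjective</td></tr>
              <tr><td>CC</td><td>Conjunction, coordinating</td></tr>
              <tr><td>IN</td><td>Conjunction, subordinating</td></tr>
            </tbody>
          </table>
        </table-wrap>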
        <p>For example, for the sentence “Lutz D. Schmadel is [[Germany]] [[astronomer]].”
about the person named “Lutz D. Schmadel,” the parser generates a
structure tree like this:
          <preformat>(S (NP (NN Lutz_D._Schmadel)) (VP (VBZ is) (NP (NN [[Germany]]) (NN [[astronomer]]))))</preformat>
        </p>
        <p>In our proposed method, the parser takes a partially tagged sentence produced
by preprocessing and generates a structure tree from the sentence. After that,
the structure tree is analyzed in order to extract triples (Subject, Predicate,
Object) in the following steps:
          <list list-type="order">
            <list-item><p>Extract the “(NP ...) (VP (VBZ/VBD/VBP ...) (NP ...))” pattern from the
parsed sentence.</p></list-item>
            <list-item><p>For both NPs, replace the NP by the last NN/NNS in the NP if the NP
consists of JJ and NN/NNS.</p></list-item>
            <list-item><p>For both NPs, split the NP into two NP parts if the NP contains CC. After
that, perform step 2 again.</p></list-item>
            <list-item><p>If the 1st NP is a synonym of the concept representing the article, replace
the NP part by the title of the main subject.</p></list-item>
            <list-item><p>Finally, extract the 1st NP part as a subject, the VB part as a predicate, and the
2nd NP part as an object.</p></list-item>
          </list>
        </p>
        <p>In the first step, we extract “(NP ...) (VP (VBZ/VBD/VBP ...) (NP ...))”
and assume that the 1st NP part is the subject, the VB part is the predicate,
the 2nd NP part is the object respectively.</p>
        <p>In the second step, for both NP parts, we replace NP by the last NN/NNS
term (or hyperlink) because the last term is the mainstay of the phrase. For
instance, the 2nd NP in the sentence about “Lutz D. Schmadel” consists of two
NN and both of them have a hyperlink to other pages and the 1st NN has a
link to a country “Germany”. So in this case, it obtains “[[astronomer]]” as the
mainstay of the object part.</p>
        <p>In the third step, NP will be separated if it contains CC such as “and” and
“or”. In the fourth step, if the 1st NP is a literal and it is a synonym of the
concept representing the article, then the NP is replaced by the concept of the
article. Finally, the first NP part is extracted as a subject, the VB part as a
predicate, the 2nd NP part as an object.</p>
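        <p>The five steps above can be sketched as a small tree walk. This sketch does not call the Stanford parser; it reads a bracketed structure tree that is assumed to be given, implements only the normal pattern, and treats NNP/NNPS like NN/NNS when picking the last noun (an assumption made so that proper-noun subjects are handled). All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import List, Set, Tuple

NOUNS = {"NN", "NNS", "NNP", "NNPS"}
VERBS = {"VBZ", "VBD", "VBP"}


def parse_tree(text: str):
    """Read a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        label, children = tokens[pos + 1], []
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])
                pos += 1
        return (label, children), pos + 1

    return read(0)[0]


def np_heads(np) -> List[str]:
    """Steps 2 and 3: split the NP at CC, then keep the last noun of each part."""
    groups, current = [], []
    for child in np[1]:
        if isinstance(child, tuple) and child[0] == "CC":
            groups.append(current)
            current = []
        else:
            current.append(child)
    groups.append(current)
    heads = []
    for group in groups:
        nouns = [c[1][0] for c in group if isinstance(c, tuple) and c[0] in NOUNS and c[1]]
        if nouns:
            heads.append(nouns[-1])
    return heads


def extract_triples(tree, title: str, synonyms: Set[str]) -> List[Tuple[str, str, str]]:
    """Steps 1, 4 and 5 for the pattern (NP ...) (VP (VBZ/VBD/VBP ...) (NP ...))."""
    kids = [c for c in tree[1] if isinstance(c, tuple)]
    triples = []
    for np, vp in zip(kids, kids[1:]):
        if np[0] != "NP" or vp[0] != "VP":
            continue
        vp_kids = [c for c in vp[1] if isinstance(c, tuple)]
        if len(vp_kids) < 2 or vp_kids[0][0] not in VERBS or vp_kids[1][0] != "NP":
            continue
        predicate = vp_kids[0][1][0]
        for subject in np_heads(np):
            if subject in synonyms:  # step 4: map a synonym to the article title
                subject = title
            for obj in np_heads(vp_kids[1]):
                triples.append((subject, predicate, obj))
    return triples


tree = parse_tree(
    "(S (NP (NN Lutz_D._Schmadel)) (VP (VBZ is) (NP (NN [[Germany]]) (NN [[astronomer]]))))"
)
print(extract_triples(tree, "Lutz D. Schmadel", {"Lutz_D._Schmadel"}))
```
]]></preformat>
        <p>For the “Lutz D. Schmadel” sentence, the object NP contains two nouns and no CC, so only the last one, “[[astronomer]]”, is kept as the object.</p>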
        <p>The POS tag pattern of the first step can be replaced by alternatives.
Currently, we have prepared the following three patterns for the first step:
          <list list-type="order">
            <list-item><p>(NP ...) (VP (VBZ/VBD/VBP ...) (NP ...)): normal pattern, e.g. “is-a”.</p></list-item>
            <list-item><p>(NP ...) (VP (NP (NP ...) (PP (IN ...) ...)): subordinating pattern, e.g. “is-a-part-of”.</p></list-item>
            <list-item><p>(NP ...) (VP (VBZ ...) (VP (VBN ...) ...)): passive pattern, e.g. “was-born-in”.</p></list-item>
          </list>
        </p>
        <p>We can prepare further POS tag patterns to improve the coverage of triples.
However, in this research, we applied these three basic patterns to confirm the
capability of this direction of research.</p>
        <p>In this research, we also extract a relation if the object part does not contain
any hyperlinks to other pages. We call such an object a “literal” object. For example, assume
that there is a sentence “Brescia is a city” with the following structure tree:
          <preformat>(S (NP (NNP [[Brescia]])) (VP (VBZ is) (NP (DT a) (NN city))))</preformat>
        </p>
        <p>
          The object part is “a city”; it is not a hyperlink to an article about “city”
but just a literal. Literal objects are not machine understandable, but the
literal information can be useful depending on the application even if the meaning of
the term cannot be specified. Therefore, we extract the literal information as well.
        </p>
        <p>
          <bold>Co-reference Resolution.</bold>
In a Wikipedia article, a number of
abbreviations, pronouns and different expressions are usually used to refer to an entity; thus
co-reference resolution is one of the technical issues in making the
parsing process accurate. Several previous studies on Wikipedia mining have proposed
co-reference resolution methods optimized for Wikipedia articles [
          <xref ref-type="bibr" rid="ref16 ref4">4,
16</xref>
          ]. Gang mentioned that emphasized words are likely
        </p>
        <p>Let us assume that there is a Wikipedia article <italic>A</italic><sub>t</sub> describing the
topic <italic>t</italic> (the main subject of the article). <italic>A</italic><sub>t</sub> is a set of sentences, and each sentence
<italic>a</italic> has a triple: subject <italic>s</italic><sub>a</sub>, predicate <italic>p</italic><sub>a</sub>, and object <italic>o</italic><sub>a</sub>. Co-reference resolution is
a procedure that judges whether <italic>s</italic><sub>a</sub> refers to the same topic as the main
subject <italic>t</italic>. We use three co-reference resolution approaches (including one
novel approach) based on the following three factors: article title (C1), frequent
pronouns (C2) and synonyms (C3).</p>
        <p>
          C1 detects a co-reference if the terms used in <italic>s</italic><sub>a</sub> are all
contained in the title of <italic>A</italic><sub>t</sub>. C2 uses pronouns for the judgment: it judges <italic>s</italic><sub>a</sub> to be
a co-reference to <italic>t</italic> if <italic>s</italic><sub>a</sub> is the most frequently used pronoun in <italic>A</italic><sub>t</sub>. C1 and C2
were proposed in previous research [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], but C3 is a novel approach proposed
by us. The main idea of C3 is to detect a co-reference if <italic>s</italic><sub>a</sub> is a
synonym of <italic>t</italic>. In addition, we investigated the effectiveness of combining these
three approaches in detail.
        </p>
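        <p>The three judgments C1, C2 and C3 and their combination can be sketched as follows. The pronoun list, the case-insensitive matching, and counting pronouns over sentence subjects (rather than over all words of the article) are simplifying assumptions of this example.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import List, Optional, Set

PRONOUNS = {"he", "she", "it", "they"}  # assumed pronoun list


def c1_title(subject: str, title: str) -> bool:
    """C1: every term of the subject is contained in the article title."""
    words = subject.lower().split()
    title_words = set(title.lower().split())
    return bool(words) and all(word in title_words for word in words)


def most_frequent_pronoun(subjects: List[str]) -> Optional[str]:
    """Most frequent pronoun among the sentence subjects of the article."""
    counts = Counter(s.lower() for s in subjects if s.lower() in PRONOUNS)
    return counts.most_common(1)[0][0] if counts else None


def c2_pronoun(subject: str, frequent: Optional[str]) -> bool:
    """C2: the subject is the most frequently used pronoun."""
    return frequent is not None and subject.lower() == frequent


def c3_synonym(subject: str, synonyms: Set[str]) -> bool:
    """C3: the subject is a known synonym of the main subject."""
    return subject.lower() in {s.lower() for s in synonyms}


def coreferences(subjects: List[str], title: str, synonyms: Set[str]) -> List[bool]:
    """Combine C1, C2 and C3: a subject is a co-reference if any approach fires."""
    frequent = most_frequent_pronoun(subjects)
    return [
        c1_title(s, title) or c2_pronoun(s, frequent) or c3_synonym(s, synonyms)
        for s in subjects
    ]


subjects = ["United Kingdom", "UK", "It", "It", "London", "Kingdom"]
print(coreferences(subjects, "United Kingdom", {"UK", "Britain"}))
```
]]></preformat>
        <p>Combining the approaches with a logical OR favors coverage; a stricter combination would trade recall for precision.</p>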
      </sec>
      <sec id="sec-3-3">
        <title>Parsing Strategies</title>
        <p><bold>LSP: Lead Sentence Parsing.</bold> LSP is a strategy that parses only the lead
sentences (the first n sentences). After a simple inspection, we found that a
considerable number of Wikipedia articles begin with definitive sentences containing
relations (hyperlinks) to other articles (concepts). In particular, the first sentence
often defines an “is-a” relation to another article. We took detailed statistics (Table
2 Left) from the English Wikipedia (Sept. 2006) to confirm this phenomenon.</p>
        <p>
          First, we excluded all redirect pages and category pages from
the statistics, because these pages are not concept pages but navigational pages.
After that, we removed all pages having only a few backward links (fewer than 100),
because such pages often contain noisy information and are not well structured
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Then, we investigated how many articles begin with a definitive sentence
(one containing is/are/was/were). The result showed that over 86.3% (Pb/Pa) of the remaining
pages begin with a definitive sentence.
        </p>
        <p>We also investigated whether the first sentences have hyperlinks to other
pages. The results showed that over 95.7% (Pc/Pa) of these pages begin with a
sentence having hyperlinks to other pages. Further, over 85.5% ((Pb ∩ Pc)/Pa)
of the pages begin with a definitive sentence having hyperlinks.</p>
        <p>In conclusion, the statistics revealed that a large number of pages in
Wikipedia have a high potential for extracting “is-a” relations to other concepts;
thus first sentence analysis seems to be a promising approach.</p>
        <p><bold>ISP: Important Sentence Parsing.</bold> ISP treats a sentence in a page as important
if the sentence contains important words/phrases for the page. Our assumption
is that sentences containing important words/phrases are likely to define
valuable relations to the main subject of the page; thus we can make
co-reference resolution accurate even if the subject of the sentence is a pronoun
or another expression for the main subject. We use pfibf to detect important
sentences. By using pfibf, a set of important links for each article (concept) in
Wikipedia can be extracted. ISP scans all sentences
in the article to extract sentences containing links to the associated concepts.
The extracted sentences are then parsed as the important sentences in the
article. For each link in a sentence, the method calculates pfibf, and the maximum value
denotes the importance of the sentence. The importance can be used for filtering
out unimportant sentences by specifying a threshold.</p>
        <p>For example, when analyzing the article about “Google,” associated concepts
such as “Search engine”, “PageRank” and “Google search” are extracted from
the association thesaurus. ISP then scans all sentences in the article and
extracts those containing links to these associated concepts.</p>
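        <p>Both parsing strategies amount to selecting sentences before parsing, which can be sketched as follows. The pfibf scores, the threshold, and the sample sentences are hypothetical, and the sentences are assumed to be already split and preprocessed so that links appear as “[[...]]”.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, List

LINK = re.compile(r"\[\[(.+?)\]\]")


def lead_sentences(sentences: List[str], n: int = 1) -> List[str]:
    """LSP: keep only the first n sentences of the article."""
    return sentences[:n]


def sentence_importance(sentence: str, pfibf_scores: Dict[str, float]) -> float:
    """ISP: importance is the maximum pfibf score over the links in the sentence."""
    return max((pfibf_scores.get(t, 0.0) for t in LINK.findall(sentence)), default=0.0)


def important_sentences(
    sentences: List[str], pfibf_scores: Dict[str, float], threshold: float
) -> List[str]:
    """ISP: keep the sentences whose importance reaches the threshold."""
    return [s for s in sentences if sentence_importance(s, pfibf_scores) >= threshold]


# Hypothetical article sentences and pfibf scores of linked concepts.
article = [
    "[[Google]] is a [[Search engine]] company.",
    "It was founded in [[1998]].",
    "Its ranking is based on [[PageRank]].",
]
scores = {"Search engine": 0.9, "PageRank": 0.8, "1998": 0.1}
print(lead_sentences(article))
print(important_sentences(article, scores, threshold=0.5))
```
]]></preformat>
        <p>Since the importance is computed from the links alone, unimportant sentences can be discarded without running the parser on them.</p>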
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>To demonstrate the effectiveness of our proposed method, we conducted two
experiments. The first experiment measured the co-reference
resolution accuracy. The second experiment measured the accuracy
of the extracted triples. We describe these experiments in detail below.</p>
      <sec id="sec-4-1">
        <title>Experiment 1: Co-reference resolution</title>
        <p>In this experiment, we first filtered out noisy pages by checking the number of
backward links of the articles and extracted 65,391 pages as a test collection. After
that, we parsed 2,508 sentences in 52 articles chosen randomly from the test
collection. A total of 1,002 triples were then extracted using the parsing patterns described
above. Examples of the articles used in this experiment include
Niagara Falls, Root beer, Deer, Arrow, Odonata, Marie Antoinette, Germany,
Colorado, and Blizzard.</p>
        <p>We manually checked whether the subject of each sentence is a co-reference
of the main subject of the article. In total, 216 sentence subjects were
co-references of the article subject. We used this data set to calculate precision,
recall and f-measure. The result is shown in Table 3.</p>
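        <p>For reference, the three measures can be computed as in the sketch below. The sets of sentence identifiers are hypothetical and are not the experimental data; the f-measure is taken to be the balanced harmonic mean of precision and recall, which is an assumption about the variant used.</p>
        <preformat><![CDATA[
```python
from typing import Set, Tuple


def precision_recall_f(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall and balanced f-measure of predicted co-references."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical sentence ids: judged by hand (gold) vs. detected by a method.
gold = {1, 2, 3, 4}
predicted = {1, 2, 5}
print(precision_recall_f(predicted, gold))
```
]]></preformat>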
        <p>As we can see, not surprisingly, C1 (article title approach) achieved quite
high precision. However, the precision of C2 (frequent pronouns approach) was
rather low. We investigated the reason and found that using
frequent pronouns is an error-prone strategy. In particular, the pronouns “it”
and “he/she” are often used not for the main subject of an article but with
other referents. We tried all combinations and found that the combination
of all methods achieved the highest f-measure. This means that the combination
of these three methods compensates for the weak points of each method, and is
therefore helpful for achieving higher coverage.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experiment 2: Triple extraction</title>
        <p>In this experiment, we first randomly selected 110 articles, from which a total of 1,016
sentences were extracted as a test set. After that, we applied the proposed method
to extract triples. We used LSP and ISP to improve the accuracy of triples. As
a baseline, we also parsed all sentences; we call this the “All Sentence Parsing (hereafter
ASP)” method. Table 3 shows the result of the experiment.</p>
        <p>First of all, the accuracy of the LSP method
is quite high. It achieved high-quality relation extraction for both literal objects
and non-literal objects. This strongly supports our assumption that the first sentence
carries useful information. We have no strong evidence, but we
think that this is due to the reliability of the sentences. Usually, the top
part of a page attracts much more attention than the bottom part. Thus, the
top part is edited by many authors and is well structured in most cases. Several
parsing errors occurred when the sentence was too complicated, which was the
cause of the accuracy loss.</p>
        <p>Second, the ISP method also achieved better results than ASP. In particular
for literal objects, the accuracy significantly improved. Furthermore, by using the
ISP method, we can determine whether a sentence contains important concepts
before parsing it, decreasing the analysis time significantly. We also believe that
the combination of LSP and ISP is a balanced method because it achieves high
coverage and high precision at the same time.</p>
        <p>Table 4 shows some examples of explicit relations extracted by LSP. An “explicit
relation” is a relation in which the object part is a hyperlink to another
article. As we can see, the extracted relations are very accurate. As mentioned
before, almost all Wikipedia articles begin with a definitional sentence, so LSP
mainly extracted “is-a” relations. Since the is-a relation is one of the most basic
(and important) relations in the Semantic Web, the result shows the capability of
this approach for ontology construction and its potential as a practical
approach to next-generation WWW technologies.</p>
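        <p>To make the distinction between explicit relations and relations with literal objects concrete, the following is a small sketch of one possible data structure. The class name, field names, and helper function are hypothetical and are not taken from the system described in this paper.</p>
        <preformatted><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExtractedRelation:
    """A (Subject, Predicate, Object) triple extracted from one sentence."""

    subject: str
    predicate: str
    obj: str
    # Title of the linked article when the object is a hyperlink;
    # None when the object is a plain literal.
    object_link: Optional[str] = None

    @property
    def is_explicit(self) -> bool:
        """An explicit relation is one whose object is a hyperlink."""
        return self.object_link is not None


def split_relations(
    relations: List[ExtractedRelation],
) -> Tuple[List[ExtractedRelation], List[ExtractedRelation]]:
    """Separate explicit relations from relations with literal objects."""
    explicit = [r for r in relations if r.is_explicit]
    literal = [r for r in relations if not r.is_explicit]
    return explicit, literal
```
]]></preformatted>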
        <p>Table 4 shows some examples of explicit relations extracted by ISP. Since ISP
analyzes important sentences in the article, it extracts various relations such as
“was born in,” “founded,” and “has.” However, machines cannot understand the
meaning of “was born in” without instruction from humans. Therefore, to
make the predicate part machine-understandable, we have to define the relations
between predicates. For example, “is” and “was” have the same meaning but differ
in tense. Given this kind of knowledge, machines can infer
semantic relations between two concepts. We believe that the relations among verbs
are quite limited compared with the relations between nouns, and thus defining them
does not impose an enormous workload.</p>
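        <p>The kind of predicate knowledge discussed above could be encoded, for example, as a small hand-written mapping from surface predicates to a canonical form plus a tense label. The sketch below is only an illustration under that assumption: the mapping entries, the names, and the tense labels are hypothetical and do not reproduce any mapping used in this work.</p>
        <preformatted><![CDATA[
```python
from typing import Dict, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical mapping: surface predicate -> (canonical predicate, tense).
# Tense variants such as "is" and "was" share one canonical predicate.
PREDICATE_MAP: Dict[str, Tuple[str, str]] = {
    "is": ("be", "present"),
    "was": ("be", "past"),
    "has": ("have", "present"),
    "had": ("have", "past"),
}


def canonical_predicate(predicate: str) -> Tuple[str, Optional[str]]:
    """Return (canonical form, tense); unknown predicates pass through."""
    key = " ".join(predicate.lower().split())
    return PREDICATE_MAP.get(key, (key, None))


def same_relation(a: Triple, b: Triple) -> bool:
    """True when two triples use the same predicate up to tense."""
    return canonical_predicate(a.predicate)[0] == canonical_predicate(b.predicate)[0]
```
]]></preformatted>
        <p>Predicates that are missing from the mapping are returned unchanged with no tense label, so the table can be extended incrementally as new predicates appear.</p>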
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we showed that Wikipedia articles are yet another invaluable corpus
for ontology extraction, presenting both detailed statistics and the effectiveness
of integrating parsing and link structure mining methods. The experimental
results showed that the integrated method and co-reference resolution
significantly improve the accuracy of triple extraction. In particular, the conviction that
lead sentences carry rich semantic information was strongly confirmed.
Furthermore, important sentence detection using link structure analysis was helpful
for filtering out inaccurate results.</p>
      <p>Above all, what we have tried to show in this paper is the
possibility and capability of semantic relation extraction using Wikipedia knowledge.
We believe that this direction will be an influential approach for the Semantic Web
in the near future, since Wikipedia has great potential for constructing a global
ontology. The extracted association thesaurus and semantic relations are available
on our Web site.</p>
      <p>Wikipedia Lab: http://wikipedia-lab.org</p>
      <p>Wikipedia Thesaurus: http://wikipedia-lab.org:8080/WikipediaThesaurusV2</p>
      <p>Wikipedia Ontology: http://wikipedia-lab.org:8080/WikipediaOntology</p>
      <p>We hope these concrete results will be helpful in judging the capability of
this approach. Our next step is to apply the extracted semantic relations to Semantic
Web applications (especially Semantic Web search). To do that, we need broader coverage of
relations, which requires enhancing the POS tag analysis patterns and the mappings among relations.</p>
      <p>Acknowledgment: This research was supported in part by the Microsoft Research
IJARC Core Project. We appreciate the helpful comments and advice from Prof. Yutaka
Matsuo at the University of Tokyo as well as from Prof. Takahiro Hara and Prof.
Shojiro Nishio at Osaka University.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          , “
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          ,”
          <source>in Proc. of International Joint Conference on Artificial Intelligence (IJCAI</source>
          <year>2007</year>
          ), pp.
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Nishio</surname>
          </string-name>
          , “
          <article-title>Wikipedia mining for an association web thesaurus construction,”</article-title>
          <source>in Proc. of IEEE International Conference on Web Information Systems Engineering (WISE</source>
          <year>2007</year>
          ), pp.
          <fpage>322</fpage>
          -
          <lpage>334</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          , “
          <article-title>WikiRelate! Computing semantic relatedness using Wikipedia,”</article-title>
          <source>in Proc. of National Conference on Artificial Intelligence (AAAI-06)</source>
          , pp.
          <fpage>1419</fpage>
          -
          <lpage>1424</lpage>
          ,
          July
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , “
          <article-title>PORE: Positive-only relation extraction from Wikipedia text</article-title>
          ,” in International Semantic Web Conference,
          <source>Asian Semantic Web Conference (ISWC/ASWC)</source>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>594</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. G.</given-names>
            <surname>Ives</surname>
          </string-name>
          , “
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          ,” in International Semantic Web Conference,
          <source>Asian Semantic Web Conference (ISWC/ASWC)</source>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Kim</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          , “
          <article-title>Interpreting semantic relations in noun compounds via verb semantics,”</article-title>
          <source>in Proc. of Conference on Applied Computational Linguistics (ACL)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
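          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          , “
          <article-title>YAGO: A core of semantic knowledge,”</article-title>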
          <source>in Proc. of International Conference on World Wide Web</source>
          , pp.
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Milne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Medelyan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , “
          <article-title>Mining domain-specific thesauri from Wikipedia: A case study,”</article-title>
          <source>in Proc. of ACM International Conference on Web Intelligence (WI)</source>
          , pp.
          <fpage>442</fpage>
          -
          <lpage>448</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
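          <string-name>
            <given-names>M.</given-names>
            <surname>Völkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          ,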
          <string-name>
            <given-names>H.</given-names>
            <surname>Haller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Studer</surname>
          </string-name>
          , “
          <article-title>Semantic Wikipedia,”</article-title>
          <source>in Proc. of International Conference on World Wide Web (WWW</source>
          <year>2006</year>
          ), pp.
          <fpage>585</fpage>
          -
          <lpage>594</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          , “
          <article-title>Automatic acquisition of hyponyms from large text corpora,”</article-title>
          <source>in Proc. of COLING</source>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>545</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Berland</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          , “
          <article-title>Finding parts in very large corpora,”</article-title>
          <source>in Proc. of Conference on Applied Computational Linguistics (ACL)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Chernov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iofciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , “
          <article-title>Extracting semantic relationships between Wikipedia categories,”</article-title>
          <source>in Proc. of Workshop on Semantic Wikis (SemWiki</source>
          <year>2006</year>
          ),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Winograd</surname>
          </string-name>
          , “
          <article-title>The PageRank citation ranking: Bringing order to the Web</article-title>
          ,”
          <source>Technical Report, Stanford Digital Library Technologies Project</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          , “
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          ,”
          <source>Journal of the ACM</source>
          , no. 5
          , pp.
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , “
          <article-title>Accurate unlexicalized parsing,”</article-title>
          <source>in Proc. of Meeting of the Association for Computational Linguistics (ACL</source>
          <year>2003</year>
          ), pp.
          <fpage>423</fpage>
          -
          <lpage>430</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>D. P. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ishizuka</surname>
          </string-name>
          , “
          <article-title>Relation extraction from Wikipedia using subtree mining,”</article-title>
          <source>in Proc. of National Conference on Artificial Intelligence (AAAI-07)</source>
          , pp.
          <fpage>1414</fpage>
          -
          <lpage>1420</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>