<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automatic Knowledge Acquisition from Text Based on Ontology-centric Knowledge Representation and Acquisition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yu-Sheng Lai</string-name>
          <email>laiys@itri.org.tw</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ren-Jr Wang</string-name>
          <email>rjwang@itri.org.tw</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Industrial Technology Research Institute</institution>
          ,
          <addr-line>Hsinchu, Taiwan</addr-line>
          ,
          <country country="TW">R.O.C.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Industrial Technology Research Institute</institution>
          ,
          <addr-line>Tainan</addr-line>
          ,
          <country country="TW">Taiwan, R.O.C.</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>With the development of the Semantic Web and ontology technologies, many ontologies have been built or will be built before long. Building on these ontologies, we investigate automatic knowledge acquisition from text. This paper presents an ontology-centric framework for knowledge representation and acquisition, called iOkra. By combining NLP technologies with replaceable ontologies, the framework is able to acquire knowledge of different domains from natural language input. The acquired knowledge is represented in the form of instances and statements associated with the ontologies.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural language processing</kwd>
        <kwd>knowledge representation</kwd>
        <kwd>knowledge acquisition</kwd>
        <kwd>ontology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In [3], Berners-Lee and the co-authors claimed that "the
Semantic Web is not a separate Web but an extension of
the current one, in which information is given well-defined
meaning, better enabling computers and people to work in
cooperation." It indicates that the data on the Semantic [...]</p>
    </sec>
    <sec id="sec-2">
      <title>THE FRAMEWORK</title>
      <p>As illustrated in Fig. 1, the framework, called iOkra, is
expected to automatically acquire knowledge from natural
language input, to represent the knowledge in the form of
instances and statements associated with the ontologies, and
to store the acquired knowledge in a knowledge base.</p>
      <p>The central ontologies comprise two kinds of
ontologies: linguistic ontologies and domain ontologies.
The main characteristic of linguistic ontologies is that they
are bound to the semantics of grammatical units, such as
words, nominal groups, etc. [5]. The domain ontologies
provide varied ontological information, which might be
domain-specific, task-oriented, or use-desirable.</p>
      <p>In the framework, the natural language input is processed
by several modules: morphological, syntactic, semantic,
and discourse analysis, followed by an arbitration
module.</p>
      <sec id="sec-2-1">
        <list list-type="bullet">
          <list-item>
            <p>The morphological analysis splits the input text into
words and connects each word to the ontologies.
The connections provide syntactic and semantic
information for the subsequent analyses.</p>
          </list-item>
          <list-item>
            <p>The syntactic analysis performs a semantic case
frame parsing. The information-based case grammar
[4] is adopted to suggest some of the thematic roles,
such as agent, patient, theme, goal, etc., in each
sentence.</p>
          </list-item>
          <list-item>
            <p>The semantic analysis finds the remaining roles
and identifies the statements (cf. RDF statements),
namely the concept for each word and the relations
between the word concepts, according to the
ontologies.</p>
          </list-item>
          <list-item>
            <p>The discourse analysis addresses contextual
issues, such as ellipsis and anaphora resolution.
It is currently an initial, ongoing task and
is not presented in the remainder of this paper.</p>
          </list-item>
          <list-item>
            <p>The arbitration module quantifies all possible
statements to reconcile conflicts, produces the final
statements, and stores them in a knowledge
base in the form of statements associated with
the ontologies.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>ONTOLOGY-BASED KNOWLEDGE REPRESENTATION</title>
      <p>What is knowledge representation (KR)? Allen considered
that "knowledge representation means different things to
different researchers [2]." For some, it concerns the
structure of the language used to express the knowledge.
For others, it concerns the content of sentences. Here we
are interested in the meaning representation of sentences.</p>
      <p>Stevens et al. presented an ontology-based knowledge
representation system for bioinformatics since they
believed that "the combination of an ontology with
associated instances is what is known as a knowledge base
[9]," in which the instances indicate the things represented
by concepts. In line with this notion, we represent knowledge
in an ontology-based representation system.</p>
    </sec>
    <sec id="sec-5">
      <title>The Ontologies</title>
      <p>An ontology basically consists of a set of concepts that
represent classes of objects, and a set of binary relations
defined on concepts. A special transitive relation
subClassOf represents a subsumption relationship between
concepts. The subsumption relations structure a taxonomy
for the ontology. In addition to the taxonomy, an ontology
typically contains a set of axioms explicitly or implicitly.
The axioms enhance the ontology for reasoning.</p>
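      <p>The role of the transitive subClassOf relation can be illustrated with a minimal sketch. The concept names and the single-parent, acyclic taxonomy below are assumptions made for illustration; they are not taken from the iOkra ontologies.</p>
      <preformat>
```python
# Hypothetical toy taxonomy: child concept -> parent concept.
# Assumes one parent per concept and no cycles.
SUBCLASS_OF = {
    "corporation": "organization",
    "organization": "thing",
    "natural_person": "person",
}


def ancestors(concept):
    """Return every concept that subsumes `concept`, nearest first."""
    chain = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        chain.append(concept)
    return chain


def is_subclass_of(child, parent):
    """subClassOf is transitive, so any ancestor counts as a superclass."""
    return parent in ancestors(child)


print(ancestors("corporation"))
print(is_subclass_of("corporation", "thing"))
```
      </preformat>
      <p>Walking the parent chain is one simple way to obtain the transitive closure; an ontology with multiple inheritance would need a graph traversal instead.</p>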
      <p>Maedche and Staab proposed an ontology-learning
framework [8] for the Semantic Web. In their case, they
formally defined an ontology as an 8-tuple &lt;L, C, HC, R,
HR, F, G, A&gt;, in which the first primitive L denotes a set of
strings that describe lexical entries for concepts and
relations, the middle six primitives structure the taxonomy of
the ontology, and the last primitive A is a set of axioms that
describe additional constraints on the ontology. The axioms
make implicit facts more explicit. Based on the same
definition, iOkra currently uses two ontologies: a linguistic
ontology and a domain ontology.</p>
      <sec id="sec-5-1">
        <title>Linguistic Ontology</title>
        <p>Following the DAML+OIL specification, Lai et al.
constructed a Chinese lexical ontology called CLO [7]. To
improve its ability in Traditional Chinese language
processing, we define a substantially amended version.
The major amendments are as follows:</p>
        <list list-type="order">
          <list-item>
            <p>To approach real-world applications such as
information extraction and knowledge acquisition, we
adjust the taxonomy. "人 (person)," "事
(affair)," "時 (time)," "地 (place)," and "物 (thing)" are
five basic entities in documents (Chen et al., 1998).
Therefore we define these five entities plus two
additional concepts, "屬性 (attribute)" and "數量
(quantity)," as the top-level concepts.</p>
          </list-item>
          <list-item>
            <p>To increase compatibility with other ontology
editors, such as OilEdit, the concept Lexicon in CLO is
eliminated from the amended version. Some of the lexical
entries are changed into instances. Others are moved to
new, more appropriate positions.</p>
          </list-item>
          <list-item>
            <p>To enhance the expressive power in linguistics, some
thematic roles, such as theme, goal, range, etc., are
interpreted as relations between concepts and added to
the ontology.</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-5-2">
        <title>Domain Ontology</title>
        <p>In different domains, one term can have
different meanings. For example, "大陸 (mainland)"
denotes a country, China, in a hard news article, but
denotes a corporation name, CEC, in a stock news article.
This means that different ontologies are required for different
domains, and even for different tasks.</p>
        <p>To address knowledge representation
and acquisition from news articles on the Taiwan stock
market, we create an ontology that covers the terminology
of the Taiwan stock market, such as industrial categories,
corporation names, product names, people names, proper
nouns, etc. Most of the terms are collected from the WWW and
are organized into the domain ontology automatically. A
small number are reorganized or modified manually.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Instance and Statement</title>
      <p>iOkra represents ontology-based knowledge with
two components: instances and statements. An instance is
a specific description of a concept. For example, "台積電
(TSMC)" is an instance of the concept "公司 (corporation)." A
statement specifies a relationship between instances. For
example, the concept "公司 (corporation)" has a "董事長
(board chairman)" relation to the concept "自然人 (natural
person)," and "張忠謀 (Morris C.M. Chang)" is the board
chairman of the corporation "台積電 (TSMC)." Fig. 2 is a
conceptual graph describing the relationships.</p>
      <p>[Fig. 2: conceptual graph with two layers, labeled "Concepts" and "Instances".]</p>
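      <p>The instance and statement components can be sketched as plain data types. The class names, field names, and the validity check below are assumptions made for illustration, not the storage format used by iOkra.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str      # e.g. "台積電"
    concept: str   # the concept this instance belongs to


@dataclass(frozen=True)
class Statement:
    subject: Instance
    relation: str
    obj: Instance


# Hypothetical relation table: relation -> (domain concept, range concept).
RELATIONS = {"董事長": ("公司", "自然人")}


def is_valid(statement):
    """A statement is valid if both instances fit the relation's concepts."""
    domain, range_ = RELATIONS[statement.relation]
    return (statement.subject.concept == domain
            and statement.obj.concept == range_)


tsmc = Instance("台積電", "公司")
chang = Instance("張忠謀", "自然人")
print(is_valid(Statement(tsmc, "董事長", chang)))
```
      </preformat>
      <p>Checking a statement against the concept-level relation is what ties the acquired knowledge to the ontology: the same relation with the two instances swapped would be rejected.</p>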
    </sec>
    <sec id="sec-9">
      <title>ONTOLOGY-CENTRIC KNOWLEDGE ACQUISITION FROM NATURAL LANGUAGE INPUT</title>
      <p>In the following, we will describe the three NLP modules:
morphological, syntactic, and semantic analysis, and their
cooperation with ontologies.</p>
    </sec>
    <sec id="sec-10">
      <title>Morphological Analysis</title>
      <p>A word segmentation algorithm is used for morphological
analysis. It splits a sentence into a sequence of words. The
words may be words in the general ontology,
proper nouns in the domain ontologies, or compound
words formed by a grammatical word-formation process. For
example, the sentence "聯電1月29日至2月27日處分聯發科股票150張
(UMC sold 150 kilo-shares of MediaTek
stocks during 1/29 to 2/27.)" can be split into the words
"聯電 (UMC)," "1月29日 (1/29)," "至 (to)," "2月27日 (2/27),"
"處分 (sold)," "聯發科 (MediaTek)," "股票 (stock),"
"150," and "張 (kilo-shares)."</p>
      <p>The corporation names "聯電 (UMC)" and "聯發科
(MediaTek)" come from the domain ontology, while the dates
"1月29日 (1/29)" and "2月27日 (2/27)" and the numeral
determinatives (ND) "150" and "9678萬 (96.78 million)" come
from a word-formation process.</p>
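      <p>The segmentation step can be sketched as follows. The paper does not name its segmentation algorithm, so this sketch assumes forward maximum matching against a small hypothetical lexicon, with regular expressions standing in for the word-formation rules for dates and numerals.</p>
      <preformat>
```python
import re

# Hypothetical lexicon standing in for the general and domain ontologies.
LEXICON = {"聯電", "聯發科", "至", "處分", "股票", "張"}
# Hypothetical word-formation rules: dates first, then bare numerals.
PATTERNS = [re.compile(r"\d+月\d+日"), re.compile(r"\d+")]
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Split text into words by rules first, then longest lexicon match."""
    words, i = [], 0
    while i != len(text):
        for pattern in PATTERNS:
            match = pattern.match(text, i)
            if match:
                words.append(match.group())
                i = match.end()
                break
        else:
            for n in range(MAX_LEN, 0, -1):
                candidate = text[i:i + n]
                if candidate in LEXICON:
                    words.append(candidate)
                    i += len(candidate)
                    break
            else:
                # Unknown character: emit it as a single-character word.
                words.append(text[i])
                i += 1
    return words


print(segment("聯電1月29日至2月27日處分聯發科股票150張"))
```
      </preformat>
      <p>On the example sentence, this sketch yields the same nine words as listed above. A greedy matcher of this kind cannot resolve overlapping ambiguities, which is one reason segmentation errors arise in practice.</p>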
    </sec>
    <sec id="sec-11">
      <title>Syntactic Analysis</title>
      <p>
        A shallow syntactic analysis is performed in this module
because a full Chinese grammar is not available. The analysis is
divided into two phases. In the first phase, a
phrase-formation process is performed. A parser based on the
CYK algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used to concatenate words into
phrases. For example, the three words "1月29日 (1/29),"
"至 (to)," and "2月27日 (2/27)" in Table 1 can be combined to
form the phrase "1月29日至2月27日 (1/29 to 2/27)."
      </p>
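      <p>The phrase-formation phase can be illustrated with a small CYK chart. The category names and the two binary rules are hypothetical; the actual phrase grammar of iOkra is not given in the paper.</p>
      <preformat>
```python
from itertools import product

# Hypothetical binary rules in Chomsky normal form:
# (left category, right category) -> parent category.
BINARY_RULES = {
    ("DATE", "RANGE_TAIL"): "DURATION",
    ("TO", "DATE"): "RANGE_TAIL",
}


def cyk(tags):
    """Fill a CYK chart; table[(i, j)] holds categories spanning tags[i:j]."""
    n = len(tags)
    table = {(i, i + 1): {tag} for i, tag in enumerate(tags)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for left, right in product(table[(i, k)], table[(k, j)]):
                    parent = BINARY_RULES.get((left, right))
                    if parent:
                        cell.add(parent)
            table[(i, j)] = cell
    return table


# "1月29日" "至" "2月27日" tagged as DATE, TO, DATE.
chart = cyk(["DATE", "TO", "DATE"])
print(chart[(0, 3)])
```
      </preformat>
      <p>The cell covering the whole span contains DURATION, which corresponds to combining the three words into one phrase.</p>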
      <p>In the second phase, we use the Information-based Case
Grammar (ICG) to recognize some of the thematic roles of
the words in a sentence. The thematic roles are
defined in the general ontology and are represented as
relations. For example, the basic ICG pattern
AGENT[{NP, PP[由]}]&lt;VC2&lt;GOAL[NP] denotes that a
verbal head with the syntactic category "VC2" has two
thematic roles: agent and goal. The agent can be an NP
(noun phrase) or a PP[由] (preposition phrase led by "由
(by or through)"), and should occur on the left-hand side of
the head. The goal can be an NP and
should occur on the right-hand side of the head.</p>
      <p>A head-driven approach is used to recognize the
thematic roles with the basic patterns. We design an
automaton, called the ICG-machine, to perform the recognition
process. It is somewhat different from the Mealy machine.
An enhanced scanning algorithm enables the ICG-machine
to scan an input and output all acceptance paths. In addition, it
is able to scan a fragmentary input and output partially
matched paths when no fully matched path exists. For the
basic patterns of each syntactic category, we create
an ICG-machine to perform the recognition process. For
example, there are five basic patterns for the syntactic
category "VC2." These basic patterns can be used to create
the ICG-machine illustrated in Fig. 3.</p>
      <p>[Fig. 3: ICG-machine for the syntactic category "VC2". Transition labels recoverable from the figure: NP/AGENT, PP[由]/AGENT, VC2/HEAD, NP/GOAL, NP/THEME, and */null.]</p>
      <p>For example, when the machine scans the input "NP1 NP2
VC2 NP3 NP4 DM," it outputs the four matched paths shown in
Table 2.</p>
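      <p>The idea of scanning an input and returning all acceptance paths can be sketched with a small nondeterministic transducer. This is a simplified illustration under stated assumptions: it encodes only the single agent-head-goal pattern described in the text, not the five-pattern machine of Fig. 3, and the state numbering and the rule that any token may be skipped by a "*/null" move are hypothetical.</p>
      <preformat>
```python
# Hypothetical transition table for one basic pattern (agent, VC2 head, goal).
# (state, syntactic category) -> (thematic role, next state)
TRANSITIONS = {
    (0, "NP"): ("AGENT", 1),
    (0, "PP[由]"): ("AGENT", 1),
    (1, "VC2"): ("HEAD", 2),
    (2, "NP"): ("GOAL", 3),
}
ACCEPTING = {3}


def scan(tokens, state=0, path=()):
    """Return every accepting path as a tuple of (token, role) pairs."""
    if not tokens:
        return [path] if state in ACCEPTING else []
    (word, category), rest = tokens[0], tokens[1:]
    paths = []
    step = TRANSITIONS.get((state, category))
    if step:
        role, next_state = step
        paths += scan(rest, next_state, path + ((word, role),))
    # The "*/null" move: consume the token without assigning a role.
    paths += scan(rest, state, path + ((word, None),))
    return paths


tokens = [("NP1", "NP"), ("NP2", "NP"), ("VC2", "VC2"),
          ("NP3", "NP"), ("NP4", "NP"), ("DM", "DM")]
paths = scan(tokens)
for path in paths:
    print([(word, role) for word, role in path if role])
```
      </preformat>
      <p>For the input above, this sketch finds four accepting paths, one for each combination of agent (NP1 or NP2) and goal (NP3 or NP4). Returning partially matched paths for fragmentary input, as the ICG-machine does, is not covered here.</p>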
    </sec>
    <sec id="sec-12">
      <title>Semantic Analysis</title>
      <p>If the word "處分 (sold)," with the syntactic category
"VC2," is the head of the sentence, the agent is probably
"聯電 (UMC)" or "1月29日至2月27日 (1/29 to 2/27)" and the
goal is probably "聯發科 (MediaTek)" or "股票 (stock),"
since all of them are noun phrases (see Table 2). In other
words, there are two candidates each for the agent
and the goal. However, the head takes only one agent and
one goal in this case. These are syntactic ambiguities.</p>
      <p>The ontologies are used to resolve the ambiguities. In
the general ontology, the concept "sell" has an "agent"
relation to the concept "corporation." The phrase "1月29日至2月27日
(1/29 to 2/27)" cannot be an instance of the
concept "corporation." Therefore the word "聯電 (UMC)"
is the agent. For the same reason, the goal is "股票
(stock)." Some syntactic ambiguities can thus be resolved by
specific constraints of the ontology. The domain
ontology provides the same functionality. After
the thematic roles are recognized, the sentence can be
interpreted as the conceptual graph shown in Fig. 4.</p>
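      <p>The ontology-based filtering of role candidates can be sketched as a simple lookup. The table contents are assumptions for illustration; in particular, the text states only the "agent" constraint of "sell" explicitly, so the "goal" constraint used here (the filler must be a "stock") is hypothetical.</p>
      <preformat>
```python
# Hypothetical instance table: word or phrase -> concept.
INSTANCE_OF = {
    "聯電": "corporation",
    "聯發科": "corporation",
    "股票": "stock",
    "1月29日至2月27日": "duration",
}
# (head concept, role) -> concept that the role filler must instantiate.
# The "goal" entry is an assumption; only "agent" is stated in the text.
ROLE_RANGE = {
    ("sell", "agent"): "corporation",
    ("sell", "goal"): "stock",
}


def resolve(head_concept, role, candidates):
    """Keep only the candidates whose concept satisfies the role constraint."""
    required = ROLE_RANGE[(head_concept, role)]
    return [c for c in candidates if INSTANCE_OF.get(c) == required]


print(resolve("sell", "agent", ["聯電", "1月29日至2月27日"]))
print(resolve("sell", "goal", ["聯發科", "股票"]))
```
      </preformat>
      <p>An exact concept match keeps the sketch short; a fuller version would also accept instances of subclasses of the required concept.</p>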
      <p>At this point, the roles of three items, "1月29日至2月27日
(1/29 to 2/27)," "聯發科 (MediaTek)," and "150張 (150
kilo-shares)," have not yet been identified. A common
characteristic of languages, "local dependency," exists
everywhere in text. Using this characteristic, we find the
nearest relations between unrecognized and recognized
words. Thus, three additional relationships can be found.
The head "處分 (sold)" has a "time" relation to the duration
"1月29日至2月27日 (1/29 to 2/27)." The word "股票
(stock)" has a "corporation-of-issue" relation to the word
"聯發科 (MediaTek)" and a "quantity" relation to the phrase
"150張 (150 kilo-shares)." Fig. 5 shows the full
relationships among the members of the sentence.</p>
      <p>[Fig. 5: conceptual graph of the sentence. "處分 (sold)" has agent "聯電 (UMC)," time "1月29日至2月27日 (1/29 to 2/27)," and goal "股票 (stock)"; "股票 (stock)" has company of issue "聯發科 (MediaTek)" and quantity "150張 (150 kilo-shares)."]</p>
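      <p>The use of local dependency can be sketched as a nearest-neighbor search constrained by the ontology. The word positions, the concept names, and the relation table are assumptions for illustration, and the table is assumed to list only relations that are not already filled thematic roles.</p>
      <preformat>
```python
# (position in sentence, word, concept); positions are hypothetical indices.
RECOGNIZED = [(0, "聯電", "corporation"), (2, "處分", "sell"),
              (4, "股票", "stock")]
UNRECOGNIZED = [(1, "1月29日至2月27日", "duration"),
                (3, "聯發科", "corporation"), (5, "150張", "quantity")]

# Hypothetical relations: (domain concept, range concept) -> relation name.
RELATIONS = {
    ("sell", "duration"): "time",
    ("stock", "corporation"): "corporation-of-issue",
    ("stock", "quantity"): "quantity",
}


def attach(unrecognized, recognized):
    """Link each unrecognized word to the nearest word that can relate to it."""
    statements = []
    for pos, word, concept in unrecognized:
        by_distance = sorted(recognized, key=lambda r: abs(r[0] - pos))
        for _, r_word, r_concept in by_distance:
            relation = RELATIONS.get((r_concept, concept))
            if relation:
                statements.append((r_word, relation, word))
                break
    return statements


for statement in attach(UNRECOGNIZED, RECOGNIZED):
    print(statement)
```
      </preformat>
      <p>With these assumed tables, the sketch produces the three relationships described above: a "time" relation from the head, and "corporation-of-issue" and "quantity" relations from "股票 (stock)."</p>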
    </sec>
    <sec id="sec-13">
      <title>EXPERIMENTS AND DISCUSSION</title>
      <p>For an initial evaluation of the performance of iOkra in automatic
knowledge acquisition, we conduct an experiment on a
collection of 501 news titles randomly selected from
Yahoo!股市 (tw.stock.yahoo.com). Each title may
consist of one or more clauses and is manually annotated as
a set of instances and statements. The evaluation metrics
used in this experiment are recall rate, precision rate,
and F-measure. The evaluation is carried out at the title,
statement, and concept levels; a title counts as correct only if all
the statements in the title are fully recognized. The
experimental result is shown in Table 3.</p>
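      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Level</th><th>Recall rate</th><th>Precision rate</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>Title</td><td>65.86%</td><td>66.00%</td><td>65.93%</td></tr>
            <tr><td>Statement</td><td>78.21%</td><td>83.73%</td><td>80.88%</td></tr>
            <tr><td>Concept</td><td>86.80%</td><td>91.85%</td><td>89.25%</td></tr>
          </tbody>
        </table>
      </table-wrap>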
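      <p>The paper does not define its F-measure. Assuming the balanced F-measure (the harmonic mean of recall and precision), the short check below recomputes the third column of Table 3 from the first two; the function name is illustrative.</p>
      <preformat>
```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, reported F-measure %) as listed in Table 3.
TABLE_3 = {
    "Title": (65.86, 66.00, 65.93),
    "Statement": (78.21, 83.73, 80.88),
    "Concept": (86.80, 91.85, 89.25),
}

for level, (recall, precision, reported) in TABLE_3.items():
    print(level, round(f_measure(recall, precision), 2), reported)
```
      </preformat>
      <p>Under this assumption, the recomputed values agree with the reported F-measures to two decimal places.</p>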
      <p>The test data contains some titles that cannot be split
into words correctly by the automatic word segmentation
process. Therefore we conduct an additional experiment on
the titles that can be split correctly. This set contains 391
titles. The experimental result is shown in Table 4.</p>
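      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Level</th><th>Recall rate</th><th>Precision rate</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>Title</td><td>70.84%</td><td>71.02%</td><td>70.93%</td></tr>
            <tr><td>Statement</td><td>81.46%</td><td>86.45%</td><td>83.88%</td></tr>
            <tr><td>Concept</td><td>89.35%</td><td>94.60%</td><td>91.90%</td></tr>
          </tbody>
        </table>
      </table-wrap>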
      <p>From an error analysis, we group the errors into
two aspects: NLP technologies and ontology engineering.
In NLP, there are three major problems:</p>
      <list list-type="order">
        <list-item>
          <p>Ellipsis and anaphora problem. Many titles consist
of several clauses. Some of the clauses share a
common word.</p>
        </list-item>
        <list-item>
          <p>Unknown word problem. Many newly created words,
translated names, loanwords, etc. occur in the titles.</p>
        </list-item>
        <list-item>
          <p>Word segmentation problem. As shown in Tables 3
and 4, many errors result from word segmentation.</p>
        </list-item>
      </list>
      <p>For iOkra, several derived research topics in the
ontology field are as follows:</p>
      <list list-type="order">
        <list-item>
          <p>Consistency between different ontologies. In a
multi-ontology-supported system, how to maintain
consistency between different ontologies is a
well-known and important issue.</p>
        </list-item>
        <list-item>
          <p>Integration between ontology and knowledge base.
In an ontology-based knowledge system, when either the
ontology or the knowledge base is changed, the other
should be updated to reflect the change.</p>
        </list-item>
        <list-item>
          <p>Cross-domain knowledge. Knowledge in text may
span two or more domains. How to acquire and
represent such knowledge is still an open problem.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>This paper presents an ontology-centric knowledge
representation and acquisition framework, called iOkra.
Combining NLP technologies with replaceable ontologies,
the framework is able to automatically acquire knowledge
from natural language input. Based on iOkra, a prototypical
document annotation system is constructed. By using
different domain ontologies, the system is able to
automatically annotate text documents of different
domains.</p>
      <p>Preliminary experimental results show that the system
achieves an F-measure of 65.93% at the title level,
80.88% at the statement level, and 89.25% at the concept level.
On the titles that are segmented correctly,
the F-measures are 70.93% at the title level, 83.88%
at the statement level, and 91.90% at the concept level. In
future work, we will address the research topics mentioned
above.</p>
    </sec>
    <sec id="sec-15">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This paper is a partial result of Project A321XS1A10
conducted by ITRI under the sponsorship of the Ministry of
Economic Affairs, R.O.C. The authors would like to thank
the CKIP Group of Academia Sinica, R.O.C. for providing the ICG.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Aho</surname>, <given-names>A.V.</given-names></string-name>
          and
          <string-name><surname>Ullman</surname>, <given-names>J.D.</given-names></string-name>
          <source>The Theory of Parsing, Translation, and Compiling</source>
          , Prentice Hall, Englewood Cliffs, N.J.,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Allen</surname>, <given-names>J.</given-names></string-name>
          <source>Natural Language Understanding</source>
          , The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Berners-Lee</surname>, <given-names>T.</given-names></string-name>
          ,
          <string-name><surname>Hendler</surname>, <given-names>J.</given-names></string-name>
          and
          <string-name><surname>Lassila</surname>, <given-names>O.</given-names></string-name>
          <article-title>The Semantic Web</article-title>
          ,
          <source>Scientific American</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Chen</surname>, <given-names>K.J.</given-names></string-name>
          and
          <string-name><surname>Huang</surname>, <given-names>C.R.</given-names></string-name>
          <article-title>Information-based Case Grammar</article-title>
          ,
          <source>In Proceedings of the 13th International Conference on Computational Linguistics (COLING '90)</source>
          , University of Helsinki, Finland,
          <volume>2</volume>
          (
          <year>1990</year>
          ),
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Gomez-Perez</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Fernandez-Lopez</surname>, <given-names>M.</given-names></string-name>
          and
          <string-name><surname>Corcho</surname>, <given-names>O.</given-names></string-name>
          <source>Technical Roadmap D.1.1.2</source>
          , OntoWeb,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Heflin</surname>, <given-names>J.</given-names></string-name>
          <article-title>Web Ontology Language (OWL) Use Cases and Requirements, working draft 3</article-title>
          , W3C,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Lai</surname>, <given-names>Y.S.</given-names></string-name>
          ,
          <string-name><surname>Wang</surname>, <given-names>R.J.</given-names></string-name>
          and
          <string-name><surname>Hsu</surname>, <given-names>W.K.</given-names></string-name>
          <article-title>A DAML+OIL-Compliant Chinese Lexical Ontology</article-title>
          ,
          <source>In Proceedings of the 19th International Conference on Computational Linguistics</source>
          ,
          <volume>2</volume>
          (
          <year>2002</year>
          ),
          <fpage>1238</fpage>
          -
          <lpage>1242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Maedche</surname>, <given-names>A.</given-names></string-name>
          and
          <string-name><surname>Staab</surname>, <given-names>S.</given-names></string-name>
          <article-title>Ontology Learning for the Semantic Web</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>16</volume>
          ,
          <issue>2</issue>
          (
          <year>2001</year>
          ),
          <fpage>72</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Stevens</surname>, <given-names>R.</given-names></string-name>
          ,
          <string-name><surname>Goble</surname>, <given-names>C.A.</given-names></string-name>
          and
          <string-name><surname>Bechhofer</surname>, <given-names>S.</given-names></string-name>
          <article-title>Ontology-based Knowledge Representation for Bioinformatics</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>1</volume>
          ,
          <issue>4</issue>
          (
          <year>2000</year>
          ),
          <fpage>398</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>