Towards Automatic Knowledge Acquisition from Text Based on Ontology-centric Knowledge Representation and Acquisition

Yu-Sheng Lai
Industrial Technology Research Institute, Tainan, Taiwan, R.O.C.
laiys@itri.org.tw

Ren-Jr Wang
Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C.
rjwang@itri.org.tw

ABSTRACT
With the development of the Semantic Web and ontology technologies, many ontologies have been built or will be built before long. Based on the ontologies, we attempt to investigate the technology of automatic knowledge acquisition from text. This paper presents an ontology-centric framework for knowledge representation and acquisition, called iOkra. By combining NLP technologies with replaceable ontologies, the framework is able to acquire different domain knowledge from natural language input. The acquired knowledge is represented in the form of instances and statements associated with the ontologies.

Categories and Subject Descriptors
I.2.4 Knowledge Representation Formalisms and Methods
I.2.7 Natural Language Processing

Keywords
Natural language processing, knowledge representation, knowledge acquisition, ontology

INTRODUCTION
Knowledge acquisition traditionally requires various specialists in logic, linguistics, philosophy, etc. Although many facilities have been developed to enable these specialists to collaborate, a large amount of handcrafting is still unavoidable. Automatic knowledge acquisition from text is attractive because textual documents and data are abundant. However, no fully satisfactory approaches to automatic knowledge acquisition from text have been proposed.

In [3], Berners-Lee and his co-authors claimed that "the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." This indicates that the data on the Semantic Web is not only human-readable but also machine-understandable.

Technically, researchers use ontologies to describe the semantics of websites. The W3C defines the Semantic Web as "the abstract representation of data on the World Wide Web, based on the RDF standards and other standards to be defined" and has been developing it in collaboration with many researchers and organizations. A document that specifies usage scenarios, goals and requirements for a web ontology language (OWL) has been proposed [6]. The OntoWeb Network, involving most European Union (EU) members, has been integrating academic and industrial resources to promote interdisciplinary work and strengthen the European influence on Semantic Web standardization efforts such as those based on RDF and XML.

The Semantic Web relies heavily on formal ontologies to structure data for comprehensive and transportable machine understanding [8]. With the development of the Semantic Web and ontology technologies, many ontologies have been built or will be built before long. This paper proposes an ontology-centric framework (see Fig. 1) that integrates natural language processing (NLP) technologies and the ontologies to automatically acquire knowledge from natural language input.

[Figure: natural language input; Morphological, Syntactic, Semantic, Discourse, and Arbitration Analysis modules; Ontologies; knowledge base.]
Figure 1. The proposed framework for knowledge acquisition.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
K-CAP’03, October 23-26, 2003, Sanibel Island, FL, USA.
Copyright 2003 ACM 1-58113-000-0/00/0000…$5.00

THE FRAMEWORK
As illustrated in Fig. 1, the framework, called iOkra, is expected to automatically acquire knowledge from natural language input, to represent the knowledge in the form of instances and statements associated with the ontologies, and to store the acquired knowledge in a knowledge base.

The central ontologies comprise two kinds of ontologies: linguistic ontologies and domain ontologies. The main characteristic of linguistic ontologies is that they are bound to the semantics of grammatical units, such as words, nominal groups, etc. [5]. The domain ontologies provide varied ontological information, which might be domain-specific, task-oriented, or use-desirable.

In the framework, the natural language input is processed through several modules, including the morphological, syntactic, semantic, and discourse analyses and the arbitration module.

The morphological analysis splits the input text into words and connects each word to the ontologies. The connections provide syntactic and semantic information for the subsequent analyses.

The syntactic analysis performs semantic case frame parsing. The information-based case grammar [4] is adopted to suggest some of the thematic roles, such as agent, patient, theme, goal, etc., in each sentence.

The semantic analysis finds the remaining roles and identifies the statements (cf. RDF statements), namely the concept for each word and the relations between the word concepts, according to the ontologies.

The discourse analysis addresses contextual issues, such as ellipsis and anaphora resolution. It is currently initial, ongoing work and will not be presented in the remainder of this paper.

The arbitration module quantifies all possible statements to reconcile conflicts, produces the final result statements, and stores the results in a knowledge base, which takes the form of statements associated with the ontologies.

ONTOLOGY-BASED KNOWLEDGE REPRESENTATION
What is knowledge representation (KR)? Allen considered that "knowledge representation means different things to different researchers [2]." For some, it concerns the structure of the language used to express the knowledge. For others, it concerns the content of sentences. Herein we are interested in the meaning representation of sentences.

Stevens et al. presented an ontology-based knowledge representation system for bioinformatics since they believed that "the combination of an ontology with associated instances is what is known as a knowledge base [9]," in which the instances indicate the things represented by concepts. Similar to this notion, we represent knowledge in an ontology-based representation system.

The Ontologies
An ontology basically consists of a set of concepts that represent classes of objects, and a set of binary relations defined on concepts. A special transitive relation, subClassOf, represents a subsumption relationship between concepts. The subsumption relations structure a taxonomy for the ontology. In addition to the taxonomy, an ontology typically contains a set of axioms, stated explicitly or implicitly. The axioms enhance the ontology for reasoning.

Maedche and Staab proposed an ontology-learning framework [8] for the Semantic Web. They formally defined an ontology as an 8-tuple, in which the first primitive, L, denotes a set of strings that describe lexical entries for concepts and relations, the middle six primitives structure the taxonomy of the ontology, and the last primitive, A, is a set of axioms that describe additional constraints on the ontology. The axioms make implicit facts more explicit. Based on the same definition, two ontologies, a linguistic ontology and a domain ontology, are currently used in iOkra.

Linguistic Ontology
Following the DAML+OIL specification, Lai et al. constructed a Chinese lexical ontology called CLO [7]. To improve its ability in Traditional Chinese language processing, we define an amended version that has been altered by a wide margin. The major amendments are as follows:

1. To approach real-world applications such as information extraction and knowledge acquisition, we adjust the taxonomy. "人 (person)," "事 (affair)," "時 (time)," "地 (place)," and "物 (thing)" are the five basic entities in documents (Chen et al., 1998). Therefore we define these five entities plus two additional concepts, "屬性 (attribute)" and "數量 (quantity)," as the topmost concepts.

2. To increase the compatibility with other ontology editors, such as OilEdit, the concept Lexicon in CLO is eliminated from the amendment. Some of the lexical entries are changed into instances. Others are moved to new, more appropriate positions.

3. To enhance the expressive power in linguistics, some thematic roles, such as theme, goal, range, etc., are interpreted as relations between concepts and added to the ontology.

Domain Ontology
Across domains, one term can be interpreted with many different meanings. For example, "大陸 (mainland)" means a country, China, in a hard news article, but means a corporation name, CEC, in a stock news article. This means that different ontologies are required for different domains, and even for different tasks.

Addressing the problem of knowledge representation and acquisition from news articles on the Taiwan stock market, we create an ontology that covers the terminology of the Taiwan stock market, such as industrial categories, corporation names, product names, people names, proper nouns, etc. Most of the terms are collected from the WWW and are organized into the domain ontology automatically. A small number are reorganized or modified manually.

Instance and Statement
iOkra represents ontology-based knowledge consisting of two components: instances and statements. An instance is a specific description of a concept. For example, "台積電 (TSMC)" is an instance of the concept "公司 (corporation)." A statement specifies a relationship between instances. For example, the concept "公司 (corporation)" has a "董事長

determinatives (ND), "150" and "9678萬 (96.78 million)," from a word-formation process.

Syntactic Analysis
A shallow syntactic analysis is performed in this module due to the lack of a full Chinese grammar. The analysis is divided into two phases. In the first phase, a phrase-formation process is performed. A parser based on the CYK algorithm [1] is used to concatenate words into phrases. For example, the three words "1月29日 (1/29)," "至 (to)," and "2月27日 (2/27)" in Table 1 can be combined to form the phrase "1月29日至2月27日 (1/29 to 2/27)."

In the second phase, we use the Information-based Case Grammar (ICG) to recognize some of the thematic roles of each of the words in a sentence. The thematic roles are defined in the general ontology and are represented as relations. For example, a basic pattern in the ICG AGENT[{NP, PP[由]}]