A Semantic Annotation Tool to Extract Instances from Korean Web Documents
Hai-tao Zheng, Bo-Yeong Kang, Sang-Ok Koo, Hee-Chul Choi, Kwang-Sub Kim, Hong-Gee Kim¹
Biomedical Knowledge Engineering Laboratory, Seoul National University,
Yeongeon-dong, Jongro-gu, Seoul, Korea, 110-749
{quickly,comeng99,ironyjk,kss,hgkim}@snu.ac.kr, sangokkoo@gmail.com


ABSTRACT
Although semantic annotation tools have been studied extensively in recent years, only a few systems support automatic information extraction. In this paper, we propose a semantic annotation system named SARM, which has an automatic instance extraction module based on two machine learning techniques: a Bayesian classifier and a Support Vector Machine (SVM). SARM has been tested by evolving a Korean Restaurant ontology with instances extracted automatically from Web documents in Korean. The automatic instance extraction module can accelerate annotation work, which is very time-consuming and labor-intensive. We describe the implementation of our system and compare the performance of the two machine learning methods we used.

Categories and Subject Descriptors
H.4.3 [Communications Applications], I.2.6 [Learning], I.2.7 [Natural Language Processing]

General Terms
Performance, Design, Experimentation

Keywords
SARM, Bayesian Classifier, SVM, Information Extraction, Korean Restaurant Ontology

1. INTRODUCTION
This paper proposes a semantic annotation system named SARM, which has an automatic instance extraction module based on two machine learning techniques, a Bayesian classifier and a Support Vector Machine. SARM has been tested by evolving a Korean Restaurant ontology with instances extracted automatically from Web documents in Korean.

Recently, there has been extensive research on ontology-based annotation tools that facilitate the annotation of web document items in manual or automatic ways. The KIM Semantic Annotation Platform[2], for example, provides a Knowledge and Information Management (KIM) infrastructure and services for automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured contents. Another tool, MnM[3], provides both automated and semi-automated support for annotating web pages with semantic contents. Compared to KIM and MnM, our system is more suitable for domain- and language-specific annotation tasks, and it has therefore been tested on the task domain of searching restaurant information in Korean.

In this paper, we first present the system architecture of SARM. We then elaborate the learning methods and the mechanism of instance extraction for the Korean restaurant ontology, which can accelerate time-consuming annotation work. Finally, we describe the experimental results of SARM.

2. INSTANCE EXTRACTION FOR KOREAN RESTAURANT ONTOLOGY
In this section, we explain the proposed semantic annotation system (SARM), which has an automatic instance extraction module based on two machine learning techniques. SARM has been tested by evolving a Korean Restaurant ontology with instances extracted automatically from Web documents in Korean.

2.1 Overall Architecture
Fig. 1 shows the overall architecture of SARM. SARM consists of an automatic instance extraction module (Web Document Crawler, Web RawDB, HTML Parser, Morphology Analyzer, Bayesian Classifier and SVM Classifier), a Semantic Annotator (with an API for remote access, embedding, and integration), a Domain Ontology, Annotated Results (the semantically annotated web contents, as RDF or OWL statements) and front-ends (the user interface with a web browser control and a knowledge explorer for ontology navigation).

First, the crawler collects domain-specific web documents as input for the Bayesian classifier and the SVM classifier. After HTML parsing and morphological analysis of the crawled web documents (Web RawDB), the classifier learns the features of the restaurant instances in the preprocessed web documents. Thus, given new web documents, the classifier can extract restaurant domain instances based on the training data. The user interface with a web browser control[1] supports functions such as ontology import/export, manual annotation editing by the user, and annotation browsing with instance extraction.
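The learn-then-extract step can be illustrated with a minimal sketch, assuming a naive Bayes classifier that applies a maximum a posteriori decision over the nouns obtained from a candidate name. This is not SARM's implementation: the function names, the toy training pairs, the class labels and the add-alpha smoothing are all assumptions made for illustration, and the nouns are taken to be already decomposed by a morphological analyzer.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Estimate class priors and per-class word counts.

    examples: list of (word_vector, label) pairs, where word_vector is a
    list of nouns produced by compound-word decomposition.
    alpha: add-alpha (Laplace) smoothing constant; an assumption of this
    sketch, not something stated in the paper.
    """
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocabulary.update(words)
    return {
        "priors": {c: n / len(examples) for c, n in class_counts.items()},
        "word_counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in class_counts},
        "vocab_size": len(vocabulary),
        "alpha": alpha,
    }


def classify(model, words):
    """Return argmax_c p(c) * prod_i p(w_i | c), computed in log space."""
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        denom = model["totals"][label] + model["alpha"] * model["vocab_size"]
        score = math.log(prior)
        for w in words:
            count = model["word_counts"][label][w]
            score += math.log((count + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: decomposed nouns with hand-assigned labels.
TRAINING = [
    (["강", "변", "식당"], "restaurant"),
    (["서울", "집"], "restaurant"),
    (["김치", "찌개"], "dish"),
    (["된장", "찌개"], "dish"),
]

model = train_naive_bayes(TRAINING)
print(classify(model, ["서울", "식당"]))  # restaurant
print(classify(model, ["김치", "찌개"]))  # dish
```

Working in log space avoids numeric underflow when many word likelihoods are multiplied, and the smoothing keeps a noun unseen in one class from driving that class's probability to zero.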


¹ Corresponding author: Hong-Gee Kim. Tel: +82-20-740-8796. Email: hgkim@snu.ac.kr
[Figure 1: architecture diagram, not reproduced. Labeled components: Web Documents (Restaurant Contents); Automatic instance extraction module (Crawler, Web RawDB, HTML Parser, Morphology Analysis, Bayesian Classifier & SVM Classifier); Domain Ontology (Import/Export); Semantic Annotator (Input New Web Page, Extracted Instance); User Interface (Show Annotation Results); Annotated Documents & Metadata (Semantic Web Annotated Web contents, RDF or OWL Statements); Text & Multimedia Data; RDF stored documents; Distributed ontologies.]
Fig. 1. The Architecture of the proposed semantic annotation

2.2 Restaurant Domain Instance Extraction Based on Compound Word Learning
Most restaurant names in Korean are compound words. For example, the restaurant name 강-변-식당 (river-nearby-restaurant) is composed of three words: 강 (river), 변 (nearby), and 식당 (restaurant). Another example is 서울-집 (Seoul-house), which is composed of two words, 서울 (Seoul) and 집 (house). We also found that most Korean restaurant instances are composed of some combination of concepts such as house, location and dish. Therefore, restaurant instances can be recognized by applying a machine learning technique to the known combinations of concepts for restaurant names. To annotate compound words as a restaurant instance, we used an ontology that contains the concepts of restaurant, dish, beverage, and foodstuff. The restaurant class instantiates a set of instances which represent restaurant names, and the dish class instantiates a set of instances which represent dish names.

Based on the observation of data, multi-word decomposition was performed for each word in the training data. We then build a word vector V = (w_1, w_2, ..., w_n) representing the name of a restaurant instance. Here, w_i is a single noun obtained by decomposition with a Korean noun dictionary. We can apply naïve Bayesian classification using the MAP (maximum a posteriori) decision rule, as in the following equation.

\[
\mathrm{classify}(V_k) = \arg\max_{c} \; p(C = c) \prod_{i=1}^{n} p(w_i \mid C = c) \tag{3}
\]

We can also feed the word vectors V into an SVM classifier to find the optimal hyperplane (OHP) that best separates a set of training examples, as in the following equation. The OHP is obtained by minimizing the objective function O_l. Here, V_k (∈ R^N) is the k-th input vector and y_k ∈ {+1, -1} is the corresponding label for V_k in a two-class classification problem. The vector w is perpendicular to the OHP, and b is the offset of the hyperplane (cf. Equation 4).

\[
O_l = \frac{1}{2}\, w \cdot w, \quad \text{subject to } y_k (w \cdot V_k + b) - 1 \ge 0, \quad k = 1, \ldots, m \tag{4}
\]

3. EXPERIMENT RESULTS
The proposed method was applied to 467 web pages crawled[6] from JoyFood.Com², a restaurant-domain site. Before restaurant instances were extracted to construct the learning data, the 467 web pages went through a series of preprocessing steps: morphological analysis[4] and HTML parsing[5]. We then extracted a set of 1,260 restaurant instances, which were divided into two sets: one for training the classifier and the other for validation. The performance of the proposed method was evaluated by micro-averaged accuracy after 5-fold cross validation on the 1,260 extracted restaurant instances. In each fold, the train set and the test set consist of 1,008 and 252 restaurant instances, respectively.

With the Bayesian classifier, accuracy was 98% when the train set was used for both learning and testing, and it decreased to 94% when the test set was used after learning on the train set. With the SVM, accuracy was 96% when the train set was used for both learning and testing, and it decreased to 92% when the test set was used after learning on the train set.

4. CONCLUSION AND FUTURE WORK
In this paper, we presented a semantic annotation system with instance extraction for a Korean restaurant ontology. Although several semantic annotation tools exist, few offer automatic instance extraction suited to domain- and language-specific annotation tasks. Our annotation tool, SARM, is expected to help users annotate more effectively in the restaurant domain in the Korean language.

5. ACKNOWLEDGMENTS
This research is supported by the Ministry of Information and Communication, Republic of Korea - National Project (Project management of Institute for Information Technology Advancement).

6. REFERENCES
[1] Microsoft WebBrowser control. http://msdn.microsoft.com/workshop/browser/webbrowser/browser_control_ovw_entry.asp
[2] Borislav Popov, A.K., Angel Kirilov, Dimitar Manov, Damyan Ognyanoff, Miroslav Goranov. Kim-Semantic Annotation Platform. 2nd International Semantic Web Conference (ISWC2003), Vol. 2870. Springer-Verlag, Berlin Heidelberg (2003) 834-849.
[3] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt and Fabio Ciravegna. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. The 13th International Conference on Knowledge Engineering and Management (EKAW 2002), ed. Gomez-Perez, A., Springer-Verlag, 2002.
[4] HAM. http://nlp.kookmin.ac.kr/HAM/kor/index.html
[5] HTML parser. http://htmlparser.sourceforge.net/
[6] WIRE. http://www.cwr.cl/projects/WIRE/index.htm

² Joyfood.Com. http://www.joyfood.com