A Semantic Annotation Tool to Extract Instances from Korean Web Documents

Hai-tao Zheng, Bo-Yeong Kang, Sang-Ok Koo, Hee-Chul Choi, Kwang-Sub Kim, Hong-Gee Kim^1
Biomedical Knowledge Engineering Laboratory, Seoul National University,
Yeongeon-dong, Jongro-gu, Seoul, Korea, 110-749
{quickly,comeng99,ironyjk,kss,hgkim}@snu.ac.kr, sangokkoo@gmail.com

ABSTRACT
Although there has been extensive recent research on developing semantic annotation tools, only a few systems support automatic information extraction. In this paper, we propose a semantic annotation system named SARM, which has an automatic instance extraction module based on two machine learning techniques, a Bayesian classifier and a Support Vector Machine (SVM). SARM has been tested by evolving a Korean Restaurant ontology through automatic extraction of instances from Web documents in Korean. The automatic instance extraction module can accelerate annotation work, which is very time-consuming and involves a lot of human labor. We describe the implementation of our system and compare the performance of the two machine learning methods we used.

Categories and Subject Descriptors
H.4.3 [Communications Applications], I.2.6 [Learning], I.2.7 [Natural Language Processing]

General Terms
Performance, Design, Experimentation

Keywords
SARM, Bayesian Classifier, SVM, Information Extraction, Korean Restaurant Ontology

1. INTRODUCTION
This paper proposes a semantic annotation system named SARM, which has an automatic instance extraction module based on two machine learning techniques, a Bayesian classifier and a Support Vector Machine. SARM has been tested by evolving a Korean Restaurant ontology through automatic extraction of instances from Web documents in Korean.

Recently, there has been extensive research on ontology-based annotation tools that facilitate the annotation of web document items in manual or automatic ways. The KIM Semantic Annotation Platform[2], for example, provides a Knowledge and Information Management (KIM) infrastructure and services for automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured content. Another tool, MnM[3], provides both automated and semi-automated support for annotating web pages with semantic content. Compared to KIM and MnM, our system is more suitable for domain- and language-specific annotation tasks, and has therefore been tested on the task of searching restaurant information in Korean.

In this paper, we first present the system architecture of SARM. Second, we elaborate the learning methods and the mechanism of instance extraction for the Korean restaurant ontology, which can accelerate the time-consuming annotation work. Third, we describe the experimental results of SARM.

2. INSTANCE EXTRACTION FOR KOREAN RESTAURANT ONTOLOGY
In this section, we explain the proposed semantic annotation system (SARM), which has an automatic instance extraction module based on two machine learning techniques. SARM has been tested by evolving a Korean Restaurant ontology through automatic extraction of instances from Web documents in Korean.

2.1 Overall Architecture
Fig. 1 shows the overall architecture of SARM. SARM consists of an automatic instance extraction module (Web Document Crawler, Web RawDB, HTML Parser, Morphology Analyzer, Bayesian Classifier and SVM Classifier), a Semantic Annotator (with an API for remote access, embedding, and integration), a Domain Ontology, Annotated Results (the semantically annotated web contents, as RDF or OWL statements), and front-ends (the user interface with web browser control and a knowledge explorer for ontology navigation).

First, the crawler collects domain-specific web documents as input for the Bayesian classifier and the SVM classifier. After the crawled web documents (Web RawDB) are preprocessed by HTML parsing and morphological analysis, the classifier learns the features of the restaurant instances in the preprocessed documents. Thus, given new web documents, the classifier can extract restaurant domain instances based on the training data. The user interface with web browser control[1] supports functionalities such as ontology import/export, manual annotation editing by the user, and annotation browsing with instance extraction.

[Figure 1: block diagram of SARM showing the automatic instance extraction module (Crawler, Web RawDB, HTML Parser, Morphology Analysis, Bayesian Classifier & SVM Classifier), the Semantic Annotator, the Domain Ontology, the User Interface, and the annotated documents and metadata (Semantic Web contents, RDF or OWL statements).]
Fig. 1. The Architecture of the proposed semantic annotation

2.2 Restaurant Domain Instance Extraction Based on Compound Word Learning
Most restaurant names in Korean are compound words. For example, the restaurant name 강-변-식당 (river-nearby-restaurant) is composed of three words: 강 (river), 변 (nearby), and 식당 (restaurant). Another example is 서울-집 (Seoul-house), which is composed of two words, 서울 (Seoul) and 집 (house). We also found that most Korean restaurant instances are composed of some combination of concepts such as house, location, and dish. Therefore, restaurant instances can be recognized by applying a machine learning technique to the known combinations of concepts in restaurant names. To annotate compound words as restaurant instances, we used an ontology that contains the concepts restaurant, dish, beverage, and food stuff. The restaurant class instantiates a set of instances that represent restaurant names, and the dish class instantiates a set of instances that represent dish names.

Based on observation of the data, multi-word decomposition was performed for each word in the training data. We then build a word vector $V = (w_1, w_2, \ldots, w_n)$ to represent the name of a restaurant instance. Here, $w_i$ is a single noun obtained by decomposition with a Korean noun dictionary. We can apply naïve Bayesian classification using the MAP (maximum a posteriori) decision rule, as in the following equation.

$$\mathrm{classify}(V_k) = \arg\max_{c} \; p(C = c) \prod_{i=1}^{n} p(w_i \mid C = c) \qquad (3)$$

We can also feed the word vectors $V$ into an SVM classifier to find the optimal hyperplane (OHP) that best separates a set of training examples, as in the following equation. The OHP is obtained by minimizing the objective function $O_l$. Here, $V_k \in \mathbb{R}^N$ is the k-th input vector and $y_k \in \{+1, -1\}$ is the corresponding label for $V_k$ in a two-class classification problem. $w$ denotes the vector perpendicular to the OHP (cf. Equation 4).

$$O_l = \frac{1}{2}\, w \cdot w, \quad \text{subject to } y_k (w \cdot V_k + b) - 1 \ge 0, \quad k = 1, \ldots, m \qquad (4)$$

3. EXPERIMENT RESULTS
The proposed method was applied to 467 restaurant-domain web pages crawled[6] from JoyFood.Com^2. Before restaurant instances were extracted to construct the learning data, the 467 web pages went through a series of preprocessing steps: morphological analysis[4] and HTML parsing[5]. We then extracted a set of 1,260 restaurant instances, which were divided into two sets: one for training the classifier and the other for validation. The performance of the proposed method was evaluated by micro-averaged accuracy after conducting 5-fold cross validation on the 1,260 extracted restaurant instances. The training and test sets used in the 5-fold cross validation consist of 1,008 and 252 restaurant instances, respectively.

With the Bayesian classifier, accuracy was 98% when the training set was used for both learning and testing, and decreased to 94% when the test set was used after learning on the training set. With the SVM, accuracy was 96% when the training set was used for both learning and testing, and decreased to 92% when the test set was used after learning on the training set.

4. CONCLUSION AND FUTURE WORK
In this paper, we presented a semantic annotation system with instance extraction for a Korean restaurant ontology. Although some semantic annotation tools exist, few of them offer automatic instance extraction suitable for domain- and language-specific annotation tasks. Our annotation tool, SARM, is expected to help users annotate more effectively in the restaurant domain in the Korean language.

5. ACKNOWLEDGMENTS
This research is supported by the Ministry of Information and Communication, Republic of Korea, National Project (project management of the Institute for Information Technology Advancement).

6. REFERENCES
[1] Microsoft WebBrowser control. http://msdn.microsoft.com/workshop/browser/webbrowser/browser_control_ovw_entry.asp
[2] Borislav Popov, A.K., Angel Kirilov, Dimitar Manov, Damyan Ognyanoff, Miroslav Goranov. KIM - Semantic Annotation Platform. 2nd International Semantic Web Conference (ISWC2003), Vol. 2870. Springer-Verlag, Berlin Heidelberg (2003) 834-849
[3] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt and Fabio Ciravegna. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. The 13th International Conference on Knowledge Engineering and Management (EKAW 2002), ed. Gomez-Perez, A., Springer Verlag, 2002
[4] HAM. http://nlp.kookmin.ac.kr/HAM/kor/index.html
[5] HTML parser. http://htmlparser.sourceforge.net/
[6] WIRE. http://www.cwr.cl/projects/WIRE/index.htm

^1 Corresponding author. Tel: +82-20-740-8796 Email: hgkim@snu.ac.kr (Hong-Gee Kim)
^2 Joyfood.Com. http://www.joyfood.com
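
As an illustration of the MAP decision rule in Equation 3 (Section 2.2), the following is a minimal sketch and not the SARM implementation. It assumes that names have already been decomposed into single nouns, that there are two classes ("restaurant" and a hypothetical "other" class), and that word likelihoods are estimated with add-one (Laplace) smoothing, which the paper does not specify. The training data and the names `train`, `classify`, and `TRAIN` are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect class counts, per-class noun counts and the vocabulary
    from (nouns, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for nouns, label in examples:
        class_counts[label] += 1
        word_counts[label].update(nouns)
        vocab.update(nouns)
    return class_counts, word_counts, vocab


def classify(nouns, model):
    """Return argmax_c p(C=c) * prod_i p(w_i | C=c), computed in log space."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior p(C = c)
        # Assumption: add-one smoothing, with one extra slot for unseen nouns.
        denom = sum(word_counts[c].values()) + len(vocab) + 1
        for w in nouns:
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical toy data: names already decomposed into single nouns.
TRAIN = [
    (["강", "변", "식당"], "restaurant"),
    (["서울", "집"], "restaurant"),
    (["한강", "식당"], "restaurant"),
    (["서울", "날씨"], "other"),
    (["강", "변", "산책"], "other"),
    (["주말", "날씨"], "other"),
]

model = train(TRAIN)
print(classify(["서울", "식당"], model))
```

Summing logarithms instead of multiplying probabilities leaves the arg max unchanged and avoids numerical underflow for longer names. In this toy setup the unseen combination 서울-식당 is assigned to the "restaurant" class because 식당 occurs only in restaurant names.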