Web Information Extraction Systems for Web Semantization⋆

Jan Dedek

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic
Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague, Czech Republic
dedek@ksi.mff.cuni.cz

⋆ This work was partially supported by the Czech projects IS-1ET100300517, GACR-201/09/H057, GAUK 31009 and MSM-0021620838.

Abstract. In this paper we present a survey of web information extraction systems and semantic annotation platforms. The survey concentrates on the problem of employing these tools in the process of web semantization. We compare the surveyed approaches with our own solutions and propose some future directions in the development of the web semantization idea.

1 Introduction

There exist many extraction tools that can process web pages and produce structured, machine-understandable data (or information) that corresponds to the content of a web page. This process is often called Web Information Extraction (WIE). In this paper we present a survey of web information extraction systems and we connect these systems with the problem of web semantization.

The paper is structured as follows. First we sketch the basic ideas of the semantic web and web semantization. In the next two sections, methods of web information extraction are presented. Then a description of our solutions (work in progress) follows. Finally, just before the conclusion, we discuss the connection of WIE systems with the problem of web semantization.

1.1 The Semantic Web in use

The idea of the Semantic Web [4] (a World Wide Web dedicated not only to humans but also to machines – software agents) is very well known today. Let us just briefly demonstrate its use with respect to the idea of Web Semantization (see the next section).

Fig. 1 shows a human user using the (Semantic) Web in three possible ways: with a keyword query, with a semantic query, and through a software agent. The difference between the first two (keyword and semantic query) can be illustrated with the question: "Give me a list of the names of E.U. heads of state."

Fig. 1. The Semantic/Semantized Web in use.

This example, taken from an interesting article by Ian Horrocks [16], shows the big difference between using a semantic query language and using keywords. In the semantic case you should be given exactly the list of names you requested, without having to pore through the results of (probably more than one) keyword query. Of course, the user has to know the syntax of the semantic query language or have a special GUI at hand; such a handy GUI can be found, for example, in the KIM project [20].

The last and most important possibility (in the semantic or semantized setting) is to use a (personalized) software agent specialized in tasks of some kind, such as planning a business trip or finding the optimal choice among all the relevant job offers, flats for rent, cars for sale, etc.

Both semantic querying and the engagement of software agents are impossible to realize without some kind of adaptation of today's web in the semantic direction.

1.2 Web Semantization

The idea of Web Semantization [9] consists in the gradual enrichment of the current web content by an automated process of third-party annotation, making at least a part of today's web more suitable for machine processing and hence enabling intelligent tools for searching and recommending things on the web (see [3]).

The most straightforward idea is to fill a semantic repository with information automatically extracted from the web and make it available to software agents, so that they can access today's web in a semantic manner (e.g. through a semantic search engine).

The idea of a semantic repository and a public service providing semantic annotations was experimentally realized in the well-recognized work of the IBM Almaden Research Center: SemTag [13]. This work demonstrated that automated semantic annotation can be applied on a large scale. In their experiment they annotated about 264 million web pages and generated about 434 million semantic tags. They also provided the annotations as a Semantic Label Bureau – an HTTP server providing annotations for third-party web documents.
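To make the repository idea more concrete, the following sketch stores one extracted annotation in a small RDF repository and retrieves it with a semantic query. It is only an illustration of the principle, written with the rdflib Python library; the namespace, the ProductOffer class and the price/extractedFrom properties are invented for the example and are not taken from SemTag [13], KIM [20] or our own system.

```python
# A minimal sketch of a third-party semantic repository (hypothetical example;
# the namespace, class and property names are illustrative only).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/semantization#")

repository = Graph()
repository.bind("ex", EX)

# An annotation produced by some extractor for a crawled page:
page = URIRef("http://example.org/some-product-page")
offer = URIRef("http://example.org/offer/123")
repository.add((offer, RDF.type, EX.ProductOffer))
repository.add((offer, RDFS.label, Literal("Used car, Skoda Octavia")))
repository.add((offer, EX.price, Literal(4500)))
repository.add((offer, EX.extractedFrom, page))  # provenance of the 3rd-party annotation

# A software agent can now ask a semantic query instead of a keyword query:
results = repository.query("""
    PREFIX ex: <http://example.org/semantization#>
    SELECT ?offer ?price WHERE {
        ?offer a ex:ProductOffer ;
               ex:price ?price .
        FILTER (?price < 5000)
    }
""")
for row in results:
    print(row.offer, row.price)
```

A Semantic Label Bureau in the sense of [13] would expose such a repository to third parties over HTTP instead of querying it locally.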
2 Web information extraction

The task of a web information extraction system is to transform web pages into program-friendly structures such as a relational database. There exists a rich variety of Web Information Extraction systems. The results generated by distinct tools usually cannot be directly compared, since the addressed extraction tasks differ. The extraction tasks can be distinguished according to several dimensions: the task domain, the degree of automation, the techniques used, etc. These dimensions are analyzed in detail in the recent publications [6] and [18]. Here we concentrate on a slightly more specific division of WIE methods according to the needs of Web Semantization (see Sect. 5). The division is illustrated in Fig. 2 and should not be understood as a disjoint partitioning of the methods but rather as an emphasis on their different aspects; for example, many extraction methods are domain and form specific at the same time.

Fig. 2. Division of extraction methods.

The distinction between generally applicable methods and those that have a meaningful application only in some specific setting (a specific domain, a specific form of input) is very important for Web Semantization: when we try to produce annotations on a large scale, we have to control which web resource is suitable for which processing method (see Sect. 5).

2.1 Generally applicable

The most significant (and probably the only) generally applicable IE task is the so-called Instance Resolution Task. The task can be described as follows: given a general ontology, find all the instances from the ontology that are present in the processed resource. This task is usually realized in two steps: (1) Named Entity Recognition (see Sect. 3.1), and (2) disambiguation of the ontology instances that can be connected with the found named entities. The success of the method can be strongly improved by coreference resolution (see Sect. 3.1).

Let us mention several good representatives of this approach: the SemTag application [13], the KIM project [20] and the PANKOW annotation method [7], based on a clever formulation of Google API queries.
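As an illustration of the two steps, the following sketch uses a toy gazetteer in place of a real named entity recognizer and a naive context-overlap heuristic in place of real disambiguation; the instance URIs and context words are invented for the example, and systems such as SemTag [13] or KIM [20] are of course far more sophisticated.

```python
# Sketch of the Instance Resolution Task (illustrative only; a toy gazetteer
# stands in for a real NER component, and disambiguation is a simple
# context-overlap heuristic over hypothetical ontology instances).
ONTOLOGY_INSTANCES = {
    "Prague": [
        {"uri": "ex:Prague_city", "context": {"city", "czech", "capital"}},
        {"uri": "ex:Prague_Oklahoma", "context": {"town", "oklahoma", "usa"}},
    ],
}

def recognize_entities(text):
    """Step 1: Named Entity Recognition (here: naive gazetteer lookup)."""
    return [name for name in ONTOLOGY_INSTANCES if name in text]

def disambiguate(name, text):
    """Step 2: pick the ontology instance whose context overlaps the text most."""
    words = set(text.lower().split())
    candidates = ONTOLOGY_INSTANCES[name]
    return max(candidates, key=lambda c: len(c["context"] & words))["uri"]

text = "The conference takes place in Prague, the capital of the Czech Republic."
for name in recognize_entities(text):
    print(name, "->", disambiguate(name, text))   # Prague -> ex:Prague_city
```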
2.2 Domain specific

Domain and form specific IE approaches are the typical cases. More specific information is more precise and more complex, and therefore more useful and interesting, but the extraction method has to be trained for each new domain separately, which usually means considerable effort.

A good example of a domain-specific information extraction system is SOBA [5]. This complex system is capable of integrating different IE approaches and extracting information from heterogeneous data resources, including plain text, tables and image captions, but the whole system is focused on the single domain of football. Another similarly complex system is ArtEquAKT [1], which is entirely focused on the domain of art.

2.3 Form specific

Beyond generally applicable extraction methods, there exist many methods that exploit a specific form of the input resource. The linguistic approaches usually process text consisting of natural language sentences. The structure-oriented approaches can be strictly oriented towards tables [19] or exploit repetitions of structural patterns on a web page [21] (such an algorithm is applicable only to pages that contain more than one data record), and there are also approaches that use the structure of a whole site (e.g. the site of a single web shop with summary pages of products connected by links to pages with details about each individual product) [17].

3 Information extraction from text-based resources

In this section we discuss information extraction from textual resources.

3.1 Tasks of information extraction

There are classical tasks of text preprocessing and linguistic analysis (a minimal pipeline example follows the list):

Text Extraction – e.g. from HTML, PDF or DOC,
Tokenization – detection of words, spaces, punctuation, etc.,
Segmentation – sentence and paragraph detection,
POS Tagging – part-of-speech assignment, often including lemmatization and morphological analysis,
Syntactic Analysis (often called linguistic parsing) – assignment of a grammatical structure to a given sentence with respect to a given linguistic formalism (e.g. a formal grammar),
Coreference Resolution (or anaphora resolution) – resolving what a pronoun or a noun phrase refers to; these references often cross the boundaries of a single sentence.
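Several of these steps (segmentation, tokenization and POS tagging) can be chained, for example, with off-the-shelf NLTK components. This is only a sketch for English, assuming the NLTK library and its 'punkt' and 'averaged_perceptron_tagger' data packages; the Czech processing mentioned in Sect. 4.2 relies on a different tool chain.

```python
# A minimal English preprocessing pipeline built from off-the-shelf NLTK components
# (requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages).
import nltk

def preprocess(text):
    sentences = nltk.sent_tokenize(text)          # segmentation
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence)     # tokenization
        tagged = nltk.pos_tag(tokens)             # POS tagging
        yield tagged

text = "Two people were injured. The accident happened on Monday."
for tagged_sentence in preprocess(text):
    print(tagged_sentence)
```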
Besides these classical, generally applicable tasks, there are further well-defined tasks more closely related to information extraction. These tasks are domain dependent; they were widely developed at the MUC-6 conference in 1995 [15] and were considered a semantic evaluation in the first place. These information extraction tasks are:

Named Entity Recognition: This task recognizes and classifies named entities such as persons, locations, date or time expressions, or measuring units. More complex patterns may also be recognized as structured entities, such as addresses.

Template Element Construction: Populates templates describing entities with the extracted roles (or attributes) of one single entity. This task is often performed stepwise, sentence by sentence, which results in a huge set of partially filled templates.

Template Relation Construction: As each template describes information about one single entity, this task identifies semantic relations between entities.

Template Unification: Merges multiple elementary templates that are filled with information about identical entities.

Scenario Template Production: Fits the results of Template Element Construction and Template Relation Construction into templates describing pre-specified event scenarios (pre-specified "queries on the extracted data").

Appelt and Israel [2] wrote an excellent tutorial summarizing these traditional IE tasks and the systems built on them.

3.2 Information extraction benchmarks

In contrast to the WIE methods based on web page structure, for which we (the authors) do not know of any well-established benchmark (probably at least in part because the development of presentation techniques on the web is still very much in progress), the situation in the domain of text-based IE is fairly different. There are several conferences and events concentrated on supporting the automatic machine processing and understanding of human language in text form. They cover different research topics such as text (or information) retrieval (e.g. the Text REtrieval Conference, TREC, http://trec.nist.gov/) and text summarization (e.g. the Document Understanding Conferences, http://duc.nist.gov/).

In the field of information extraction, we have to mention the long tradition of the Message Understanding Conference [15] (briefly summarized at http://en.wikipedia.org/wiki/Message_Understanding_Conference), starting in 1987. In 1999 the Automatic Content Extraction (ACE) Evaluation (http://www.itl.nist.gov/iad/mig/tests/ace/) started; this year (2009) it is becoming a track of the Text Analysis Conference (TAC, http://www.nist.gov/tac).

All these events prepare specialized datasets together with information extraction tasks and play an important role as information extraction benchmarks.

4 Our solutions

4.1 Extraction based on structural similarity

Our first approach to web information extraction uses the structural similarity in web pages that contain a large number of table cells, each with a link to a detail page. Such pages are common in web shops and wherever more than one object (product offer) is presented. Each object is presented in a similar way, and this fact can be exploited.

As the pages of web shops are intended for human use, their creators have to make them easy to comprehend, and several years of web shop design have converged to a more or less similar fashion. There are often summary pages with many products in the form of a table whose cells contain a brief description and a link to a page with details about each particular product.

Our main idea is to use a DOM tree representation of the summary web page and to find similar subtrees by breadth-first search. The similarity of these subtrees is used to determine the data region – the place where all the objects are stored. It is represented as a node in the DOM tree; underneath it are the similar subtrees, which are called data records.

We have developed and implemented this idea [14] on top of the Mozilla Firefox API (thanks go mainly to Dušan Maruščák and Peter Vojtáš) and experimentally tested it on table pages from several domains (cars, notebooks, hotels). The similarity between subtrees was the Levenshtein editing distance (with a subtree considered as a linear string), and the decision thresholds were trained.
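The core of the idea can be sketched in a few lines: serialize each subtree into a linear string of tag names, compare the strings by Levenshtein distance, and search the tree breadth-first for a node whose children are mutually similar. The following toy example (with an invented page fragment and a fixed threshold instead of trained ones) only illustrates the principle and is not the implementation described in [14].

```python
# A simplified sketch of the structural-similarity idea (illustrative only; the
# real system [14] works on the full Firefox DOM and uses trained thresholds).
import xml.etree.ElementTree as ET
from collections import deque

PAGE = """
<table>
  <tr><td><b>Car A</b><a href="a.html">detail</a></td>
      <td><b>Car B</b><a href="b.html">detail</a></td></tr>
  <tr><td><b>Car C</b><a href="c.html">detail</a></td>
      <td><i>advertisement</i></td></tr>
</table>
"""

def serialize(node):
    """Flatten a subtree into a linear string of tag names (preorder)."""
    return node.tag + "".join(serialize(child) for child in node)

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def find_data_region(root, threshold=2):
    """Breadth-first search for a node whose children are mutually similar subtrees."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = list(node)
        if len(children) > 1:
            strings = [serialize(c) for c in children]
            # data records = children close (in edit distance) to the first child
            records = [c for c, s in zip(children, strings)
                       if levenshtein(strings[0], s) <= threshold]
            if len(records) > 1:
                return node, records
        queue.extend(children)
    return None, []

region, records = find_data_region(ET.fromstring(PAGE.strip()))
print("data region:", region.tag, "- data records:", len(records))
```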
4.2 Linguistic information extraction

Our second approach [10, 11, 12] to web information extraction is based on deep linguistic analysis. We have developed a rule-based method for extracting information from text-based web resources in Czech, and we are now working on its adaptation to English. The extraction rules correspond to tree queries on the linguistic (syntactic) trees built from particular sentences. We have experimented with several linguistic tools for Czech, namely the tools for machine annotation from PDT 2.0 and the Czech WordNet.

Our present system captures the text of web pages, annotates it linguistically with the PDT tools, extracts data and stores the data in an ontology. We have made initial experiments in the domain of traffic accident reports. The results showed that this method can, for example, aid the summarization of the number of injured people.

To avoid the need for manual design of extraction rules, we focused on the data extraction phase and made some promising experiments [8] with Inductive Logic Programming, a machine learning procedure, for automated learning of the extraction rules.

This solution is directed towards the extraction of information that is closely connected with the meaning of a text or of a single sentence.
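The flavour of such rules can be illustrated with a toy tree query over a simplified parse. The node structure and English lemmas below are hypothetical; the actual rules operate on PDT 2.0 tectogrammatical trees of Czech sentences and use a much richer formalism.

```python
# A toy illustration of an extraction rule as a query over a syntactic tree
# (hypothetical structures, not the PDT-based rules used by the real system).
from dataclasses import dataclass, field

@dataclass
class Node:
    lemma: str
    pos: str                      # simplified part-of-speech tag
    children: list = field(default_factory=list)

# Simplified parse of the sentence "Two people were injured in the accident."
parse = Node("injure", "VERB", [
    Node("people", "NOUN", [Node("two", "NUM")]),
    Node("in", "ADP", [Node("accident", "NOUN")]),
])

def match_injured_count(node):
    """Tree query: a verb 'injure' governing a noun that has a numeral child."""
    if node.pos == "VERB" and node.lemma == "injure":
        for argument in node.children:
            if argument.pos == "NOUN":
                for modifier in argument.children:
                    if modifier.pos == "NUM":
                        yield {"injured": argument.lemma, "count": modifier.lemma}
    for child in node.children:
        yield from match_injured_count(child)

print(list(match_injured_count(parse)))   # [{'injured': 'people', 'count': 'two'}]
```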
5 The Web Semantization setting

In this section we discuss possibilities and obstacles connected with the employment of web information extraction systems in the process of web semantization.

One aspect of realizing the web semantization idea is the problem of integrating all the components and technologies: starting with web crawling, going through numerous complex analyses (document preprocessing, document classification, different extraction procedures), output data integration and indexing, and finally the implementation of a query and presentation interface. This elaborate task is neither easy nor simple, but today it is solved in all the extensive projects and systems mentioned above.

The novelty that web semantization brings into play is the cross-domain aspect. If we do not want to stay with just general ontologies and generally applicable extraction methods, then we need a methodology for dealing with different domains. The system has to support extension to a new domain in a generic way, so we need a methodology and software to support this action. This can, for example, mean adding a new ontology for the new domain and selecting and training proper extractors and classifiers for the suitable input pages.

5.1 User initiative and effort

An interesting point is the question: whose effort will be used in the process of supporting a new domain in the web semantization process, and how skilled does such a user have to be? There are two possibilities (demonstrated in Fig. 3). The easier one is to employ a very experienced expert who decides about the new domain and also realizes the support needed for it. In Fig. 3 this situation is labeled Provider Initiated and Provider Trained, because the expert works on the side of the system that provides the semantics.

References

17. K. Lerman, L. Getoor, S. Minton, and C. Knoblock: Using the structure of web sites for automatic segmentation of tables. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, ACM, 2004, 119–130.
18. B. Liu: Web Data Mining. Springer-Verlag, 2007.
19. D. Pinto, A. McCallum, X. Wei, and W.B. Croft: Table extraction using conditional random fields. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM Press, 2003, 235–242.
20. B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov: KIM – a semantic platform for information extraction and retrieval. Natural Language Engineering, 10(3–4), 2004, 375–392.
21. H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu: Fully automatic wrapper generation for search engines. In WWW Conference, 2005, 66–75.