Web Information Extraction Systems for Web Semantization⋆

Jan Dedek

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic
Institute of Computer Science, Academy of Sciences of the Czech Republic, Prague, Czech Republic
dedek@ksi.mff.cuni.cz

⋆ This work was partially supported by the Czech projects IS-1ET100300517, GACR-201/09/H057, GAUK 31009 and MSM-0021620838.

Abstract. In this paper we present a survey of web information extraction systems and semantic annotation platforms. The survey concentrates on the problem of employing these tools in the process of web semantization. We compare the surveyed approaches with our own solutions and propose some future directions in the development of the web semantization idea.

1 Introduction

There exist many extraction tools that can process web pages and produce structured, machine-understandable data (or information) that corresponds to the content of a web page. This process is often called Web Information Extraction (WIE). In this paper we present a survey of web information extraction systems and we connect these systems with the problem of web semantization.

The paper is structured as follows. First we sketch the basic ideas of the semantic web and web semantization. In the next two sections, methods of web information extraction are presented. Then a description of our solutions (work in progress) follows. Finally, just before the conclusion, we discuss the connection of WIE systems with the problem of web semantization.

1.1 The Semantic Web in use

The idea of the Semantic Web [4] (a World Wide Web dedicated not only to humans but also to machines – software agents) is very well known today. Let us just briefly demonstrate its use with respect to the idea of Web Semantization (see the next section).

Fig. 1 shows a human user using the (Semantic) Web in three possible ways: with a keyword query, with a semantic query, and through a software agent. The difference between the first two (keyword and semantic query) can be illustrated with the question: "Give me a list of the names of E.U. heads of state."

Fig. 1. The Semantic/Semantized Web in use.

This example, taken from an interesting article by Ian Horrocks [16], shows the big difference between using a semantic query language and using keywords. In the semantic case you should be given exactly the list of names you requested, without having to pore through the results of (probably more than one) keyword query. Of course, the user has to know the syntax of the semantic query language or have a special GUI at hand; such a handy GUI can be found, for example, in the KIM project [20].

The last and most important possibility (in the semantic or semantized setting) is to use a (personalized) software agent specialized in tasks of some kind, such as planning a business trip or finding the optimal choice among all the relevant job offers, flats for rent, cars for sale, etc.

Both semantic querying and the engagement of software agents are impossible to realize without some kind of adaptation of today's web in the semantic direction.

1.2 Web Semantization

The idea of Web Semantization [9] consists in the gradual enrichment of the current web content by an automated process of third-party annotation, making at least a part of today's web more suitable for machine processing and hence enabling intelligent tools for searching and recommending things on the web (see [3]).

The most straightforward idea is to fill a semantic repository with information automatically extracted from the web and make it available to software agents, so that they can access today's web in a semantic manner (e.g. through a semantic search engine).

The idea of a semantic repository and a public service providing semantic annotations was experimentally realized in the well-recognized work of the IBM Almaden Research Center: SemTag [13]. This work demonstrated that automated semantic annotation can be applied on a large scale. In their experiment they annotated about 264 million web pages and generated about 434 million semantic tags. They also provided the annotations as a Semantic Label Bureau – an HTTP server providing annotations for third-party web documents.
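To make the repository idea more concrete, the following sketch stores one extracted annotation in a small RDF repository and retrieves it with a semantic query. It is only an illustration of the principle, written with the rdflib Python library; the namespace, the ProductOffer class and the price/extractedFrom properties are invented for the example and are not taken from SemTag [13], KIM [20] or our own system.

```python
# A minimal sketch of a third-party semantic repository (hypothetical example;
# the namespace, class and property names are illustrative only).
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/semantization#")

repository = Graph()
repository.bind("ex", EX)

# An annotation produced by some extractor for a crawled page:
page = URIRef("http://example.org/some-product-page")
offer = URIRef("http://example.org/offer/123")
repository.add((offer, RDF.type, EX.ProductOffer))
repository.add((offer, RDFS.label, Literal("Used car, Skoda Octavia")))
repository.add((offer, EX.price, Literal(4500)))
repository.add((offer, EX.extractedFrom, page))  # provenance of the 3rd-party annotation

# A software agent can now ask a semantic query instead of a keyword query:
results = repository.query("""
    PREFIX ex: <http://example.org/semantization#>
    SELECT ?offer ?price WHERE {
        ?offer a ex:ProductOffer ;
               ex:price ?price .
        FILTER (?price < 5000)
    }
""")
for row in results:
    print(row.offer, row.price)
```

A Semantic Label Bureau in the sense of [13] would expose such a repository to third parties over HTTP instead of querying it locally.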
2 Web information extraction

The task of a web information extraction system is to transform web pages into program-friendly structures such as a relational database. There exists a rich variety of Web Information Extraction systems. The results generated by distinct tools usually cannot be directly compared, since the addressed extraction tasks differ. The extraction tasks can be distinguished according to several dimensions: the task domain, the degree of automation, the techniques used, etc. These dimensions are analyzed in detail in the recent publications [6] and [18]. Here we concentrate on a slightly more specific division of WIE methods according to the needs of Web Semantization (see Sect. 5). The division is illustrated in Fig. 2 and should not be understood as a disjoint partitioning of the methods but rather as an emphasis on their different aspects; for example, many extraction methods are domain and form specific at the same time.

Fig. 2. Division of extraction methods.

The distinction between generally applicable methods and those that have a meaningful application only in some specific setting (a specific domain, a specific form of input) is very important for Web Semantization: when we try to produce annotations on a large scale, we have to control which web resource is suitable for which processing method (see Sect. 5).

2.1 Generally applicable

The most significant (and probably the only) generally applicable IE task is the so-called Instance Resolution Task. The task can be described as follows: given a general ontology, find all the instances from the ontology that are present in the processed resource. This task is usually realized in two steps: (1) Named Entity Recognition (see Sect. 3.1), and (2) disambiguation of the ontology instances that can be connected with the found named entities. The success of the method can be strongly improved by coreference resolution (see Sect. 3.1).

Let us mention several good representatives of this approach: the SemTag application [13], the KIM project [20] and the PANKOW annotation method [7], based on a clever formulation of Google API queries.
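As an illustration of the two steps, the following sketch uses a toy gazetteer in place of a real named entity recognizer and a naive context-overlap heuristic in place of real disambiguation; the instance URIs and context words are invented for the example, and systems such as SemTag [13] or KIM [20] are of course far more sophisticated.

```python
# Sketch of the Instance Resolution Task (illustrative only; a toy gazetteer
# stands in for a real NER component, and disambiguation is a simple
# context-overlap heuristic over hypothetical ontology instances).
ONTOLOGY_INSTANCES = {
    "Prague": [
        {"uri": "ex:Prague_city", "context": {"city", "czech", "capital"}},
        {"uri": "ex:Prague_Oklahoma", "context": {"town", "oklahoma", "usa"}},
    ],
}

def recognize_entities(text):
    """Step 1: Named Entity Recognition (here: naive gazetteer lookup)."""
    return [name for name in ONTOLOGY_INSTANCES if name in text]

def disambiguate(name, text):
    """Step 2: pick the ontology instance whose context overlaps the text most."""
    words = set(text.lower().split())
    candidates = ONTOLOGY_INSTANCES[name]
    return max(candidates, key=lambda c: len(c["context"] & words))["uri"]

text = "The conference takes place in Prague, the capital of the Czech Republic."
for name in recognize_entities(text):
    print(name, "->", disambiguate(name, text))   # Prague -> ex:Prague_city
```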
2.2 Domain specific

Domain and form specific IE approaches are the typical cases. More specific information is more precise and more complex, and therefore more useful and interesting, but the extraction method has to be trained for each new domain separately, which usually means considerable effort.

A good example of a domain-specific information extraction system is SOBA [5]. This complex system is capable of integrating different IE approaches and extracting information from heterogeneous data resources, including plain text, tables and image captions, but the whole system is focused on the single domain of football. Another similarly complex system is ArtEquAKT [1], which is entirely focused on the domain of art.

2.3 Form specific

Beyond generally applicable extraction methods, there exist many methods that exploit a specific form of the input resource. The linguistic approaches usually process text consisting of natural language sentences. The structure-oriented approaches can be strictly oriented towards tables [19] or exploit repetitions of structural patterns on a web page [21] (such an algorithm is applicable only to pages that contain more than one data record), and there are also approaches that use the structure of a whole site (e.g. the site of a single web shop with summary pages of products connected by links to pages with details about each individual product) [17].

3 Information extraction from text-based resources

In this section we discuss information extraction from textual resources.

3.1 Tasks of information extraction

There are classical tasks of text preprocessing and linguistic analysis (a minimal pipeline example follows the list):

Text Extraction – e.g. from HTML, PDF or DOC,
Tokenization – detection of words, spaces, punctuation, etc.,
Segmentation – sentence and paragraph detection,
POS Tagging – part-of-speech assignment, often including lemmatization and morphological analysis,
Syntactic Analysis (often called linguistic parsing) – assignment of a grammatical structure to a given sentence with respect to a given linguistic formalism (e.g. a formal grammar),
Coreference Resolution (or anaphora resolution) – resolving what a pronoun or a noun phrase refers to; these references often cross the boundaries of a single sentence.
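Several of these steps (segmentation, tokenization and POS tagging) can be chained, for example, with off-the-shelf NLTK components. This is only a sketch for English, assuming the NLTK library and its 'punkt' and 'averaged_perceptron_tagger' data packages; the Czech processing mentioned in Sect. 4.2 relies on a different tool chain.

```python
# A minimal English preprocessing pipeline built from off-the-shelf NLTK components
# (requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data packages).
import nltk

def preprocess(text):
    sentences = nltk.sent_tokenize(text)          # segmentation
    for sentence in sentences:
        tokens = nltk.word_tokenize(sentence)     # tokenization
        tagged = nltk.pos_tag(tokens)             # POS tagging
        yield tagged

text = "Two people were injured. The accident happened on Monday."
for tagged_sentence in preprocess(text):
    print(tagged_sentence)
```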
Besides these classical, generally applicable tasks, there are further well-defined tasks more closely related to information extraction. These tasks are domain dependent; they were widely developed at the MUC-6 conference in 1995 [15] and were considered a semantic evaluation in the first place. These information extraction tasks are:

Named Entity Recognition: This task recognizes and classifies named entities such as persons, locations, date or time expressions, or measuring units. More complex patterns may also be recognized as structured entities, such as addresses.

Template Element Construction: Populates templates describing entities with the extracted roles (or attributes) of one single entity. This task is often performed stepwise, sentence by sentence, which results in a huge set of partially filled templates.

Template Relation Construction: As each template describes information about one single entity, this task identifies semantic relations between entities.

Template Unification: Merges multiple elementary templates that are filled with information about identical entities.

Scenario Template Production: Fits the results of Template Element Construction and Template Relation Construction into templates describing pre-specified event scenarios (pre-specified "queries on the extracted data").

Appelt and Israel [2] wrote an excellent tutorial summarizing these traditional IE tasks and the systems built on them.

3.2 Information extraction benchmarks

In contrast to the WIE methods based on web page structure, for which we (the authors) do not know of any well-established benchmark (probably at least in part because the development of presentation techniques on the web is still very much in progress), the situation in the domain of text-based IE is fairly different. There are several conferences and events concentrated on supporting the automatic machine processing and understanding of human language in text form. They cover different research topics such as text (or information) retrieval (e.g. the Text REtrieval Conference, TREC, http://trec.nist.gov/) and text summarization (e.g. the Document Understanding Conferences, http://duc.nist.gov/).

In the field of information extraction, we have to mention the long tradition of the Message Understanding Conference [15] (briefly summarized at http://en.wikipedia.org/wiki/Message_Understanding_Conference), starting in 1987. In 1999 the Automatic Content Extraction (ACE) Evaluation (http://www.itl.nist.gov/iad/mig/tests/ace/) started; this year (2009) it is becoming a track of the Text Analysis Conference (TAC, http://www.nist.gov/tac).

All these events prepare specialized datasets together with information extraction tasks and play an important role as information extraction benchmarks.

4 Our solutions

4.1 Extraction based on structural similarity

Our first approach to web information extraction uses the structural similarity in web pages that contain a large number of table cells, each with a link to a detail page. Such pages are common in web shops and wherever more than one object (product offer) is presented. Each object is presented in a similar way, and this fact can be exploited.

As the pages of web shops are intended for human use, their creators have to make them easy to comprehend, and several years of web shop design have converged to a more or less similar fashion. There are often summary pages with many products in the form of a table whose cells contain a brief description and a link to a page with details about each particular product.

Our main idea is to use a DOM tree representation of the summary web page and to find similar subtrees by breadth-first search. The similarity of these subtrees is used to determine the data region – the place where all the objects are stored. It is represented as a node in the DOM tree; underneath it are the similar subtrees, which are called data records.

We have developed and implemented this idea [14] on top of the Mozilla Firefox API (thanks go mainly to Dušan Maruščák and Peter Vojtáš) and experimentally tested it on table pages from several domains (cars, notebooks, hotels). The similarity between subtrees was the Levenshtein editing distance (with a subtree considered as a linear string), and the decision thresholds were trained.
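The core of the idea can be sketched in a few lines: serialize each subtree into a linear string of tag names, compare the strings by Levenshtein distance, and search the tree breadth-first for a node whose children are mutually similar. The following toy example (with an invented page fragment and a fixed threshold instead of trained ones) only illustrates the principle and is not the implementation described in [14].

```python
# A simplified sketch of the structural-similarity idea (illustrative only; the
# real system [14] works on the full Firefox DOM and uses trained thresholds).
import xml.etree.ElementTree as ET
from collections import deque

PAGE = """
<table>
  <tr><td><b>Car A</b><a href="a.html">detail</a></td>
      <td><b>Car B</b><a href="b.html">detail</a></td></tr>
  <tr><td><b>Car C</b><a href="c.html">detail</a></td>
      <td><i>advertisement</i></td></tr>
</table>
"""

def serialize(node):
    """Flatten a subtree into a linear string of tag names (preorder)."""
    return node.tag + "".join(serialize(child) for child in node)

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def find_data_region(root, threshold=2):
    """Breadth-first search for a node whose children are mutually similar subtrees."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = list(node)
        if len(children) > 1:
            strings = [serialize(c) for c in children]
            # data records = children close (in edit distance) to the first child
            records = [c for c, s in zip(children, strings)
                       if levenshtein(strings[0], s) <= threshold]
            if len(records) > 1:
                return node, records
        queue.extend(children)
    return None, []

region, records = find_data_region(ET.fromstring(PAGE.strip()))
print("data region:", region.tag, "- data records:", len(records))
```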
4.2 Linguistic information extraction

Our second approach [10, 11, 12] to web information extraction is based on deep linguistic analysis. We have developed a rule-based method for extracting information from text-based web resources in Czech, and we are now working on its adaptation to English. The extraction rules correspond to tree queries on the linguistic (syntactic) trees built from particular sentences. We have experimented with several linguistic tools for Czech, namely the tools for machine annotation from PDT 2.0 and the Czech WordNet.

Our present system captures the text of web pages, annotates it linguistically with the PDT tools, extracts data and stores the data in an ontology. We have made initial experiments in the domain of traffic accident reports. The results showed that this method can, for example, aid the summarization of the number of injured people.

To avoid the need for manual design of extraction rules, we focused on the data extraction phase and made some promising experiments [8] with Inductive Logic Programming, a machine learning procedure, for automated learning of the extraction rules.

This solution is directed towards the extraction of information that is closely connected with the meaning of a text or of a single sentence.
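The flavour of such rules can be illustrated with a toy tree query over a simplified parse. The node structure and English lemmas below are hypothetical; the actual rules operate on PDT 2.0 tectogrammatical trees of Czech sentences and use a much richer formalism.

```python
# A toy illustration of an extraction rule as a query over a syntactic tree
# (hypothetical structures, not the PDT-based rules used by the real system).
from dataclasses import dataclass, field

@dataclass
class Node:
    lemma: str
    pos: str                      # simplified part-of-speech tag
    children: list = field(default_factory=list)

# Simplified parse of the sentence "Two people were injured in the accident."
parse = Node("injure", "VERB", [
    Node("people", "NOUN", [Node("two", "NUM")]),
    Node("in", "ADP", [Node("accident", "NOUN")]),
])

def match_injured_count(node):
    """Tree query: a verb 'injure' governing a noun that has a numeral child."""
    if node.pos == "VERB" and node.lemma == "injure":
        for argument in node.children:
            if argument.pos == "NOUN":
                for modifier in argument.children:
                    if modifier.pos == "NUM":
                        yield {"injured": argument.lemma, "count": modifier.lemma}
    for child in node.children:
        yield from match_injured_count(child)

print(list(match_injured_count(parse)))   # [{'injured': 'people', 'count': 'two'}]
```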
5 The Web Semantization setting

In this section we discuss possibilities and obstacles connected with the employment of web information extraction systems in the process of web semantization.

One aspect of realizing the web semantization idea is the problem of integrating all the components and technologies: starting with web crawling, going through numerous complex analyses (document preprocessing, document classification, different extraction procedures), output data integration and indexing, and finally the implementation of a query and presentation interface. This elaborate task is neither easy nor simple, but today it is solved in all the extensive projects and systems mentioned above.

The novelty that web semantization brings into play is the cross-domain aspect. If we do not want to stay with just general ontologies and generally applicable extraction methods, then we need a methodology for dealing with different domains. The system has to support extension to a new domain in a generic way, so we need a methodology and software to support this action. This can, for example, mean adding a new ontology for the new domain and selecting and training proper extractors and classifiers for the suitable input pages.

5.1 User initiative and effort

An interesting point is the question: whose effort will be used in the process of supporting a new domain in the web semantization process, and how skilled does such a user have to be? There are two possibilities (demonstrated in Fig. 3). The easier one is to employ a very experienced expert who decides about the new domain and also realizes the support needed for it. In Fig. 3 this situation is labeled Provider Initiated and Provider Trained, because the expert works on the side of the system that provides the semantics.

References

17. K. Lerman, L. Getoor, S. Minton, and C. Knoblock: Using the structure of web sites for automatic segmentation of tables. In SIGMOD '04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, ACM, 2004, 119–130.
18. B. Liu: Web Data Mining. Springer-Verlag, 2007.
19. D. Pinto, A. McCallum, X. Wei, and W.B. Croft: Table extraction using conditional random fields. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM Press, 2003, 235–242.
20. B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov: KIM – a semantic platform for information extraction and retrieval. Natural Language Engineering, 10(3–4), 2004, 375–392.
21. H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu: Fully automatic wrapper generation for search engines. In WWW Conference, 2005, 66–75.