A Navigational and Structural Approach for Extracting Contents from Web Portals

Débora A. Corrêa¹, Ana Maria de C. Moura², Maria Claudia Cavalcanti¹

¹ Department of Computer Engineering, Military Institute of Engineering (IME), Praça General Tibúrcio 80, Praia Vermelha, Urca, Rio de Janeiro, RJ, Brazil

² Extreme Data Lab (DEXL Lab), National Laboratory of Scientific Computing (LNCC), Petrópolis, RJ, Brazil

deboradac@gmail.com, yoko@ime.eb.br, anamoura@lncc.br

Abstract. In a semantic Web portal, contents are described and organized according to domain ontologies, and are usually extracted from traditional portals. However, with the increasing amount of information generated each day on the Web, updating semantic portals remains a major challenge, since this task lacks mechanisms to extract and integrate information dynamically. This paper proposes a strategy to help promote interoperability between portals. It consists of extracting contents from different Web sites on a specific domain in order to instantiate a domain ontology, which is then used to update and/or populate a semantic portal. The extraction is carried out by analyzing the navigational and structural characteristics of traditional portals endowed with some semantic potentiality. To evaluate this strategy, a tool named NECOW was implemented. NECOW's performance was compared with the Google advanced search mode and showed promising results.

1. Introduction

Due to the explosive growth, popularity and heterogeneity of the Web, current traditional portals have difficulty maintaining their pages. They are still very limited in exchanging, reusing and integrating contents of other portals, and they rarely offer efficient strategies for information extraction and metadata maintenance.
More recently, many efforts have been devoted to the area of information extraction (IE), whose main goal is to produce structured data from Web pages so that the data are ready for post-processing. Semantic Portals (SP) arose as an evolution of traditional portals [Brickley et al. 2002][Lausen et al. 2005][Mäkelä et al. 2004], in an attempt to provide an informational infrastructure with semantic meaning. They are characterized by the use of ontologies, which give more semantic expressiveness to their informational contents. This improves tasks performed over those contents, such as search, organization and classification, sharing, publishing and inference. Hence, besides the technologies usually used to build traditional portals, they additionally use ontological languages (RDF¹ and OWL²) to better organize their structure and to give semantic meaning to the information in the portal pages [Reynolds et al. 2004].

Despite the advantages of SPs, additional techniques are required to ensure that these portals can be automatically populated and updated, since many of them still depend on manual update mechanisms. Automatically updating an SP with contents from traditional portals (sites) depends strongly on Web IE techniques. Due to the heterogeneity and lack of structure of traditional Web portals, access to this huge collection of information is still a challenge and has been limited to browsing and searching.

Consider, for example, an SP on the education domain that provides information about academic institutions and their courses. When a student wants to collect information about courses from different institutions in Rio de Janeiro, such as UFRJ³ and IME⁴, she or he usually has to navigate through their respective portals.
To access the UFRJ courses, it is necessary to navigate through a list of Web pages structured in a completely different way from those of the IME portal. This scenario illustrates how difficult it is to extract information from such portals and, consequently, how hard it is to exchange information among them and keep a semantic portal up to date.

Some works in the literature have been developed in this direction. Mäkelä et al. (2004) use the idea of multi-facets to improve the search mechanism in SPs, supported by ontological reasoning capabilities. In [Lachtim et al. 2009], a light ontology on the educational domain is used as the basis for integrating information and for developing and populating semantic portals. Although these works aim at enriching portals and at providing contents with more semantic meaning, they do not contemplate automatically capturing information from other traditional portals (or sites) available on the open Web. In the latter work, an architecture was proposed to retrieve information from semantic Web sites based on domain ontologies, which is then used to integrate contents collected from different SPs. The present work extends this idea by focusing on extracting information from traditional portals on a specific domain. This information is transformed into structured data and used to instantiate a domain ontology, which serves as the main basis to automatically instantiate an SP on a specific domain, contributing to its maintenance [Corrêa 2012].

This paper proposes a strategy to deal with the interoperability between portals, also considering the possibility of automatically instantiating an SP. The strategy is based on the instances found during the navigational and structural analysis of Web portals. To achieve these goals, we assume that the portals we deal with have some semantic potentiality.
This term is used here to refer to traditional portals whose contents are organized according to a hierarchical structure, helping users navigate through the subject categories of their interest. These portals, which we claim to be potentially semantic, use a somewhat controlled vocabulary, and its terms typically appear as links and menu items throughout the portal. Examples of such portals are DMOZ⁵, Wikipédia⁶, and IME⁷.

The main contributions of this paper are: (a) the specification of a navigational strategy to facilitate the identification of new instances to feed an SP; and (b) the evaluation of the proposed strategy. To the best of our knowledge, this is the first work in the ontology-based IE field that follows a navigational strategy.

The remainder of this paper is structured as follows. Section 2 briefly describes some essential concepts used throughout the paper. Section 3 presents related work. Section 4 describes our navigational strategy for automatically extracting contents from portals and populating an ontology, with a brief description of its main functionalities. Section 5 presents NECOW, an extraction tool developed according to the proposed strategy, with an example that demonstrates its usage. Section 6 is dedicated to the evaluation of the tool, and Section 7 concludes the paper with suggestions for future work.

¹ http://www.w3.org/RDF/
² http://www.w3.org/TR/owl-features/
³ Universidade Federal do Rio de Janeiro - www.ufrj.br
⁴ Instituto Militar de Engenharia - www.ime.eb.br
⁵ http://www.dmoz.org/
⁶ http://pt.wikipedia.org/
⁷ http://www.ime.eb.br/

2. Extracting Information from Web Portals with Semantic Potentiality

Some traditional portals do organize their contents according to a hierarchical structure, helping users navigate through the subject categories of their interest.
In this work, we develop a navigational strategy that extracts information based on the structures found in portals that present some semantic potentiality. We define such a portal as one that has at least one of the following characteristics: (i) it has links, lists or tables and benefits from some kind of organization and hierarchy in its structure; and/or (ii) some of its pages, although not all of them, are presented as a taxonomy. While DMOZ and some academic portals, such as those of IME and UFRJ, fall into this category, others such as DBLife⁸, DBPedia⁹ and FreeBase¹⁰ are considered more comprehensive collaborative portals, since they provide a wide set of services that help disseminate and share information.

Semantic portals make use of semantic Web technologies to improve important functionalities of a portal, such as search and organization. Among these technologies, ontologies are considered the most significant, since they enable a common understanding and sharing of a domain between humans, agents and applications. Ontologies are also crucial to organize SPs, grouping sites and documents into pre-defined sets according to their contents.

Due to the great heterogeneity of structures embedded in Web pages, extracting relevant data from them is still a challenge. IE is a classic text mining technique whose goal is to find specific information in texts by identifying information contained in non-structured information sources. This information should conform to a predefined semantics, so that it can later be stored and/or manipulated by several other sources. Three important kinds of IE techniques are identified in the literature [Silva A.S. 2012]: (i) wrappers; (ii) those based on Natural Language Processing (NLP); and (iii) those based on the Deep Web (DW). Wrappers aim at extracting information from structured or semi-structured data (such as HTML), based on its format, delimiters, typography and word frequency.
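The wrapper idea described above can be illustrated with a minimal Python sketch that relies only on tag delimiters: it collects the text that appears between link, list-item and table-cell tags of an HTML page. It is not taken from any of the cited tools. The class name `TagLabelWrapper`, the function `extract_labels`, the set `LABEL_TAGS` and the sample HTML are assumptions made for this example, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Tags whose enclosed text is treated as a candidate label. This choice
# (links, list items, table cells) is an assumption made for this sketch.
LABEL_TAGS = {"a", "li", "td", "th"}


class TagLabelWrapper(HTMLParser):
    """Hypothetical wrapper: collects text enclosed by the tags in LABEL_TAGS."""

    def __init__(self):
        super().__init__()
        self.labels = []    # (tag, text) pairs in document order
        self._stack = []    # label tags that are currently open
        self._buffers = []  # text fragments collected for each open label tag

    def handle_starttag(self, tag, attrs):
        if tag in LABEL_TAGS:
            self._stack.append(tag)
            self._buffers.append([])

    def handle_data(self, data):
        # Text is attributed to the innermost open label tag only.
        if self._buffers:
            self._buffers[-1].append(data)

    def handle_endtag(self, tag):
        # Only explicitly closed, properly nested tags are handled.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            text = " ".join("".join(self._buffers.pop()).split())
            if text:
                self.labels.append((tag, text))


def extract_labels(html):
    """Return the (tag, text) labels found in an HTML string."""
    parser = TagLabelWrapper()
    parser.feed(html)
    parser.close()
    return parser.labels


if __name__ == "__main__":
    sample = """
    <ul>
      <li><a href="/courses/ce">Computer Engineering</a></li>
      <li>Open day</li>
    </ul>
    <table><tr><td>Systems Engineering</td></tr></table>
    """
    for tag, text in extract_labels(sample):
        print(tag, text)
```

In this sketch, text nested inside a link within a list item is reported once, for the link, because it is attributed to the innermost tag. List items or cells whose end tags are omitted in the HTML are not reported, which a more robust wrapper would need to handle.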
NLP-based techniques aim at extracting information directly from unstructured texts and depend on natural language pre-processing, as in Ondux [Cortez et al. 2010] and JUDIE [Cortez et al. 2011]. Finally, DW-based techniques aim at extracting information from forms and/or hidden tables that are not visible to the user, as in DeepPeep [Barbosa and Freire 2005] and DeepBot [Álvarez et al. 2007].

Wimalasuriya and Dou (2010) wrote an interesting overview of ontology-based IE technologies, also exploring some related tools. However, to the best of our knowledge, none of the works discussed there used a Web navigational strategy or focused on the maintenance of semantic portals.

⁸ http://dblife.cs.wisc.edu
⁹ http://dbpedia.org
¹⁰ http://www.freebase.com

3. Related Work

A challenging research topic for the Web research community is the interoperability between portals and their automatic instantiation. The literature points to some works that use semantic Web technologies to explore this topic, although in different contexts. Lachtim et al. (2009a, 2009b) created an educational semantic portal, which is populated and integrated with contents extracted from semantic Web pages within the same domain. In [Suominen et al. 2009], metadata and documents are obtained from contents published in Content Management Systems or from those manually annotated with the metadata editor SAHA [Kurki and Hyvönen 2010]. These metadata are later submitted to an ontology to be validated and published in a semantic portal. The portal presented in [Hyvönen et al. 2009] creates its contents by making use of a set of metadata schemas and some specific tools. This population process enables producing and extracting contents from museums, libraries, archives and other organizations, besides getting information from citizens as individuals and from national and international Web sources.
Compared with these works, the main distinguishing feature of our approach is that the semantic portal is updated with contents hosted in sites and/or Web portals with some semantic potentiality, considering only their presentation and navigational structure, such as links, lists and tables. Hence, the update task in these portals allows their pages to be transformed from simple user pages into ones that are able to integrate and instantiate contents based on domain ontologies.

4. An Approach for Navigating and Extracting Information

This work extends the architecture proposed by Lachtim et al. (2009a). That architecture aimed at creating a semantic portal by integrating and instantiating a domain ontology, which supported an SP, with contents extracted from semantic Web pages within the same domain. However, it did not consider contents extracted from traditional portals or sites on the open Web. This work proposes a strategy to fill this gap, as described in this section.

Figure 1 presents an overview of the proposed strategy. The main idea is to navigate through a list of sites with some semantic potentiality on a specific domain. The navigation is guided by a subset (cropping) of a domain ontology (OB), which is represented in OWL. For each site in the list, useful¹¹ information is extracted to enrich the ontology, i.e., new potential instances of the OB classes, as well as new potential relationships between them (instances of OB object properties), are identified. The user must validate this new information in order to remove any false positives. All this information is then transformed into RDF triples, which compose a new version of the OB ontology, here called OB´. The OB´ ontology may be used as input for the alignment with the information that already exists in the current semantic portal. The main component of this architecture concerns IE, which gives the required support for populating portals. This component, as illustrated in Figure 1, is composed of modules whose characteristics and functionalities are described next.

¹¹ In the context of this paper, useful means any kind of information that is pertinent to the current domain.

Figure 1. Navigation and Information Extraction Module.

Figure 2. Fragment of the OBEDU ontology (OB).

i. OB cropping: this step loads a list of classes, instances and properties of the OB domain ontology, which is the basis for searching for useful instances in each page visited in a portal with semantic potentiality. The relationships between OB classes are also considered, since they guide the navigation through the portal pages. The navigation always starts from the most general class, defined by the user, and proceeds to the more specific ones. Additionally, the real name of each class, its label and its equivalent classes are very important for the navigation between the pages of the portal (see step iii). The instances of each class, as well as their equivalent instances (defined by the property same as), are also considered;

ii. Pre-categorization and identification of the initial page: a pre-categorization based on the page title is performed to limit the navigation defined in step (i). If the title of the initial page contains a name similar to an instance of any OB class, the navigation starts from the next class of the OB. Otherwise, the navigation starts from the first OB class;

iii. Navigation: this module is responsible for navigating through the pages of each site previously defined by the user (stored in a configuration file). Its main goal is to search for links, within tables or menu lists, through which OB´ classes can be identified and the corresponding new instances can be retrieved. It is composed of four sub-modules:

A. Class retrieval: once the navigation starts, the system searches for links and labels that are similar to the desired OB´ classes defined in step (i). These links are given priority for navigation. Whenever a similar link is identified, the system verifies whether it has already been visited. If so, it moves on to the next link; otherwise, the link is visited and its instances are retrieved as described in step B;

B. Instance retrieval: for each link identified in step A as similar to an OB´ class, the corresponding target page is traversed in order to identify potential instances of that class. These instances should appear between tags denoting links, lists and/or table items. Additionally, their labels should have some similarity with the existing instances of the corresponding OB´ class. During the navigation, all the extracted information is saved for later validation by the user (step iv);

C. Hierarchy analysis: hierarchies should be verified in order to avoid duplicated instantiation in the OB´. This duplication typically occurs, for example, with a class and its subclasses. Since an instance can instantiate both a class and its superclasses, the most specific class is chosen;

D. Relationship retrieval: associations between instances should be in accordance with the existing OB relationships. For example, in the ontology shown in Figure 2, the instances of “Education_Program” are associated with those of “Academic_Research_Institution” through the property “provided_By_Program”. Such associations are also identified among the new set of instances, and later transformed into RDF triples (step v);

iv. Validation: this module allows the user to validate all the information extracted by the system during navigation. Information considered invalid is also saved, in order to be used later in a pre-validation process.
This information can be compared with the information retrieved during a later navigation;

v. RDF transformation: this module converts the valid information into RDF triples, which are included in the new ontology, the OB´. In practice, the OB´ corresponds to an empty crop of the OB, which is updated with the new instances extracted during navigation.

OB´ triples can be submitted to an ontology alignment process with the OB ontology, and its instances will then be used to populate a semantic portal that has the OB as its domain ontology. This alignment step is outside the scope of this paper.

5. NECOW: a Prototype Tool

This section describes the prototype tool, named NECOW (Navigation and Extraction of COntents on the Web), developed to evaluate and test the strategy proposed in this work. It is a Web friendly tool developed in Java 1.6 and supported by some libraries (Jena¹², Jericho HTML parser¹³, etc.). Navigation in NECOW starts from a portal Web link defined by the user, with the support of the base ontology (OB), which is loaded in memory and helps throughout the navigation process in the search for classes and instances.

It is worth observing that the strategy presented in Section 4 is a generic proposal and may be applied to any other domain for which there is a domain ontology. However, in order to show how this strategy is performed using NECOW, we use an example in the educational domain, supported by the OBEDU ontology [Lachtim et al. 2009a], which provides English and Portuguese vocabulary. A fragment of this ontology is presented in Figure 2. We also start our use case example with the IME institution, described through its portal, as shown in Figure 6. When navigation starts through this portal, the HTML source code of each page visited along the process is analyzed to verify if the label corresponding to the tags title, link, item list and HTML tables (