Web based Knowledge Extraction and Consolidation for Automatic Ontology Instantiation

Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, Nigel Shadbolt
I.A.M. Group, ECS Dept.
University of Southampton
Southampton, UK
{ha, sk, dem, mjw, wh, phl, nrs}@ecs.soton.ac.uk

ABSTRACT
The Web is probably the largest and richest information repository available today. Search engines are the common access routes to this valuable source. However, the role of these search engines is often limited to the retrieval of lists of potentially relevant documents. The burden of analysing the returned documents and identifying the knowledge of interest is therefore left to the user. The Artequakt system aims to deploy natural language tools to automatically extract and consolidate knowledge from web documents and instantiate a given ontology, which dictates the type and form of knowledge to extract. Artequakt focuses on the domain of artists, and uses the harvested knowledge to generate tailored biographies. This paper describes the latest developments of the system and discusses the problem of knowledge consolidation.

Categories and Subject Descriptors
I.2.6 Learning - Knowledge acquisition
I.2.7 Natural Language Processing - Text analysis, Language parsing and understanding

Keywords
Information Extraction, Ontology Instantiation, and Knowledge Consolidation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
K-CAP'03, October 23-25, 2003, Sanibel Island, FL, USA.
Copyright 2003 ACM 1-58113-000-0/00/0000…$5.00

INTRODUCTION
Web pages are the source of vast amounts of knowledge. This knowledge is often buried by layers of text and scattered over numerous sites. Associating web pages with annotations to identify their knowledge content is the ambition of the Semantic Web [3]. Much research is now focused on developing ontologies to manipulate this knowledge and provide a variety of knowledge services. Automatic instantiation of ontologies and building knowledge bases (KB) with knowledge extracted from the web corpus is therefore very beneficial. Artequakt is concerned with automating ontology instantiation with knowledge triples (subject - relation - object) about the life and work of artists, and providing this knowledge for biography generation services.

When analysing and extracting information from multi-sourced documents, it is inevitable that duplicated and contradictory information will be extracted. Handling such information is challenging for automatic extraction and ontology instantiation approaches [18]. Artequakt applies a set of heuristics and reasoning methods in an attempt to distinguish conflicting information, to verify it, and to identify and merge duplicate assertions in the KB automatically.

This paper describes the main components of the Artequakt system, focusing on the latest development with respect to knowledge consolidation and ontology instantiation.

RELATED WORK
Extracting information from web pages to generate various reports is becoming the focus of much research. The closest work we found to Artequakt is in the area of text summarisation. A number of summarisation techniques have been described to help bring together important pieces of information from documents and present them to the user in a compact form.

Even though most summarisation systems deal with single documents, some have targeted multiple resources [12][23]. Statistical based summarisations tend to be domain independent, but lack the sophistication required for merging information from multiple documents [17]. On the other hand, Information Extraction (IE) based summarisations are more capable of extracting and merging information from various resources, but due to the use of IE, they are often domain dependent.

Radev developed the SUMMONS system [17] to extract information and generate summaries of individual events from MUC (Message Understanding Conferences) text corpora. The system compares information extracted from multiple resources, merges similar content and highlights contradictions. However, as in most IE based systems, information merging is often based on linguistics and timeline comparison of single events [17][23] or multiple events [18].

Artequakt's knowledge consolidation is based on the comparison of individual knowledge fragments, rather than linguistic analyses or timeline comparison. Furthermore, Artequakt's consolidation is more fine-grained, focusing on the comparison and merging of individual entities (e.g. places, people, dates).

Most traditional IE systems are domain dependent due to the use of linguistic rules designed to extract information of specific content (e.g. bombing events (MUC systems), earthquake news [23], sports matches [18]). Adaptive IE systems [4] can ease this problem by identifying new extraction rules induced from example annotations supplied by users. However, training such tools can be difficult and time consuming. Promising results are offered by more advanced adaptive IE tools, such as Armadillo [6], which discovers new linguistic and structural patterns automatically, thus requiring limited bootstrapping.

Using ontologies to back up IE is hoped to support information integration [2][18] and increase domain portability [10][11]. Poibeau [16] investigated increasing domain independence by using clustering methods on text corpora to help users construct primitive ontologies that represent the main corpus topics. Templates could then be generated from the ontology to guide the IE process. Ontologies produced by this approach are limited to the content of the corpus, rather than representing a specific domain. In some cases (such as in Artequakt) the corpus is very large and diverse (e.g. the Web). Creating ontologies from such a corpus is infeasible. Furthermore, these ontologies are likely to be rough, shallow, and include undesired concepts that happen to be in the text corpus. Consequently, the cost of bringing such ontologies into shape might exceed the benefit.

Instantiating ontologies with assertions from textual documents can be a very laborious task. A number of tools have been developed that instantiate ontologies semi-automatically with user driven annotations [20]. IE learning tools, such as Amilcare [4], can be used to automate part of the annotation process and speed up ontology instantiation [7][21].

ARTEQUAKT
The Artequakt project has implemented a system that searches the Web and extracts knowledge about artists, based on an ontology describing that domain, and stores this knowledge in a KB to be used for automatically producing personalised biographies of artists. Artequakt draws from the expertise and experience of three separate projects: Sculpteur¹, Equator², and AKT³. The main components of Artequakt are described in the following sections.

¹ http://www.sculpteurweb.org/
² http://www.equator.ac.uk/
³ http://www.aktors.org/

System Overview
Figure 1 illustrates Artequakt's architecture, which comprises three key areas. The first concerns the knowledge extraction tools used to extract factual information from documents and pass it to the ontology server. The second key area is information management and storage. The information is stored by the ontology server and consolidated into a KB which can be queried via an inference engine. The final area is the narrative generation. The Artequakt server takes requests from a reader via a simple Web interface. The request will include an artist and the style of biography to be generated (chronology, summary, fact sheet, etc.). The server uses story templates to render a narrative from the information stored in the KB using a combination of original text fragments and natural language generation.

Figure 1. The Artequakt Architecture

The architecture is designed to allow different approaches to information extraction to be incorporated, with the ontology acting as a mediation layer between the IE and the KB. Currently we are using textual analysis tools to scrape web pages for knowledge, but with the increasing proliferation of the semantic web, additional tools could be added that take advantage of any semantically augmented pages, passing the embedded knowledge through to the KB.

As well as keeping open the interface between the KB and the extraction technology, a clear separation has been kept between the creation of a structured document from the knowledge base and the rendering of that document. In the current system, the information is rendered into an HTML page but alternative rendering engines could be envisaged. For example, rather than presenting the biography as a linear textual document, the information might be rendered into a dynamic presentation system such as SMIL, converted into an audio stream using text to speech tools, or perhaps used to generate a dynamic hypertext with links referring back to queries to the KB on items such as artists' names.

Artequakt Ontology
For Artequakt the requirement was to build an ontology to represent the domain of artists and artefacts. The main part of this ontology was constructed from selected sections in the CIDOC Conceptual Reference Model (CRM⁴) ontology. The CRM ontology is designed to represent artefacts, their production, ownership, location, etc. This ontology was modified for Artequakt and enriched with additional classes and relationships to represent a variety of information related to artists, their personal information, family relations, relations with other artists, details of their work, etc. The Artequakt ontology and KB are accessible via an ontology server.

⁴ http://cidoc.ics.forth.gr/index.html

KNOWLEDGE EXTRACTION
The aim of our knowledge extraction tool is to identify and extract knowledge triples from text documents and to provide them as RDF files for entry into the KB [10]. Artequakt uses an ontology coupled with a general-purpose lexical database (WordNet) [15] and an entity-recogniser (GATE) [5] as guidance tools for identifying knowledge fragments. Artequakt attempts to identify not just entities, but also their relationships, following ontology relation declarations and lexical information.

Extraction Procedure
[...] query is passed to selected web search engines and the search results are analysed with respect to relevancy to the [...]

Below is an example of an extracted paragraph: [...]

[...] that 'Pierre-Auguste Renoir' is a person's name, 'Feb[...] [...]tial relations in the ontology; 'date_of_birth' and 'place_of_birth'. Since both relations are associated with 'February 25, 1841' and 'Limoges' respectively, this sentence generates the following knowledge triples about Renoir:
• Pierre-Auguste Renoir date_of_birth 25/2/1841
• Pierre-Auguste Renoir place_of_birth Limoges

The second sentence generates knowledge triples related to Renoir's family:
• Pierre-Auguste Renoir has_father Person_2
• Person_2 job_title Tailor
• Pierre-Auguste Renoir has_mother Person_3
• Person_3 job_title Dressmaker

Inaccurately extracted knowledge may reduce the quality of the system's output. For this reason, our extraction rules were designed to be of low risk levels to ensure higher extraction precision. Advanced consistency checks can help identify some extraction inaccuracies, e.g. a date of marriage that falls before the date of birth, or two unrelated places of birth for the same person.

The extraction process terminates by sending the extracted knowledge to the ontology server. Figure 2 is the RDF representation of the extracted knowledge. Artequakt's IE process is out of the scope of this paper, and is fully described in [2] and [10].

Figure 2. RDF representation of knowledge extracted from the paragraph: "Pierre-Auguste Renoir was born in Limoges on February 25, 1841. His father was a tailor."

BIOGRAPHY GENERATION
Once the information has been extracted, stored and consolidated, the Artequakt system repurposes it by automatically generating biographies of the artists. Figure 3 shows a biography of Renoir.

Figure 3. A Biography Generated Using Sentences.

The biographies are based on templates authored in the Fundamental Open Hypermedia Model (FOHM) and stored in the Auld Linky contextual structure server [13]. Each section of the template is instantiated with paragraphs or sentences generated from information in the KB. The KB informs the templates of the theme of the sentences and paragraphs (e.g. influences, family info, painting) and the generation tool selects the relevant ones and structures them in the desired form and order.

Very little text generation is used in the current implementation (e.g. Figure 3, 1st and last sentences), but this will be the focus of the next phase.

By storing conflicting information rather than discarding it during the consolidation process, the opportunity exists to provide biographies that set out arguments as to the facts (with provenance, in the form of links to the original sources) by juxtaposing the conflicting information and allowing the reader to make up their own mind.

Different templates can be constructed for different types of biography. Two examples are the summary biography, which provides paragraphs about the artist arranged in a rough chronological order, and the fact sheet, which simply lists a number of facts about the artist, e.g. date of birth, place of study etc. The biographies also take advantage of the structure server's ability to filter the template based on a user's interest. If the reader is not interested in the family life of the artist, the biography can be tailored to remove this information. More about Artequakt's biography generation is available in [14].

AUTOMATIC INSTANTIATION
Storing knowledge extracted from text documents in KBs offers new possibilities for further analysis and reuse. Ontology instantiation refers to the insertion of information into the KB, as described by the ontology (sometimes referred to as ontology population). Instantiating ontologies with a high quantity and quality of knowledge is one of the main steps towards providing valuable and consistent ontology-based knowledge services. Manual ontology instantiation is very labour intensive and time consuming. Some semi-automatic approaches have investigated creating document annotations and storing the results as assertions [7][20][21]. [7] and [20] describe two frameworks for user-driven ontology-based annotations, reinforced with the IE learning tool Amilcare [4]. However, the two frameworks are manually driven and mainly focus on entity annotations. They lack the capability of identifying relationships reliably. In [20], relationships were added automatically between instances, but only if these instances already existed in the KB; otherwise user intervention was required.

In Artequakt we investigate the possibility of moving towards a fully automatic approach of feeding the ontology with knowledge extracted from unstructured text. Information is extracted in Artequakt with respect to a given ontology and provided as RDF or XML files using tags mapped directly from names of classes and relationships in that ontology. When the ontology server receives a new RDF file, a feeder tool is activated to parse the file and add its knowledge triples to the KB automatically. Once the feeding process terminates, the consolidation tool searches for and merges any duplication in the KB.

KNOWLEDGE BASE CONSOLIDATION
Automatically instantiating an ontology from diverse and distributed resources poses significant challenges. One persistent problem is that of the consolidation of duplicate information that arises when extracting similar or overlapping information from different sources. Tackling this problem is important to maintain the referential integrity and quality of results of any ontology-based knowledge service. Reidsma et al. [18] relied on manually assigned object identifiers to avoid duplication when extracting from different documents.

Little research has looked at the problem of information consolidation in the IE domain. This problem becomes more apparent when extracting from multiple documents. Comparing and merging extracted information is often based on domain dependent heuristics [17][18][23]. Our approach attempts to identify inconsistencies and consolidate duplications automatically using a set of heuristics and term expansion methods based on WordNet [22].

Duplicate Information
There exist two main types of duplication in our KB: duplicate instances (e.g. multiple instances representing the same artist), and duplicate attribute values (e.g. multiple dates of birth extracted for the same artist).

Artequakt's IE tool treats each recognised entity (e.g. Rembrandt, Paris) as a new instance. This may result in creating instances with overlapping information (e.g. two Person instances with the same name and date of birth). The role of consolidation in Artequakt includes analysing and comparing attribute values of the instances of each type of concept in the KB (e.g. Person, Date) to identify inconsistencies and duplications.

The amount of overlap between the attribute values of any pair of instances could indicate their duplication potential. However, this overlap is not always measurable. IE tools are sometimes only able to extract fragments of information about a given entity (e.g. an artist), especially if the source document or paragraph is small or difficult to analyse. This leads to the creation of new instances with only one or two facts associated with each. Consider, for example, two artist instances with the name Rembrandt, where one instance has a location relationship to Holland, while the other has a date of birth of 1606. Comparing such shallow instances will not reveal their duplication potential. Furthermore, neither the source information nor the information extraction is always accurate. For example, a Rembrandt instance can be extracted with the correct family attribute values, but with the wrong date of birth, in which case this instance will be mismatched with other Rembrandt instances in spite of referring to the same artist.

Unique Name Assumption
One basic heuristic applied in Artequakt is that artist names are unique, so artist instances with identical names are merged. According to this heuristic, all instances with the name Rembrandt are combined into one instance. This heuristic is obviously not foolproof, but it works well in the limited domain of artists.

Information Overlap
There are cases where the full name of an artist is not given in the source document or its extraction fails, in which case the instances will not be captured by the unique-name heuristic. For example, when we extracted information about Rembrandt and merged same-name artists, two instances remained for this artist: Rembrandt and Rembrandt Harmenszoon van Rijn. In such a case we compare certain attribute values, and merge the two instances if there is sufficient overlap. For the two Rembrandt instances, both had the same date and place of birth, and therefore were combined into one instance. The duplication would not have been caught if these attributes had different values.

Attribute Comparison
When the above heuristics are applied, merged instances might end up having multiple attribute values (e.g. multiple dates and places of birth), which in turn need to be analysed and consolidated. Note that some of these attributes might hold conflicting information that should be verified and held for future comparison and use.

Comparing the values of instance attributes is not always straightforward as these values are often extracted in different formats and specificity levels (e.g. synonymous place names, different date styles), making them harder to match. Artequakt applies a set of heuristics and expansion methods in an attempt to match these values. Consider the following sentences:
1. Rembrandt was born in the 17th century in Leyden.
2. Rembrandt was born in 1606 in Leiden, the Netherlands.
3. Rembrandt was born on July 15 1606 in Holland.

These sentences provide the same information about an artist, written in different formats and specificity levels. Storing this information in the KB in such different formats is confusing for the biography generator, which can benefit from knowing which information is repetitive and which is contradictory. Matching the above sentences required enriching the original ontology with some temporal and geographical reasoning.

Geographical Consolidation
There has been much work on developing gazetteers of place names, such as the Thesaurus of Geographic Names (TGN) [8] and the Alexandria Digital Library [9]. Ontologies can be integrated with such sources to provide the necessary knowledge about geographical hierarchies, place name variations, and other spatial information [1]. Artequakt derives its geographical knowledge from WordNet [15]. WordNet contains information about geopolitical place names and their hierarchies, providing three useful relations for the context of Artequakt: synonym, holonym (part of), and part_meronym (sub part). The Artequakt ontology is extended to add this information for each new instance of place added to the KB.

Place Name Synonyms
The synonym relationship is used to identify equivalent place names. For example, the three sentences above mention several place names where Rembrandt was born. Using the synonym relationship in WordNet, Leyden can be identified as a variant spelling of Leiden, and Holland and The Netherlands can be identified as synonymous.

Place Specificity
The part-of and sub-part relationships in WordNet are used to find any hierarchical links between the given places. WordNet shows that Leiden is part of the Netherlands, indicating that Leiden is the more precise information about Rembrandt's place of birth.

Shared Place Names
It is common for places to share the same name. For example, according to the TGN, there are 22 places worldwide named London. This problem is less apparent with WordNet due to its limited geographical coverage.

In Artequakt, disambiguation of place names is dependent on their specificity variations. For example, after processing the three sentences about Rembrandt, it becomes apparent that he was born in a place named Leiden in the Netherlands. If the last two sentences were not available, it would not have been possible to tell for sure which Leiden is being referred to (assuming there is more than one). One possibility is to rely on other information, such as place of work or place of death, to make a disambiguation decision. However, this is likely to produce unreliable results.

Temporal Consolidation
Dates need to be analysed to identify any inconsistencies and locate precise dates to use in the biographies. Simple temporal reasoning and heuristics can be used to support this task.

Artequakt's IE tool can identify and extract dates in different formats, providing them as day, month, year, decade, etc. This requires consolidation with respect to precision and consistency. Going back to our previous example, to consolidate the first date (17th century), the process checks if the years of the other dates fall within the given century. If this is true, then the process tries to identify the more precise date. The date in the third sentence is favoured over the other two dates as they are all consistent, but the third date holds more information than the other two. Therefore, the third date is used for the instance of Rembrandt. If any of the given facts is inconsistent then it will be stored for future verification and use.

At the end of the consolidation process, the knowledge extracted from the three sentences above will be stored in the KB as the following two triples for the instance of Rembrandt:
• Rembrandt date_of_birth 15 July 1606
• Rembrandt place_of_birth Leiden

Inconsistent Information
Some of the extracted information can be inconsistent, for example an artist with different dates or places of birth or death, or inconsistent temporal information, such as a date of death that falls before the date of birth.

The source of such inconsistency can be the original document itself, or an inaccurate extraction. Predicting which knowledge is more reliable is not trivial. Currently we rely on the frequency with which a piece of knowledge is extracted as an indicator of its accuracy; the more often a particular piece of information is extracted, the more accurate it is considered to be. For example, for Renoir, two distinct dates of birth emerged: 25 Feb 1841 and 5 Feb 1841. The former date was extracted from several web sites, while the latter was found in one site only, and is therefore considered to be less reliable.

A more advanced approach can be based on assigning levels of trust to each extracted piece of knowledge, which can be derived from the reliability of the source document, or the confidence level of the extraction of that particular information. The knowledge consolidation process is not aimed at finding 'the right answers' however. The facts extracted are stored for future use, with references to the original material.

PORTABILITY TO OTHER DOMAINS
The use of an ontology to back up IE is meant to increase the system's portability to other domains. By swapping the current artist ontology with another domain specific one, the IE tool should still be able to function and extract some relevant knowledge, especially if it is concerned with domain independent relations expressed in the ontology, such as personal information (name, date and place of birth, family relations, etc). However, some domain specific extraction rules, such as painting style, will eventually have to be retuned to fit the new domain.

Similarly, the generation templates are currently manually set for biography construction. These templates may need to be modified if a different type of output is required. We aim to investigate developing templates that can be dynamically instructed and modified by the ontology.

Consolidation is often based on domain dependent heuristics. However, some of the heuristics used in Artequakt can be suitable for other domains. For example, Artequakt's approach for comparing and integrating place names using external gazetteers can be used in any domain. Similarly, heuristics concerning the comparison of specific facts to decide whether or not two instances of people are duplicates are also domain independent. Further work is planned to extend the scope of information integration.

Building a cross-domain system is one of the aims of this project, and will be fully investigated in the next stage of development.

EVALUATION
We used the system to instantiate the KB with information on five artists, extracted from around 50 web pages.

Extraction Performance
Precision and recall were calculated for a set of 10 artist relations (about birth, death, places where they worked or studied, who influenced them, professions of their parents, etc). Results showed that precision scored higher than recall, with average values of 85% and 42% respectively. The experiment is described in more detail in [2].

Biography Evaluation
Although we have not conducted any formal evaluation of the biographies generated by the system, we are in the position to make a few observations. In general we found that the system is fairly successful in reproducing text for a given artist. We are currently looking at how best to perform a qualitative evaluation of the biographies, perhaps with a task-based user evaluation, comparing the Artequakt system with a traditional search engine.

Consolidation Rate
Table 1 shows the reduction rate in number of instances and relations after consolidating the KB. Applying the heuristics described earlier in the paper led to a reduction in the number of instances of the Person and Date classes by 90% and 64% respectively. Before consolidation, 283 instances representing Rembrandt were stored. The unique-name consolidation heuristic was the most effective, with no identified mistakes.

When place instances are fed to the KB, they are expanded using WordNet and stored alongside their synonyms, holonyms (part of), and part_meronyms (sub parts). The number of Place instances created in the KB has therefore increased significantly (94% rise). This gave the consolidation the power to identify and consolidate relationships to places as described in the geographical consolidation section. Some instances (mainly dates) were not consolidated due to slight syntactical differences, e.g. "25th/2/1841" versus "25/2/1841". This highlights the need for an additional syntactic-checking process that could eliminate such noise.

Table 1. Consolidation rates

Class              Before consolidation   After consolidation   Rate (%)
Person instance    1475                   152                   -90
Date instance      83                     30                    -64
Place instance     30                     505                   +94
Person relations   4240                   1562                  -63

CONCLUSIONS
This paper describes a system that automatically extracts knowledge, instantiates an ontology with knowledge triples, and reassembles the knowledge in the form of biographies. Problems related to this task, such as the identification and consolidation of duplicated knowledge and the verification of inconsistent knowledge, are highlighted. Artequakt's approaches to tackle these problems are described.

An initial experiment, using around 50 web pages and 5 artists, showed promising results, with nearly 3 thousand unique knowledge triples extracted (before consolidation). However, some of this knowledge was too sparse to be of any clear benefit. This indicates that more pages need to be processed, and further rules need to be constructed to cover additional ontology concepts and relations and expand the knowledge extraction scope.

The generated biographies were informative and brought together knowledge extracted from various sources. However, reusing original text to generate biographies highlighted several problems, including co-referencing and other textual deixis (such as 'Later', or 'Nevertheless'). This underlines the potential benefits of regenerating text directly from the extracted facts, which is part of our near future plans.

Our consolidation techniques significantly decreased the number of instances in the KB, by up to 90% for certain classes and 63% for attributes related to instances of Person. A few instances remained undetected, mainly due to lack of information required for the knowledge comparison.

Future work on Artequakt will continue to develop its modular architecture and refine the information extraction and consolidation processes. In addition we are beginning to look at how we might leverage the full power of the underlying ontology to aid extracting information from multiple domains and produce different types of reports.

ACKNOWLEDGEMENTS
This research is funded in part by EU Framework 5 IST project "Sculpteur" IST-2001-35372, EPSRC IRC project "Equator" GR/N15986/01 and EPSRC IRC project "AKT" GR/N15764/01.

REFERENCES
[1] Alani, H., Jones, C., Tudhope, D.: Associative and Spatial Relationships in Thesaurus-Based Retrieval. Proc. 4th European Conf. on Digital Libraries, pages 45--58, Lisbon, Portugal, Sept. LNCS, 2000.
[2] Alani, H., Kim, S., Millard, D., Weal, M., Lewis, P., Hall, W., Shadbolt, N.: Automatic Extraction of Knowledge from Web Documents. Workshop on Human Language Technology for the Semantic Web and Web Services, 2nd Int. Semantic Web Conf., Sanibel Island, Florida, USA, 2003.
[3] Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American, 2001.
[4] Ciravegna, F.: Adaptive Information Extraction from Text by Rule Induction and Generalisation. Proc. 17th Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 1251--1256, Seattle, USA, 2001.
[5] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. Proc. 40th Anniversary Meeting of the Association for Computational Linguistics, Philadelphia, USA, 2002.
[6] Dingli, A., Ciravegna, F., Guthrie, D., Wilks, Y.: Mining Web Sites Using Unsupervised Adaptive Information Extraction. Proc. 10th Conf. of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 2003.
[7] Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM - Semi Automatic Creation of Metadata. Semantic Authoring, Annotation and Markup Workshop, 15th European Conf. Artificial Intelligence, France, Lyon, 2002.
[8] Harpring, P.: Proper Words in Proper Places: The Thesaurus of Geographic Names. MDA Info. 2(3), 1997.
[9] Hill, L.L., Frew, J., Zheng, Q.: Geographic Names. The Implementation of a Gazetteer in a Georeferenced Digital Library. Digital Library Magazine, 5(1), 1999.
[10] Kim, S., Alani, H., Hall, W., Lewis, P.H., Millard, D.E., Shadbolt, N., Weal, M.J.: Artequakt: Generating Tailored Biographies with Automatically Annotated Fragments from the Web. Workshop on Semantic Authoring, Annotation & Knowledge Markup, 15th Europ. Conf. on Artificial Intelligence, pp 1--6, France, 2002.
[11] Maedche, A., Neumann, G., Staab, S.: Bootstrapping an Ontology-based Information Extraction System. Intelligent Exploration of the Web. P. Szczepaniak, et al., Heidelberg, Springer 2002.
[12] McKeown, K.R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J.L., Nenkova, A., Sable, C., Schiffman, B., Sigelman, S.: Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. Proc. Human Language Technology Conf., San Diego, CA, USA. 2002.
[13] Michaelides, D.T., Millard, D.E., Weal, M.J., DeRoure, D.: Auld Leaky: A Contextual Open Hypermedia Link Server. Proc. 7th Hypermedia: Openness, Structural Awareness, and Adaptivity, pages 59--70, Springer Verlag, Heidelberg, 2001.
[14] Millard, D.E., Alani, H., Kim, S., Weal, M.J., Lewis, P., Hall, W., DeRoure, D., Shadbolt, N.: Generating Adaptive Hypertext Content from the Semantic Web. 1st International Workshop on Hypermedia and the Semantic Web, HyperText'03, Nottingham, UK. 2003.
[15] Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to wordnet: An on-line lexical database. Int. J. Lexicography, 3(4):235--312, 1993.
[16] Poibeau, T.: Deriving a multi-domain information extraction system from a rough ontology. Proc. 17th Int. Conf. on Artificial Intelligence, Seattle, USA, 2001.
[17] Radev, D.R., McKeown, K.R.: Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469--500, 1998.
[18] Reidsma, D., Kuper, J., Declerck, T., Saggion, H., Cunningham, H.: Cross document annotation for multimedia retrieval. EACL Workshop on Language Technology and the Semantic Web, Budapest, 2003.
[19] Staab, S., Maedche, A., Handschuh, S.: An Annotation Framework for the Semantic Web. Proc. 1st Int. Workshop on MultiMedia Annotation, Tokyo, 2001.
[20] Vargas-Vera, M., Motta, E., Domingue, J., Buckingham Shum, S., Lanzoni, M.: Knowledge Extraction by using an Ontology-based Annotation Tool. Proc. Workshop on Knowledge Markup & Semantic Annotation, 1st Int. Conf. on Knowledge Capture, pp 5--12, Victoria, B.C., Canada, 2001.
[21] Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., Ciravegna, F.: MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. 13th Int. Conf. on Knowledge Engineering and Management (EKAW), Spain, 2002.
[22] Voorhees, E.M.: Using WordNet for Text Retrieval. Fellbaum (ed.) WordNet: An Electronic Lexical Database, pages 285--303, MIT Press, 1998.
[23] White, M., Korelsky, T., Cardie, C., Ng, V., Pierce, D., Wagstaff, K.: Multidocument Summarization via Information Extraction. Proc. of Human Language Technology Conf. (HLT 2001), San Diego, CA, 2000.