Artequakt: Generating Tailored Biographies with Automatically Annotated Fragments from the Web

Sanghee Kim, Harith Alani, Wendy Hall, Paul H. Lewis, David E. Millard, Nigel R. Shadbolt and Mark J. Weal

Intelligence, Agents, Multimedia Group, University of Southampton, SO17 1BJ, UK

Abstract. The Artequakt project is working towards automatically generating narrative biographies of artists from knowledge that has been extracted from the Web and maintained in a knowledge base. An overview of the system architecture is presented here and the three key components of that architecture are explained in detail, namely knowledge extraction, information management and biography construction. Conclusions are drawn from the initial experiences of the project and future plans are described.

1 INTRODUCTION

The growth of the World Wide Web (Web) and the corpus of documents that it covers have increased the demand for content to be annotated. Such annotation facilitates systematic search and discovery of knowledge and intelligent information processing. Annotating existing Web documents is one of the basic barriers to realising the Semantic Web ([11], [25]).

Annotations can be roughly classified into two types. The first is concerned with identifying textual entities in documents that match information already existing in a knowledge base, e.g. the word ‘Rembrandt’ in the document is matched to a painter’s name in the knowledge base. Such annotations are normally restricted to the type and amount of information held in the knowledge base. The other type of annotation is concerned with locating new factual information in documents based on a given domain classification structure, e.g. ‘Rembrandt’ in the document is the ‘name’ of a ‘Painter’, where Painter is a class in the ontology with the relation name. This new fact can be asserted in the knowledge base. This second type is the main approach taken to annotation in the Artequakt project.

Previous work on annotation has demonstrated the value of coupling Natural Language Processing (NLP) with ontologies ([13], [23]). The ontology can guide the annotation task by restricting it to a specific domain and, unlike “rigid templates”, can provide it with knowledge inference and conceptual browsing facilities [13]. An ontology-based approach to annotation needs to deal with the issues of duplicate information across documents, managing ontology change, and redundant annotations [22].

Annotation can exist in different forms and be used in a variety of ways. One interesting possibility is to use it to restructure the original source material in new ways, producing a dynamic presentation tailored to the user’s needs.

Previous work on the production of dynamic presentations has highlighted the difficulties of maintaining a rhetorical structure across a dynamically assembled sequence [20]; as a consequence, there has been a focus on dynamic presentation decisions as opposed to narrative ones [14]. Where dynamic narrative is present it has been based around a robust story-schema such as the format of a news program (a sequence of atomic bulletins) [12].

It is our belief that by building a story-schema layer on top of an ontology we can create dynamic stories within a certain domain. By populating the ontology through automatic annotation software we could allow those stories to be constructed from the vast wealth of information that exists on the World Wide Web.

1.1 The Artequakt Project

The Artequakt project aims to implement such a system around the domain of artists and their paintings, automatically producing tailored biographies of artists from fragments of information extracted from the Web. This is not an attempt to out-perform hand-crafted biographies, but rather to gather information from a wide variety of sources and target it specifically at the interests of a particular reader.

The first stage of this project consists of developing an ontology for the domain of artists and paintings. A selection of information extraction tools and techniques are being developed and applied that attempt to automatically generate annotated content from online documents based on the project’s ontology and WordNet lexicons. The annotations are stored in a knowledge base and will be analysed for duplications. In the second stage, narrative construction tools are being developed to query the knowledge base through an ontology server to search and retrieve relevant facts or textual paragraphs and generate a specific biography.

The automatic generation of tailored biographies has two areas of focus. The first is providing biographies for artists where only sparse information is available, distributed across the Web. This may mean constructing text from the basic factual information gleaned, or combining text from a number of sources with differing interests in the artist. The second is providing biographies that are tailored to the particular interests and requirements of a given reader. These might range from rough stereotyping such as “A biography suitable for a child” to specific reader interests such as “I’m interested in the artist’s use of colour in their oil paintings”.

The expertise and experience of three separate projects are drawn together under the umbrella of the Artequakt project. These are:

• The Artiste project - A European project working on a distributed database of art images in collaboration with partners that include the Louvre, the Uffizi Gallery, the National Gallery and the Victoria and Albert Museum.
• The Equator IRC - An EPSRC funded Interdisciplinary Research Centre that, amongst many other activities, is investigating the use of narrative techniques in information structuring and presentation.
• The AKT IRC - An EPSRC funded Interdisciplinary Research Centre looking at all aspects of the knowledge lifecycle.

Although focussing on artists and their paintings, the techniques being developed could be applied to other domains.

This paper will examine the overall proposed Artequakt architecture, looking at the three main component parts, namely knowledge extraction, knowledge representation and storage, and narrative generation.

2 ARCHITECTURE OVERVIEW

Figure 1 illustrates the system architecture used for the initial Artequakt demonstrator. Three key areas can be identified.

The first concerns the knowledge extraction tools. These are to be used to extract factual information items together with sentences and paragraphs from web documents that might be manually selected or obtained automatically using appropriate search engine technology. The fragments of information are passed to the ontology server along with metadata derived from the vocabulary of the ontology.

The second key area is information management and storage. The information is stored by the ontology server and consolidated into a knowledge base focused on artists and paintings.

The final key area is narrative generation. The Artequakt servlet will take requests from a reader via a simple web interface. The reader request will usually include an artist for whom to generate a biography in a particular style (chronology, through the paintings etc.) and also any user information; for example, the narrative might be generated specifically for a child or an art historian. The server then uses story templates to render a narrative from the information stored in the knowledge base. The rest of this paper will examine these three areas in more detail.

Figure 1. The Artequakt Architecture
[Diagram: input Web pages feed the Knowledge Extraction Tools, which pass their output to the Ontology Server and its Knowledge Base. The reader selects an artist and a biography style; the Artequakt Servlet uses Biography Templates held by the Contextual Structure Server together with the Ontology Server, and the biography is rendered as a web page.]

3 KNOWLEDGE EXTRACTION

The aim of the knowledge extraction section is to extract and identify factual information from Web-based documents and to structure it appropriately for entry into the knowledge base. Much of the information on the Web is in the form of natural language documents. One of the promising approaches to providing easy access to such documents is centred on information extraction that reduces them to tabular structures from which fragments of documents can be retrieved as answers to queries. However, the effort and time needed to annotate a large number of texts, and the prerequisite of acquiring background knowledge that stipulates which types of information are extractable, are major challenges to exploiting such extraction techniques for practical purposes [25]. Work such as [4] and [13] investigated the application of machine learning techniques in order to automatically identify patterns from annotated example texts.

In particular, whilst many attempts have been made to extract information from the Web by using manually annotated texts, no robust and reliable methodologies are yet available. Documents from the Web use limitless vocabularies, structures and/or composition styles for defining approximately the same content, implying that it is of little use to make efforts to locate recurrent syntactic patterns. For example, although content similarity between two biographic documents might be expected, the expressions used in the two sources may vary dramatically.

These observations have led us initially to use a natural language-based extraction approach for a comparatively deeper content understanding from which various clues concerning semantic and syntactic features can be obtained. The use of an ontology coupled with a general-purpose lexical database (WordNet [17]) as a guidance tool for creating interesting relations is another dimension of our initial approach, aiming at minimising reliance on domain-specific extraction rules. Figure 2 shows extraction results based on the example ‘Rembrandt’s father was a miller who died in 1630’. Two biographic pieces of information about ‘Rembrandt’s father’ (i.e. ‘job title (miller)’ and ‘date of death (1630)’) were captured, as well as the fact that ‘Rembrandt’ is a person and that he is the son of a dead miller.

3.1 Natural Language Information Extraction

The capability of recognising a named entity without the annotation effort of humans or the need to create extraction rules is one of the objectives of our approach. The idea is to make use of general-purpose lexical databases and to exploit the knowledge from syntactical and semantic analysis to clarify the types and structures of given information. Although the proposed approach may not be as sophisticated as manually annotated definitions, its contribution lies in its extensibility and practical nature (acceptable performance).

We use a paragraph as the unit of semantic analysis instead of a sentence, since much of the critical information used for interpreting text is scattered across different sentences (as observed in [3]). Downloaded documents from the Web are first divided into paragraphs, which are then broken down into a group of sentences. The paragraphs are analysed as follows:

1. Syntactical analysis: A sentence is decomposed into a set of grammatically related phrases (e.g. a verb-phrase, or a noun-phrase). We have used the Apple Pie Parser, which is a bottom-up probabilistic chart parser and is freely available [21].

2. Semantic analysis:

• Identification of main components: each compound sentence is decomposed into simplified structures, each of which contains one clause, i.e. a simple sentence. The components of each clause are grouped into three parts: subject, verb, and object. Temporal properties are inferred from the verb tense (e.g. ‘past’, ‘present’) and associated with the sentence. A writing style (e.g. ‘first-person’, ‘third-person’) can be derived from the personal pronoun if one exists in the sentence’s subject.

• Recognition of named entity: two resources are used for determining whether or not a given word denotes a person’s name. The first is syntactical tags, which are obtained as the result of the syntactical analysis carried out by the Apple Pie Parser. The second is gazetteers of people names, which are available as part of the GATE (General Architecture for Text Engineering, [6]) package. GATE provides text files which contain person names associated with gender attributes. A name which is not defined in GATE’s text files will still be extractable if it is tagged as a proper noun. Heuristics and grammar rules are applied in order to extract only proper personal nouns.

• Resolution of pronoun references (anaphoric references): a personal pronoun refers to a specific person, and acts as a subject (‘he’ or ‘she’), an object (‘him’ or ‘her’), or a marker of possession defining who owns a particular thing (‘his’ or ‘hers’). Currently we are using a simple resolution function that runs at reasonably fast speed and obtains the best-guessed referent. Three attributes (gender, number, and structural information) are considered in determining the right referent.

• Adding a missing subject: a clause can inherit a subject from a main clause, since it is syntactically dependent on the main clause.

In Figure 2, the given example ‘Rembrandt’s father was a miller who died in 1630’ is divided into two clauses. The same subject (i.e. ‘Rembrandt’s father’) is assigned to both clauses since the second clause is dependent on the first one. At this stage, ‘Rembrandt’ was successfully recognised as a person’s name. Gazetteers provided by GATE do not contain the name ‘Rembrandt’, whereas the syntactic tags for this sentence mark it as a proper noun.

3.2 Relation Extraction

To create a binary relationship between two extracted individual facts, knowledge about the pre-defined semantic relations is required. Consulting the ontology, which specifies various relationships among classes, will act as a basis for decisions concerning which relations to use. A query is submitted to the ontology server to obtain such knowledge.

In order to reduce the problem of linguistic variation between relations defined in the ontology and the extracted facts, we will use three lexical chains (synonyms, hypernyms, and hyponyms) as defined in WordNet. For example, the concept of ‘depict’ is matched with ‘portray’ (synonym) and ‘represent’ (hypernym). In order to reduce over- and under-generalisation, we will consider only one level of hypernyms and hyponyms when a given word is a verb.

The types of information are identified by tracing the hierarchies of hypernyms. For example, as shown in Figure 2, ‘miller’ is extracted as the job of Rembrandt’s father since its hypernyms map to ‘worker’. Factual data, such as a date or a city name, are extracted by using a date parsing program coupled with a simple grammar and the hypernyms defined in WordNet. In cases where there are multiple matches, all relations are represented in the output.

Figure 2. An example of knowledge extraction using ontology and WordNet

In Figure 2, relation extraction for both clauses is determined by the categorisation results of the verbs (i.e. ‘be’ and ‘die’). The ‘be’ verb poses a rather difficult case, since its semantic meaning is heavily dependent on other phrases, i.e. subject and object. According to WordNet definitions, one of its senses states ‘work in a specific place, with a specific subject or in a specific function’. Since its synonyms (i.e. ‘work’ and ‘follow’) are matched with ‘work’, we exploit this relation to further examine whether or not it is related to ‘job-information’.

In the second clause, since ‘die’ can be converted to the noun form ‘death’, the verb ‘die’ matches two potential relations (‘date of death’ and ‘place of death’). In this case ‘date of death’ was chosen since ‘1630’ was extracted from the same sentence and instantiated as date information.

The output from this section is an XML-formatted representation of the facts, paragraphs, sentences and keywords identified in the knowledge extraction process. The XML files are sent to the ontology server to populate the knowledge base.

4 KNOWLEDGE REPRESENTATION AND STORAGE

4.1 Artequakt Ontology

An ontology is a conceptualisation of a domain into a machine readable format [7]. For Artequakt the requirement is to build an ontology to represent the domain of artists and artefacts. This ontology is being implemented in Protégé, which is a graphical ontology editing tool [18]. The main part of this ontology is being constructed from selected sections of the CIDOC Conceptual Reference Model (CRM, [5]) ontology. CRM was developed by the ICOM/CIDOC Documentation Standards Group (http://www.cidoc.icom.org/) to represent an ontology for cultural heritage information. It was built to facilitate the transformation of existing disparate museum and cultural heritage information sources into one coherent source.

The CRM ontology is designed to represent artefacts, their production, ownership, location, etc. This ontology was modified for Artequakt and is being enriched with additional classes and relationships to represent a variety of information related to artists, their personal information, family relations, relations with other artists, details of their work, etc. The Artequakt ontology also allows the storage of textual paragraphs or sentences along with their source URLs so that at a later point they can be reorganised using the ontology as a guide.

4.2 Automatic Ontology Population

There is an increasing interest in building ontologies to provide a variety of knowledge services. Populating ontologies with knowledge is labour intensive and time consuming. Semi-automatic approaches to ontology population have been followed by, for example, [23], where relationships can be added automatically between instances if these instances already exist in the knowledge base; otherwise user intervention is needed. OntoAnnotate [22] and OntoMat [8] are tools that support user-driven ontology-based annotation, where the produced annotations can be fed back to the ontology.

In this project we are investigating the possibility of moving towards a fully automatic approach to feeding the ontology with knowledge extracted from the Web. As mentioned in section 3.2, this information is extracted with respect to the Artequakt ontology and provided as XML files, one per document, using tags mapped directly from the names of classes and relationships in the ontology. When a new XML file is produced (Figure 3(a)), it will be sent to the Artequakt ontology server, which launches a program to parse the received file and populate the ontology with the newly provided knowledge (Figure 3(b)).

Figure 3. a) XML file of extracted information is sent to the ontology server, b) The server creates the relevant instances and relationships in the ontology.
[Diagram: (a) an XML file recording the source URL http://search.ebi.eb.com/ebi/article/0,6101,36822,00.html, paragraphs and sentences from that page (e.g. ‘Rembrandt Harmenszoon van Rijn was born on July 15, 1606, in Leiden, the Netherlands…’ and ‘He was influenced by the work of Caravaggio’), and extracted values such as the name ‘Rembrandt Harmenszoon van Rijn’, the date ‘15 july 1606’ and the places ‘Leiden’ and ‘Netherlands’; (b) the corresponding instances and relationships in the ontology.]

The ontology server is based on Java sockets and connected to the Artequakt knowledge base through the Protégé API. A limited inference engine is being built on this server to allow querying and the retrieval of specific information from the ontology, for example to get all paragraphs that mention the date of birth of a specific artist, get the artist of a painting, get all available facts about an artist, etc.

4.3 Consolidating the Knowledge-Base

When analysing web documents about selected artists, it is inevitable that we will extract duplicated or even contradictory information. Handling such information is challenging for automatic ontology population approaches. Staab et al. [22] stressed the problem of creating duplicate objects when extracting from different documents. They relied on manually assigned object-identifiers to avoid duplication. Our approach attempts to identify and eliminate duplications automatically using a two-stage consolidation process.

The first stage is for the Artequakt ontology server to add all extracted information to the knowledge base regardless of what is already stored. This results in the creation of multiple instances of artists with possibly the same information (e.g. multiple instances of Rembrandt). The challenge is to identify which of these instances refer to the same artist, and which ones refer to genuinely different artists who happen to have the same name or information.

The second stage is to run a consolidation process to identify possible duplicate instances in the knowledge base, searching for clues in the rest of the information available about these instances. This is why it is best to feed the new information to the knowledge base first (stage 1), which provides the consolidation process with more information to compare.

The consolidation process involves applying a set of heuristics. Information extraction tools are sometimes only able to extract fragments of information about an artist, especially if the source document or paragraph is small or difficult to analyse. This results in the creation of new instances with only one or two facts associated with each, for example two artist instances with the name Rembrandt, where one instance has a location relationship to Holland while the other has a date of birth relationship to 1609. One heuristic to apply here is to merge such shallow instances into one instance of Rembrandt with both location and date of birth relations, keeping the original source URLs of each fact.

Another heuristic is that if two instances of same-name artists have equal values for their date and place of birth and death relationships, then these instances are likely to be duplicates, in which case they can be fused together as one instance; otherwise the two instances will stay separate. Such a heuristic helps to distinguish between same-name artists. The amount and type of information overlap between instances can be used to calculate a confidence value to indicate whether certain instances can be merged or should be left separate.

Another challenge in information consolidation is to identify exact matches. Identical information can exist in different versions. For example consider the sentences:

• Rembrandt was born in the 17th century in Leiden.
• Rembrandt was born in 1606 in the Netherlands.
• Rembrandt was born on July 15 1606 in Holland.

The sentences above provide similar information about an artist, written in different formats and at different levels of specificity. To match these sentences it will be necessary to enrich the current ontology with proper temporal and geographical representations. Some format variations can be dealt with at the extraction level. For example, the information extraction tools being used in this project can identify and extract dates in different formats and provide them as day, month, year, decade, etc. This information could be fed to the temporal ontology and reasoned over to match between different time frames.

There has been much work on developing databases and gazetteers of place names, such as the Thesaurus of Geographic Names (TGN, [9]), the Alexandria Digital Library (ADL, [10]), and WordNet, which also provides some geo-information. Such sources can be integrated with the current ontology to provide knowledge on geographical hierarchies, place name variations, and other spatial information [1].

5 NARRATIVE GENERATION

While machines benefit from using structured ontologies to exchange information, human beings need a more intuitive interface. One of the most natural ways to provide this is by storytelling. There is a wealth of critical and philosophical thought concerning narrative that can be drawn on to assist in constructing a story (in this case a biography) from the raw information gathered. Figure 4 shows one way of viewing the layers that make up a narrative, as proposed by Bal [2]. The raw facts and chronological collection of events in any particular tale are called the Fabula. For any given Fabula we could present the facts from different perspectives and in different sequences to produce a Story. We could then render any given Story into several different forms or Narratives (e.g. a film or novel).

In Artequakt the knowledge base can be thought of as our underlying fabula. To produce the eventual narrative (in our case pages of HTML) we need to first arrange sub-elements of the fabula into a sensible sequence and produce a story.

Figure 4. The Levels of Narrative
[Diagram: three layers, each paired with its implementation: Narrative (HTML pages), Story (contextual templates), Fabula (ontology + knowledge base). One Fabula can give rise to several Stories, and each Story to several Narratives.]

5.1 Biography Templates

The story structures we are using are human authored biography templates that contain queries into the knowledge base.

Previous work has stored queries into an ontological space as the destination of navigational links [24]; by following the links, the user causes the queries to be executed (and views the results). With Artequakt these basic links have evolved into more complex structures that arrange the queries into a sequence (a biography template).

The templates are written in XML using the Fundamental Open Hypertext Model (FOHM) [16], which is capable of representing a variety of hypermedia structures including tours and links. The XML files are then loaded into the Auld Linky contextual structure server [15], which provides pattern matching facilities over the structures via HTTP.

Any given biography template may be constructed from several sub-structures. The basic structure used is the Sequence. This represents a list of queries that have to be instantiated from the knowledge base and inserted into the biography in order. These queries are authored using the vocabulary of terms defined within the ontology.

Other structures allow more complex effects. A Concept structure contains several queries, any of which may be used at this point in the biography. A Level of Detail (LOD) structure is similar to a Concept, but there is an ordering between the queries that corresponds to preference (i.e. preferably the highest numbered query should be used; if that is not possible, the next highest, and so on). These structures may be nested (e.g. a sequence of concepts).

Some queries may retrieve paragraphs directly while others query the consolidated ontology for specific facts and construct sentences dynamically from the results. This can be useful for facts that have been inferred (and for which there is therefore no corresponding paragraph), or when there is no paragraph that fits the literary form of the rest of the biography (e.g. the biography is in third person, but all the available paragraphs are in first person).

The templates also contain contextual information on which parts of the biography structure are appropriate in different contexts (specified as a list of tag-value pairs inside a context object). For example, imagine that the user has specified that they do not have a good knowledge of artists. The template structure can specify that parts of the structure are only available to people with a good knowledge. Thus, when the user queries Linky for the template, the parts that require this knowledge are pruned away.

Figure 5 shows an example structure being pruned. In this case a query into the ontology concerning artistic influences (here it would resolve into a sentence about Caravaggio) is removed because it would not make sense to a user who did not have a reasonable knowledge of artists. The resulting paragraph reads:

‘Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden. Rembrandt’s father was a miller who died in 1630. His early work was devoted to showing the lines, light and shade, and color of the people he saw about him.’

Figure 5. Template pruning: The black context (representing knowledge of artists) has failed, resulting in the shaded structure being pruned.
[Diagram: a Sequence of three positions. The data objects contain the queries to the knowledge base, shown here by their results: ‘Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden’, ‘Rembrandt’s father was a miller who died in 1630’, and a Concept holding ‘His early work was devoted to showing the lines, light and shade, and color of the people he saw around him’ and ‘He was influenced by the work of Caravaggio’. Context objects describe in which context each part of the structure can be seen.]

In this way the biography structures will be tailored to the needs of each individual user. For our prototype we are concentrating on broad user classification (child/adult, etc.) but it would also be possible to incorporate more sophisticated user modelling techniques (such as training sets [19]).

Once it has been retrieved from Linky the template has to be instantiated, by making each query in turn and then rendering the results into an HTML page for display.

6 CONCLUSION & FUTURE WORK

In this paper we have described the basic architecture and initial work in the Artequakt project. Our aim is to be able to automatically generate tailored biographies from a knowledge base which has been automatically populated by annotating text fragments extracted from Web documents.

We are currently working on completing the initial prototype system by integrating the three main components identified in the architecture. We will then be able to assess its effectiveness over real data sources and begin the process of refining the constituent parts to improve the overall quality of the biographies served by the system.

Although some of the research issues in this process are particularly challenging, the final objective is to have an architecture in place which will allow us to explore in more detail some of the research issues that have arisen so far; for example, more comprehensive, automatic consolidation of knowledge bases, better techniques for knowledge extraction and more sophisticated narrative structuring of the knowledge fragments. To this end, progress has been made in the identification of an approach and the building of a prototype demonstrator for the project.

ACKNOWLEDGEMENTS

The work presented here is part of a larger project and we would particularly like to note the contributions of Hugh Glaser, Srinandan Dasmahapatra and David De Roure. This research is funded in part by EU Framework 5 IST project “Artiste” IST-1999-11978, EPSRC IRC project “Equator” GR/N15986/01 and EPSRC IRC project “AKT” GR/N15764/01.

REFERENCES

[1] H. Alani, Spatial and Thematic Ontology in Cultural Heritage Information Systems, Ph.D. dissertation, Computer Studies Department, University of Glamorgan, U.K., 2001.
[2] M. Bal, Narratology: Introduction to the Theory of Narrative, University of Toronto Press, 1978. Trans. Christine van Boheemen. Toronto. 1985.
[3] R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Survey of the state of the art in human language technology, 1995.
[4] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, ‘Learning to construct knowledge bases from the world wide web’, Artificial Intelligence, (1-2), 69–113, (2000).
[5] N. Crofts, D.M. Dionissiadou, and M. Stiff, ‘Definition of the CIDOC object-oriented conceptual reference model’, Technical report, International Organization for Standardization, (2000).
[6] H. Cunningham, K. Bontcheva, V. Tablan, C. Ursu, and M. Dimitrov, ‘Developing language processing components with GATE (user’s guide)’, Technical report, University of Sheffield, U.K., (2002). Available at http://www.gate.ac.uk/.
[7] N. Guarino and P. Giaretta, Ontologies and Knowledge bases: towards a terminological clarification, Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, IOS Press, 1995.
[8] S. Handschuh, S. Staab, and A. Maedche, ‘Cream - creating relational metadata with a component-based, ontology-driven annotation framework’, in Proceedings of the First International Conference on Knowledge Capture, pp. 76–83, Canada, (2001).
[9] P. Harpring, ‘Proper words in proper places: The thesaurus of geographic names’, MDA Information, (3), 5–12, (1997).
[10] L.L. Hill, J. Frew, and Q. Zheng, ‘Geographic names. The implementation of a gazetteer in a georeferenced digital library’, Digital Library, (1), (1999).
[11] J. Kahan and M.-R. Koivunen, ‘Annotea: An open RDF infrastructure for shared web annotations’, in Proceedings of The Tenth International World Wide Web Conference, WWW10, pp. 623–632, (2001).
[12] K. Lee, D. Luparello, and J. Roudaire, ‘Automatic Construction of Personalised TV News Programs’, in Proceedings of the Seventh ACM Conference on Multimedia, Orlando, Florida, pp. 323–332, (1999).
[13] A. Maedche, G. Neumann, and S. Staab, Bootstrapping an Ontology-Based Information Extraction System, Intelligent Exploration of the Web, Springer / Physica Verlag, 2002.
[14] C. Mancini, ‘From Cinematographic to Hypertext Narrative’, in Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia, San Antonio, Texas, USA, pp. 236–237, (2000).
[15] D.T. Michaelides, D.E. Millard, M.J. Weal, and D. DeRoure, ‘Auld leaky: A contextual open hypermedia link server’, in Hypermedia: Openness, Structural Awareness, and Adaptivity (Proceedings of OHS-7, SC-3 and AH-3), Published in Lecture Notes in Computer Science, (LNCS 2266), Springer Verlag, Heidelberg (ISSN 0302-9743), pp. 59–70, (2001).
[16] D.E. Millard, L. Moreau, H.C. Davis, and S. Reich, ‘FOHM: A Fundamental Open Hypertext Model for Investigating Interoperability Between Hypertext Domains’, in HT00, pp. 93–102, (2000).
[17] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, ‘Introduction to WordNet: An on-line lexical database’, Technical report, University of Princeton, U.S.A., (1993).
[18] M. A. Musen, R. W. Fergerson, W. E. Grosso, N. F. Noy, M. Crubézy, and J. H. Gennari, ‘Component-based support for building knowledge-acquisition systems’, in Proceedings of the Conference on Intelligent Information Processing of the International Federation for Processing World Computer Congress, Beijing, (2000).
[19] M. Pazzani and D. Billsus, ‘Learning and revising user profiles: the identification of interesting web sites’, Machine Learning, 313–331, (1997).
[20] L. Rutledge, B. Bailey, J. V. Ossenbruggen, L. Hardman, and J. Geurts, ‘Generating Presentation Constraints from Rhetorical Structure’, in Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia, San Antonio, Texas, USA, pp. 19–28, (2000).
[21] S. Sekine and R. Grishman, ‘A corpus-based probabilistic grammar with only two non-terminals’, in Proceedings of the Fourth International Workshop on Parsing Technology, pp. 216–223, (1995).
[22] S. Staab, A. Maedche, and S. Handschuh, ‘An annotation framework for the semantic web’, in Proceedings of the First International Workshop on MultiMedia Annotation, Japan, (2001).
[23] M. Vargas-Vera, E. Motta, and J. Domingue, ‘Knowledge extraction by using an ontology-based annotation tool’, in Proceedings of the Workshop on Knowledge Markup and Semantic Annotation, K-CAP’01, Canada, (2001).
[24] M.J. Weal, G.J. Hughes, D.E. Millard, and L. Moreau, ‘Open Hypermedia as a Navigational Interface to Ontological Information Spaces’, in Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia, Arhus, Denmark, pp. 227–236, (2001).
[25] R. Yangarber and R. Grishman, ‘Machine learning of extraction patterns from unannotated corpora: Position statement’, in Proceedings of Workshop on Machine Learning for Information Extraction, pp. 76–83, ECAI, Berlin, (2001).