Artequakt: Generating Tailored Biographies with
Automatically Annotated Fragments from the Web

Sanghee Kim and Harith Alani and Wendy Hall and Paul H. Lewis and David E. Millard
Nigel R. Shadbolt and Mark J. Weal

Intelligence, Agents, Multimedia Group, University of Southampton, SO17 1BJ, UK

Abstract. The Artequakt project is working towards automatically generating narrative biographies of artists from knowledge that has been extracted from the Web and maintained in a knowledge base. An overview of the system architecture is presented here and the three key components of that architecture are explained in detail, namely knowledge extraction, information management and biography construction. Conclusions are drawn from the initial experiences of the project and future plans are described.

1 INTRODUCTION

The growth of the World Wide Web (Web) and the corpus of documents that it covers increased the demand for content to be annotated. Such annotation facilitates systematic search and discovery of knowledge and intelligent information processing. Annotating existing Web documents forms one of the basic barriers towards realising the Semantic Web ([11], [25]).

Annotations can be roughly classified into two types. The first is concerned with identifying textual entities in documents that match information already existing in a knowledge base, e.g. the word ‘Rembrandt’ in the document is matched to a painter’s name in the knowledge base. Such annotations are normally restricted to the type and amount of information held in the knowledge base. The other type of annotation is involved in locating new factual information in documents based on a given domain classification structure, e.g. ‘Rembrandt’ in the document is the ‘name’ of a ‘Painter’, where Painter is a class in the ontology with the relation name. This new fact can be asserted in the knowledge base. This second type is the main approach taken to annotation in the Artequakt project.

Previous work on annotation has demonstrated the value of coupling Natural Language Processing (NLP) with ontologies ([13], [23]). The ontology can guide the annotation task by restricting it to a specific domain and, unlike “rigid templates”, can provide it with knowledge inference and conceptual browsing facilities [13]. An ontology-based approach for annotation needs to deal with the issues of duplicate information across documents, managing ontology change, and redundant annotations [22].

Annotation can exist in different forms and be used in a variety of ways. One interesting possibility is to use it to restructure the original source material in new ways, producing a dynamic presentation tailored to the user’s needs.

Previous work on the production of dynamic presentations has highlighted the difficulties of maintaining a rhetorical structure across a dynamically assembled sequence [20]; as a consequence, there has been a focus on dynamic presentation decisions as opposed to narrative ones [14]. Where dynamic narrative is present it has been based around robust story-schema such as the format of a news program (a sequence of atomic bulletins) [12].

It is our belief that by building a story-schema layer on top of an ontology we can create dynamic stories within a certain domain. By populating the ontology through automatic annotation software we could allow those stories to be constructed from the vast wealth of information that exists on the World Wide Web.

1.1 The Artequakt Project

The Artequakt project aims to implement such a system around the domain of artists and their paintings, automatically producing tailored biographies of artists from fragments of information extracted from the Web. This is not an attempt to out-perform hand-crafted biographies, but rather to gather information from a wide variety of sources and target it specifically at the interests of a particular reader. The first stage of this project consists of developing an ontology for the domain of artists and paintings. A selection of information extraction tools and techniques are being developed and applied that attempt to automatically generate annotated content from online documents based on the project’s ontology and WordNet lexicons. The annotations are stored in a knowledge base and will be analysed for duplications. In the second stage, narrative construction tools are being developed to query the knowledge base through an ontology-server to search and retrieve relevant facts or textual paragraphs and generate a specific biography. The automatic generation of tailored biographies is concerned with two areas of focus. Firstly, providing biographies for artists where there is sparse information available, distributed across the web. This may mean constructing text from basic factual information gleaned, or combining text from a number of sources with differing interests in the artist. Secondly, the project aims to provide biographies that are tailored to the particular interests and requirements of a given reader. These might range from rough stereotyping such as “A biography suitable for a child” to specific reader interests such as “I’m interested in the artist’s use of colour in their oil paintings”.

The expertise and experience of three separate projects are drawn together under the umbrella of the Artequakt project. These are:

The Artiste project - A European project working on a distributed
  database of art images in collaboration with partners that include
  the Louvre, the Uffizi Gallery, the National Gallery and the Victoria
  and Albert Museum.
The Equator IRC - An EPSRC funded Interdisciplinary Research
  Centre that, amongst many other activities, is investigating the use
  of narrative techniques in information structuring and presentation.
The AKT IRC - An EPSRC funded Interdisciplinary Research
  Centre looking at all aspects of the knowledge lifecycle.

Although focussing on artists and their paintings, the techniques being developed could be applied to other domains.

This paper will examine the overall proposed Artequakt architecture, looking at the three main component parts, namely, knowledge extraction, knowledge representation and storage and narrative generation.

2 ARCHITECTURE OVERVIEW

Figure 1 illustrates the system architecture used for the initial Artequakt demonstrator. Three key areas can be identified.

The first concerns the knowledge extraction tools. These are to be used to extract factual information items together with sentences and paragraphs from web documents that might be manually selected or obtained automatically using appropriate search engine technology. The fragments of information are passed to the ontology server along with metadata derived from the vocabulary of the ontology.

The second key area is the information management and storage. The information is being stored by the ontology server and consolidated into a knowledge base, focused on artists and paintings.

The final key area is the narrative generation. The Artequakt servlet will take requests from a reader via a simple web interface. The reader request will usually include an artist for whom to generate a biography in a particular style (chronology, through the paintings etc.) and also any user information; for example, the narrative might be generated specifically for a child or an art historian. The server then uses story templates to render a narrative from the information stored in the knowledge base. The rest of this paper will examine these three areas in more detail.

Figure 1. The Artequakt Architecture
[Diagram labels: Input Web Pages; Knowledge Extraction Tools; Ontology Server; Knowledge Base; Artequakt Servlet; Contextual Structure Server; Biography Templates; "Reader selects an artist and a biography style"; "The Biography is rendered as a web page".]

3 KNOWLEDGE EXTRACTION

The aim of the knowledge extraction section is to extract and identify factual information from the Web-based documents and to structure it appropriately for entry into the knowledge base. Much of the information from the Web is in the form of natural language documents. One of the promising approaches to providing easy access to such documents is centred on information extraction that reduces them into tabular structures from which the fragments of documents can be retrieved as answers to queries. However, the effort and time needed for annotating a large number of texts and the prerequisite of acquiring background knowledge that stipulates which types of information are extractable, are major challenges toward exploiting such extraction techniques for practical purposes ([25]). Work such as ([4], [13]) investigated the application of machine learning techniques in order to automatically identify patterns from annotated example texts.

In particular, whilst many attempts have been made to extract information from the Web by using manually annotated texts, no robust and reliable methodologies are yet available. Documents from the Web use limitless vocabularies, structures and/or composition styles for defining approximately the same content, implying that it is of little use to make efforts to locate recurrent syntactic patterns. For example, although content similarity between two biographic documents might be expected, expressions used for both sources may vary dramatically.

These observations have led us initially to use a natural language-based extraction approach for a comparatively deeper content understanding from which various clues concerning semantic and syntactic features can be obtained. The use of an ontology coupled with a general-purpose lexical database (WordNet [17]) as a guidance tool for creating interesting relations is another dimension of our initial approach aiming at minimising reliance on domain-specific extraction rules. Figure 2 shows extraction results based on the example of ‘Rembrandt’s father was a miller who died in 1630’. Two biographic pieces of information about ‘Rembrandt’s father’ (i.e. ‘job title (miller)’ and ‘date of death (1630)’), were captured as well as the fact that ‘Rembrandt’ is a person and he is the son of a dead miller.

3.1 Natural Language Information Extraction

The capability of recognising a named entity without the annotation effort of humans or without the need to create extraction rules is one of the objectives of our approach. The idea is to make use of general-purpose lexical databases and to exploit the knowledge from syntactical and semantic analysis to clarify the types and structures of given information. Although the proposed approach may not be as sophisticated as manually annotated definitions, its contribution lies in its extensibility and practical nature (acceptable performance). We use a paragraph as a unit of semantic analysis instead of a sentence, since much of the critical information used for interpreting text is scattered in different sentences (as observed in [3]). Downloaded documents from the Web are first divided into paragraphs, which are then broken down into a group of sentences. The paragraphs are analysed as follows:

1. Syntactical analysis: A sentence is decomposed into a set of grammatically related phrases (e.g. a verb-phrase, or a noun-phrase). We have used the Apple Pie Parser, which is a bottom-up probabilistic chart parser and is freely available [21].
2. Semantic analysis:
   - Identification of main components: each compound sentence is decomposed into simplified structures, each of which contains one clause, i.e. a simple sentence. Each clause is clustered as one of three parts: subject, verb, and object. Temporal properties are inferred from a verb tense (e.g. ‘past’, ‘present’), and associated with the sentence. A writing style (e.g. ‘first-person’, ‘third-person’) can be derived from the personal pronoun if it exists in the sentence’s subject.

   - Recognition of named entity: two resources are used for determining whether or not a given word denotes a person’s name. The first is syntactical tags, which are obtained as the result of the syntactical analysis carried out by the Apple Pie Parser. The second is gazetteers of people names, which are available as part of the GATE (General Architecture for Text Engineering, [6]) package. GATE provides text files which contain person names associated with gender attributes. A name which is not defined in GATE’s text files will still be extractable if it is tagged as a proper noun. Heuristics and grammar rules are applied in order to extract only proper personal nouns.

   - Resolution of pronoun references (anaphoric references): a personal pronoun refers to a specific person, and acts as a subject (‘he’ or ‘she’), an object (‘him’ or ‘her’), or a marker of possession defining who owns a particular thing (‘his’ or ‘hers’). Currently we are using a simple resolution function that runs at reasonably fast speed obtaining the best-guessed referent. Three attributes (gender, number, and structural information) are considered in determining the right referent.

   - Adding a missing subject: a clause can inherit a subject from a main clause, since it is syntactically dependent on the main clause.

In Figure 2, the given example ‘Rembrandt’s father was a miller who died in 1630’ is divided into two clauses. The same subject (i.e. ‘Rembrandt’s father’) is assigned to both clauses since the second clause is dependent on the first one. At this stage, ‘Rembrandt’ was successfully recognised as a person’s name. Gazetteers provided by GATE do not contain the name ‘Rembrandt’, whereas syntactic tags for this sentence mark it as a proper noun.

3.2 Relation Extraction

To create a binary relationship between two extracted individual facts, knowledge about the pre-defined semantic relations will be required. Consulting the ontology, which specifies various relationships among classes, will act as a basis for decisions concerning which relations to use. A query is submitted to the ontology server to obtain such knowledge.

In order to reduce the problem of linguistic variation between relations defined in the ontology and the extracted facts, we will use three lexical chains (synonyms, hypernyms, and hyponyms) as defined in WordNet. For example, the concept of ‘depict’ is matched with ‘portray’ (synonym) and ‘represent’ (hypernym). In order to reduce over- and under-generalisation, we will consider only one level of hypernyms and hyponyms when a given word is a verb.

The types of information are identified by tracing the hierarchies of hypernyms. For example, as shown in Figure 2, ‘miller’ is extracted as the job of Rembrandt’s father since the hypernyms map to ‘worker’. Factual data, such as a date or a city name, are extracted by using a date parsing program coupled with a simple grammar and the hypernyms defined in WordNet. In cases where there are multiple matches, all relations are represented in the output.

Figure 2. An example of knowledge extraction using ontology and WordNet

In Figure 2, relation extraction for both clauses is determined by the categorisation results of verbs (i.e. ‘be’ and ‘die’). The ‘be’ verb poses a rather difficult case, since its semantic meaning is heavily dependent on other phrases, i.e. subject and object. According to WordNet definitions, one of its senses states ‘work in a specific place, with a specific subject or in a specific function’. Since its synonyms (i.e. ‘work’ and ‘follow’) are matched with ‘work’, we exploit this relation to further examine whether or not it is related to ‘job-information’.

In the second clause, since ‘die’ can be converted to the noun format ‘death’, the verb ‘die’ matches with two potential relations (‘date of death’ and ‘place of death’). In this case ‘date of death’ was chosen since the ‘1630’ was extracted from the same sentence and instantiated as date information.

The output from this section is an XML-formatted representation of the facts, paragraphs, sentences and keywords identified in the knowledge extraction process. The XML files are sent to the ontology server to populate the knowledge base.

4 KNOWLEDGE REPRESENTATION AND STORAGE

4.1 Artequakt Ontology

An ontology is a conceptualisation of a domain into a machine readable format [7]. For Artequakt the requirement is to build an ontology to represent the domain of artists and artefacts. This ontology is being implemented in Protégé, which is a graphical ontology editing tool [18]. The main part of this ontology is being constructed from selected sections in the CIDOC Conceptual Reference Model (CRM - [5]) ontology. CRM was developed by the ICOM/CIDOC (http://www.cidoc.icom.org/) Documentation Standards Group to represent an ontology for cultural heritage information. It was built to facilitate the transformation of existing disparate museum and cultural heritage information sources into one coherent source.

The CRM ontology is designed to represent artefacts, their production, ownership, location, etc. This ontology was modified for Artequakt and is being enriched with additional classes and relationships to represent a variety of information related to artists, their personal
information, family relations, relations with other artists, details of their work, etc. The Artequakt ontology also allows the storage of textual paragraphs or sentences along with their source URLs so that at a later point they can be reorganised using the ontology as a guide.

4.2 Automatic Ontology Population

There is an increasing interest in building ontologies to provide a variety of knowledge services. Populating ontologies with knowledge is labour intensive and time consuming. Semi-automatic approaches to ontology population have been followed by, for example, [23], where relationships can be added automatically between instances if these instances already exist in the knowledge base; otherwise user intervention will be needed. OntoAnnotate [22] and OntoMat [8] are supporting tools of user-driven ontology-based annotations, where the produced annotations can be fed back to the ontology.

In this project we are investigating the possibility of moving towards a fully automatic approach of feeding the ontology with knowledge extracted from the web. As mentioned in section 3.2, this information is extracted with respect to the Artequakt ontology, and provided as XML files, one per document, using tags mapped directly from names of classes and relationships in the ontology. When a new XML file is produced (Figure 3(a)), it will be sent to the Artequakt ontology server which launches a program to parse the received file and populate the ontology with the newly provided knowledge (Figure 3(b)).

The ontology server is based on Java sockets and connected to the Artequakt knowledge base through the Protégé API. A limited inference engine is being built on this server to allow querying and the retrieval of specific information from the ontology, for example to get all paragraphs that mention the date of birth of a specific artist, get the artist of a painting, get all available facts about an artist, etc.

4.3 Consolidating the Knowledge-Base

When analysing web documents about selected artists, it will be inevitable that we extract duplicated information or even contradictory information. Handling such information is challenging for automatic ontology population approaches. Staab et al. [22] stressed the problem of creating duplicate objects when extracting from different documents. They relied on manually assigned object-identifiers to avoid duplication. Our approach is attempting to identify and eliminate duplications automatically using a two-stage consolidation process.

The first stage is for the Artequakt ontology server to add all extracted information to the knowledge base regardless of what is already stored. This results in the creation of multiple instances of artists with possibly the same information (e.g. multiple instances of Rembrandt). The challenge is to identify which of these instances refer to the same artist, and which ones refer to genuinely different artists who happen to have the same name or information.

The second stage is to run a consolidation process to identify possible duplicate instances in the knowledge base, searching for clues in the rest of information available about these instances. This is why it is best to feed the new information to the knowledge base first (stage 1), which provides the consolidation process with more information to compare with.

The consolidation process involves applying a set of heuristics. Information extraction tools are sometimes only able to extract fragments of information about an artist, especially if the source document or paragraph is small or difficult to analyse. This results in the creation of new instances with only one or two facts associated with each, for example two artist instances with the name Rembrandt, but one instance has a location relationship to Holland, while the other has a date of birth relationship to 1609. One heuristic to apply here is to merge such shallow instances into one instance of Rembrandt with both location and date of birth relations, keeping the original source URLs of each fact.

Another heuristic is if two instances of same-name artists have equal values for their date and place of birth and death relationships, then these instances are likely to be duplicates, in which case they can be fused together as one instance, otherwise the two instances will stay separate. Such a heuristic helps to distinguish between same-name artists. The amount and type of information overlap between instances can be used to calculate a confidence value to indicate whether certain instances can be merged or left separate.

Another challenge in information consolidation is to identify exact matches. Identical information can exist in different versions. For example consider the sentences:

   - Rembrandt was born in the 17th century in Leiden.
   - Rembrandt was born in 1606 in the Netherlands.
   - Rembrandt was born on July 15 1606 in Holland.

The sentences above provide similar information about an artist, written in different formats and specificity levels. To match the above sentences it will be necessary to enrich the current ontology with proper temporal and geographical representations. Some format varieties can be dealt with at the extraction level. For example the information extraction tools being used in this project can identify and extract dates in different formats, and provide them as day, month, year, decade, etc. This information could be fed to the temporal ontology and reasoned over to match between different time frames.

There has been much work on developing databases and gazetteers of place names, such as the Thesaurus of Geographic Names (TGN, [9]), Alexandria Digital Library (ADL, [10]), and WordNet which also provides some geo-information. Such sources can be integrated with the current ontology to provide knowledge on geographical hierarchies, place name variations, and other spatial information [1].

5 NARRATIVE GENERATION

While machines benefit from using structured ontologies to exchange information, human beings need a more intuitive interface. One of the most natural ways to do this is by story telling. There is a wealth of critical and philosophical thought concerning narrative that can be drawn on to assist in constructing a story (in this case a biography) from the raw information gathered. Figure 4 shows one way of viewing the layers that make up a narrative as proposed by Bal [2]. The raw facts and chronological collection of events in any particular tale is called the Fabula. For any given Fabula we could present the facts from different perspectives and in different sequences to produce a Story. We could then render any given Story into several different forms or Narratives (e.g. a film or novel).

In Artequakt the knowledge base can be thought of as our underlying fabula. To produce the eventual narrative (in our case pages of html) we need to first arrange sub-elements of the fabula into a sensible sequence and produce a story.

5.1 Biography Templates

The story structures we are using are human authored biography templates that contain queries into the knowledge base.
<Paragraph>
  <url>http://search.ebi.eb.com/ebi/article/0,6101,36822,00.html</url>
  <text>Rembrandt Harmenszoon van Rijn was born on July 15, 1606, in
    Leiden, the Netherlands… Rembrandt left the University of Leiden to
    study painting. … He was influenced by the work of Caravaggio and was
    fascinated by the work of many other Italian artists.</text>
  …..
  <sentence>Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden
    <Painter>
      <name>Rembrandt Harmenszoon van Rijn</name>
      <date_of_birth>15 july 1606</date_of_birth>
      <place_of_birth>Leiden Netherlands</place_of_birth>
    </Painter>
  </sentence>
  …..
  <sentence>He was influenced by the work of Caravaggio
    <Painter>
      <name>rembrandt</name>
      <inspired_by>Caravaggio</inspired_by>
    </Painter>
  </sentence>
  ……
</Paragraph>
                                                                            (a)                                                  (b)

 Figure 3. a) XML file of extracted information is sent to the ontology server, b) The server creates the relevant instances and relationships in the ontology.
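The flow from the XML of Figure 3(a) through the two-stage consolidation process described earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Artequakt implementation: the sample XML is a simplified, well-formed stand-in for the figure, the URL is a placeholder, the function names are hypothetical, and the matching rule (one name is a token subset of the other and no shared fact conflicts) is an invented heuristic chosen for brevity.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the extracted XML in Figure 3(a); the URL is a placeholder.
SAMPLE = """
<Paragraph>
  <url>http://example.org/rembrandt</url>
  <sentence>Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden
    <Painter>
      <name>Rembrandt Harmenszoon van Rijn</name>
      <date_of_birth>15 july 1606</date_of_birth>
      <place_of_birth>Leiden Netherlands</place_of_birth>
    </Painter>
  </sentence>
  <sentence>He was influenced by the work of Caravaggio
    <Painter>
      <name>rembrandt</name>
      <inspired_by>Caravaggio</inspired_by>
    </Painter>
  </sentence>
</Paragraph>
"""


def extract_instances(xml_text):
    """Stage 1: create one instance per <Painter>, ignoring possible duplicates."""
    root = ET.fromstring(xml_text)
    instances = []
    for painter in root.iter("Painter"):
        facts = {child.tag: (child.text or "").strip() for child in painter}
        instances.append(facts)
    return instances


def name_tokens(instance):
    """Lower-cased set of the words in an instance's name."""
    return set(instance.get("name", "").lower().split())


def compatible(a, b):
    """Hypothetical heuristic: names overlap and no shared fact disagrees."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb or not (ta <= tb or tb <= ta):
        return False
    shared = (set(a) & set(b)) - {"name"}
    return all(a[key].lower() == b[key].lower() for key in shared)


def consolidate(instances):
    """Stage 2: merge instances that the heuristic judges to be the same artist."""
    merged = []
    for inst in instances:
        for target in merged:
            if compatible(target, inst):
                fullest_name = max(target["name"], inst["name"], key=len)
                target.update({k: v for k, v in inst.items() if k not in target})
                target["name"] = fullest_name
                break
        else:
            merged.append(dict(inst))
    return merged
```

Running `consolidate(extract_instances(SAMPLE))` merges the two Painter fragments into a single instance that carries the birth facts and the `inspired_by` relation. A name-only rule like this would wrongly merge genuinely different artists who share a name, which is exactly the difficulty noted above, so a realistic consolidation step would need further clues from the other available facts.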


   [Figure 4: a single Fabula gives rise to several Stories, each of which
   can be rendered as several Narratives. In the implementation these levels
   correspond to the Ontology + Knowledge Base, the Contextual Templates and
   the HTML Pages respectively.]

                     Figure 4. The Levels of Narrative

   Previous work has stored queries into an ontological space as the
destination of navigational links [24]; by following the links, the user
causes the queries to be executed (and views the results). With Artequakt
these basic links have evolved into more complex structures that arrange
the queries into a sequence (a biography template).
   The templates are written in XML using the Fundamental Open Hypertext
Model (FOHM) [16], which is capable of representing a variety of hypermedia
structures including tours and links. The XML files are then loaded into
the Auld Linky contextual structure server [15], which provides pattern
matching facilities over the structures via HTTP.
   Any given biography template may be constructed from several
sub-structures. The basic structure used is the Sequence. This represents a
list of queries that have to be instantiated from the knowledge base and
inserted into the biography in order. These queries are authored using the
vocabulary of terms defined within the ontology. Other structures allow
more complex effects. A Concept structure contains several queries, any of
which may be used at this point in the biography. A Level of Detail (LOD)
structure is similar to a Concept, but there is an ordering between the
queries that corresponds to preference (i.e. preferably the highest
numbered query should be used; if that is not possible, the next highest,
and so on). These structures may be nested (e.g. a sequence of concepts).
   Some queries may retrieve paragraphs directly while others query the
consolidated ontology for specific facts and construct sentences
dynamically from the results. This can be useful for facts that have been
inferred (and therefore there is no corresponding paragraph), or when there
is no paragraph that fits the literary form of the rest of the biography
(e.g. the biography is in third person, but all the available paragraphs
are in first person).
   The templates also contain contextual information on which parts of the
biography structure are appropriate in different contexts (specified as a
list of tag-value pairs inside a context object). For example, imagine that
the user has specified that they do not have a good knowledge of artists.
The template structure can specify that parts of the structure are only
available to people with a good knowledge. Thus, when the user queries
Linky for the template, the parts that require such knowledge are pruned
away.
   Figure 5 shows an example structure being pruned. In this case a query
into the ontology concerning artistic influences (here it would resolve
into a sentence about Caravaggio) is removed because it would not make
sense to a user who did not have a reasonable knowledge of artists. The
resulting paragraph reads:

   ‘Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden.
Rembrandt’s father was a miller who died in 1630. His early work was
devoted to showing the lines, light and shade, and color of the people he
saw about him.’

   In this way the biography structures will be tailored to the needs of
each individual user. For our prototype we are concentrating on broad user
classification (child/adult, etc.) but it would also be possible to
incorporate more sophisticated user modelling techniques (such as training
sets [19]).
   Once it has been retrieved from Linky, the template has to be
instantiated by making each query in turn and then rendering the results
into an HTML page for display.
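The template structures of Section 5.1 (Sequence, Concept and Level of Detail, each optionally guarded by a context) can be illustrated with a small Python sketch. It is not FOHM or the Auld Linky API: the class names, the `artist_knowledge` tag, the toy knowledge base of ready-made sentences, and the rule that a Concept uses the first query that yields a result are all assumptions made for the example. Context matching is likewise simplified to "every tag on the structure must equal the user's value".

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    key: str                                      # hypothetical lookup key into the toy KB
    context: dict = field(default_factory=dict)   # tag-value pairs guarding this part


@dataclass
class Structure:
    kind: str                                     # "sequence", "concept" or "lod"
    items: list
    context: dict = field(default_factory=dict)


def visible(node, user):
    """A part is visible only if every tag it declares matches the user's context."""
    return all(user.get(tag) == value for tag, value in node.context.items())


def prune(node, user):
    """Return a copy of the structure without the parts this user may not see."""
    if not visible(node, user):
        return None
    if isinstance(node, Query):
        return node
    kept = [p for p in (prune(child, user) for child in node.items) if p is not None]
    return Structure(node.kind, kept, node.context)


def instantiate(node, kb):
    """Run the queries against the knowledge base and return sentences in order."""
    if isinstance(node, Query):
        text = kb.get(node.key)
        return [text] if text else []
    if node.kind == "sequence":
        return [s for child in node.items for s in instantiate(child, kb)]
    if node.kind not in ("concept", "lod"):
        raise ValueError(f"unknown structure kind: {node.kind}")
    # LOD prefers the highest numbered query; a Concept (by assumption) takes the first.
    candidates = list(reversed(node.items)) if node.kind == "lod" else node.items
    for child in candidates:
        result = instantiate(child, kb)
        if result:
            return result
    return []


KB = {
    "birth": "Rembrandt Harmenszoon van Rijn was born on July 15 1606 in Leiden.",
    "father": "Rembrandt's father was a miller who died in 1630.",
    "influence": "He was influenced by the work of Caravaggio.",
    "early_work": "His early work was devoted to showing the lines, light and "
                  "shade, and color of the people he saw about him.",
}

TEMPLATE = Structure("sequence", [
    Query("birth"),
    Query("father"),
    Structure("concept", [
        Query("influence", context={"artist_knowledge": "good"}),
        Query("early_work"),
    ]),
])


def biography(user):
    """Prune the template for this user, then instantiate it as one paragraph."""
    pruned = prune(TEMPLATE, user)
    return " ".join(instantiate(pruned, KB)) if pruned else ""
```

Calling `biography({"artist_knowledge": "poor"})` prunes the influence query and yields the three-sentence paragraph quoted in the text, while `{"artist_knowledge": "good"}` keeps the sentence about Caravaggio. Separating `prune` from `instantiate` mirrors the description above, in which the server returns an already pruned template and the queries are only made afterwards.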
   [Figure 5: a Sequence of three items: (1) ‘Rembrandt Harmenszoon van Rijn
   was born on July 15 1606 in Leiden’; (2) ‘Rembrandt's father was a miller
   who died in 1630’; (3) a Concept containing ‘His early work was devoted
   to showing the lines, light and shade, and color of the people he saw
   around him’ and ‘He was influenced by the work of Caravaggio’. Legend:
   Context object, describing in which context this part of the structure
   can be seen; Data object, containing the query to the knowledge base (the
   results of the queries are shown in the figure).]

Figure 5. Template pruning: The black context (representing knowledge of
     artists) has failed, resulting in the shaded structure being pruned.


6 CONCLUSION & FUTURE WORK

In this paper we have described the basic architecture and initial work in
the Artequakt project. Our aim is to be able to automatically generate
tailored biographies from a knowledge base which has been automatically
populated by annotating text fragments extracted from Web documents.
   We are currently working on completing the initial prototype system by
integrating the three main components identified in the architecture. We
will then be able to assess its effectiveness over real data sources and
begin the process of refining the constituent parts to improve the overall
quality of the biographies served by the system.
   Although some of the research issues in this process are particularly
challenging, the final objective is to have an architecture in place which
will allow us to explore some of the research issues that have arisen so
far in more detail; for example, more comprehensive, automatic
consolidation of knowledge bases, better techniques for knowledge
extraction and more sophisticated narrative structuring of the knowledge
fragments. To this end, progress has been made in the identification of an
approach and the building of a prototype demonstrator for the project.


ACKNOWLEDGEMENTS

The work presented here is part of a larger project and we would
particularly like to note the contributions of Hugh Glaser, Srinandan
Dasmahapatra and David De Roure. This research is funded in part by EU
Framework 5 IST project “Artiste” IST-1999-11978, EPSRC IRC project
“Equator” GR/N15986/01 and EPSRC IRC project “AKT” GR/N15764/01.


REFERENCES

 [1] H. Alani, Spatial and Thematic Ontology in Cultural Heritage
     Information Systems, Ph.D. dissertation, Computer Studies Department,
     University of Glamorgan, U.K., 2001.
 [2] M. Bal, Narratology: Introduction to the Theory of Narrative,
     University of Toronto Press, 1978. Trans. Christine van Boheemen.
     Toronto. 1985.
 [3] R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Survey of the
     state of the art in human language technology, 1995.
 [4] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell,
     K. Nigam, and S. Slattery, ‘Learning to construct knowledge bases from
     the world wide web’, Artificial Intelligence, (1-2), 69–113, (2000).
 [5] N. Crofts, D.M. Dionissiadou, and M. Stiff, ‘Definition of the CIDOC
     object-oriented conceptual reference model’, Technical report,
     International Organization for Standardization, (2000).
 [6] H. Cunningham, K. Bontcheva, V. Tablan, C. Ursu, and M. Dimitrov,
     ‘Developing language processing components with GATE (user’s guide)’,
     Technical report, University of Sheffield, U.K., (2002). Available at
     http://www.gate.ac.uk/.
 [7] N. Guarino and P. Giaretta, Ontologies and Knowledge bases: towards
     a terminological clarification. Towards Very Large Knowledge Bases:
     Knowledge Building and Knowledge Sharing, IOS Press, 1995.
 [8] S. Handschuh, S. Staab, and A. Maedche, ‘CREAM - creating relational
     metadata with a component-based, ontology-driven annotation
     framework’, in Proceedings of the First International Conference on
     Knowledge Capture, pp. 76–83, Canada, (2001).
 [9] P. Harpring, ‘Proper words in proper places: The thesaurus of
     geographic names’, MDA Information, (3), 5–12, (1997).
[10] L.L. Hill, J. Frew, and Q. Zheng, ‘Geographic names: the
     implementation of a gazetteer in a georeferenced digital library’,
     Digital Library, (1), (1999).
[11] J. Kahan and M.-R. Koivunen, ‘Annotea: An open RDF infrastructure for
     shared web annotations’, in Proceedings of The Tenth International
     World Wide Web Conference, WWW10, pp. 623–632, (2001).
[12] K. Lee, D. Luparello, and J. Roudaire, ‘Automatic Construction of
     Personalised TV News Programs’, in Proceedings of the Seventh ACM
     Conference on Multimedia, Orlando, Florida, pp. 323–332, (1999).
[13] A. Maedche, G. Neumann, and S. Staab, Bootstrapping an Ontology-Based
     Information Extraction System, Intelligent Exploration of the Web,
     Springer / Physica Verlag, 2002.
[14] C. Mancini, ‘From Cinematographic to Hypertext Narrative’, in
     Proceedings of the Eleventh ACM Conference on Hypertext and
     Hypermedia, San Antonio, Texas, USA, pp. 236–237, (2000).
[15] D.T. Michaelides, D.E. Millard, M.J. Weal, and D. DeRoure, ‘Auld
     Leaky: A contextual open hypermedia link server’, in Hypermedia:
     Openness, Structural Awareness, and Adaptivity (Proceedings of OHS-7,
     SC-3 and AH-3), Published in Lecture Notes in Computer Science, (LNCS
     2266), Springer Verlag, Heidelberg (ISSN 0302-9743), pp. 59–70, (2001).
[16] D.E. Millard, L. Moreau, H.C. Davis, and S. Reich, ‘FOHM: A
     Fundamental Open Hypertext Model for Investigating Interoperability
     Between Hypertext Domains’, in HT00, pp. 93–102, (2000).
[17] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller,
     ‘Introduction to WordNet: An on-line lexical database’, Technical
     report, University of Princeton, U.S.A., (1993).
[18] M. A. Musen, R. W. Fergerson, W. E. Grosso, N. F. Noy, M. Crubézy,
     and J. H. Gennari, ‘Component-based support for building
     knowledge-acquisition systems’, in Proceedings of the Conference on
     Intelligent Information Processing of the International Federation for
     Information Processing World Computer Congress, Beijing, (2000).
[19] M. Pazzani and D. Billsus, ‘Learning and revising user profiles: The
     identification of interesting web sites’, Machine Learning, 313–331,
     (1997).
[20] L. Rutledge, B. Bailey, J. V. Ossenbruggen, L. Hardman, and J. Geurts,
     ‘Generating Presentation Constraints from Rhetorical Structure’, in
     Proceedings of the Eleventh ACM Conference on Hypertext and
     Hypermedia, San Antonio, Texas, USA, pp. 19–28, (2000).
[21] S. Sekine and R. Grishman, ‘A corpus-based probabilistic grammar
     with only two non-terminals’, in Proceedings of the Fourth
     International Workshop on Parsing Technology, pp. 216–223, (1995).
[22] S. Staab, A. Maedche, and S. Handschuh, ‘An annotation framework for
     the semantic web’, in Proceedings of the First International Workshop
     on MultiMedia Annotation, Japan, (2001).
[23] M. Vargas-Vera, E. Motta, and J. Domingue, ‘Knowledge extraction
     by using an ontology-based annotation tool’, in Proceedings of
     the Workshop on Knowledge Markup and Semantic Annotation, K-CAP’01,
     Canada, (2001).
[24] M.J. Weal, G.J. Hughes, D.E. Millard, and L. Moreau, ‘Open Hypermedia
     as a Navigational Interface to Ontological Information Spaces’, in
     Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia,
     Århus, Denmark, pp. 227–236, (2001).
[25] R. Yangarber and R. Grishman, ‘Machine learning of extraction
     patterns from unannotated corpora: Position statement’, in Proceedings
     of Workshop on Machine Learning for Information Extraction, pp. 76–83,
     ECAI, Berlin, (2001).