Web based Knowledge Extraction and Consolidation for
             Automatic Ontology Instantiation
                           Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal
                                        Wendy Hall, Paul H. Lewis, Nigel Shadbolt
                                                      I.A.M. Group, ECS Dept.
                                                     University of Southampton
                                                          Southampton, UK
                                          {ha, sk, dem, mjw, wh, phl, nrs}@ecs.soton.ac.uk


ABSTRACT
The Web is probably the largest and richest information repository available today. Search engines are the common access routes to this valuable source. However, the role of these search engines is often limited to the retrieval of lists of potentially relevant documents. The burden of analysing the returned documents and identifying the knowledge of interest is therefore left to the user. The Artequakt system aims to deploy natural language tools to automatically extract and consolidate knowledge from web documents and instantiate a given ontology, which dictates the type and form of knowledge to extract. Artequakt focuses on the domain of artists, and uses the harvested knowledge to generate tailored biographies. This paper describes the latest developments of the system and discusses the problem of knowledge consolidation.

Categories and Subject Descriptors
I.2.6 Learning – Knowledge acquisition
I.2.7 Natural Language Processing – Text analysis, Language parsing and understanding

Keywords
Information Extraction, Ontology Instantiation, and Knowledge Consolidation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
K-CAP’03, October 23-25, 2003, Sanibel Island, FL, USA.
Copyright 2003 ACM 1-58113-000-0/00/0000…$5.00

INTRODUCTION
Web pages are the source of vast amounts of knowledge. This knowledge is often buried by layers of text and scattered over numerous sites. Associating web pages with annotations to identify their knowledge content is the ambition of the Semantic Web [3]. Much research is now focused on developing ontologies to manipulate this knowledge and provide a variety of knowledge services. Automatic instantiation of ontologies and building knowledge bases (KB) with knowledge extracted from the web corpus is therefore very beneficial. Artequakt is concerned with automating ontology instantiation with knowledge triples (subject - relation - object) about the life and work of artists, and providing this knowledge for biography generation services.

When analysing and extracting information from multi sourced documents, it is inevitable that duplicated and contradictory information will be extracted. Handling such information is challenging for automatic extraction and ontology instantiation approaches [18]. Artequakt applies a set of heuristics and reasoning methods in an attempt to distinguish conflicting information, to verify it, and to identify and merge duplicate assertions in the KB automatically. This paper describes the main components of the Artequakt system, focusing on the latest development with respect to knowledge consolidation and ontology instantiation.

RELATED WORK
Extracting information from web pages to generate various reports is becoming the focus of much research. The closest work we found to Artequakt is the area of text summarisation. A number of summarisation techniques have been described to help bring together important pieces of information from documents and present them to the user in a compact form.

Even though most summarisation systems deal with single documents, some have targeted multiple resources [12][23]. Statistical based summarisations tend to be domain independent, but lack the sophistication required for merging information from multiple documents [17]. On the other hand, Information Extraction (IE) based summarisations are more capable of extracting and merging information from various resources, but due to the use of IE, they are often domain dependent.

Radev developed the SUMMONS system [17] to extract information and generate summaries of individual events from MUC (Message Understanding Conferences) text corpuses. The system compares information extracted from
multiple resources, merges similar content and highlights contradictions. However, as in most IE based systems, information merging is often based on linguistics and timeline comparison of single events [17][23] or multiple events [18].

Artequakt’s knowledge consolidation is based on the comparison of individual knowledge fragments, rather than linguistic analyses or timeline comparison. Furthermore, Artequakt’s consolidation is more fine-grained, focusing on the comparison and merging of individual entities (e.g. places, people, dates).

Most traditional IE systems are domain dependent due to the use of linguistic rules designed to extract information of specific content (e.g. bombing events (MUC systems), earthquake news [23], sports matches [18]). Adaptive IE systems [4] can ease this problem by identifying new extraction rules induced from example annotations supplied by users. However, training such tools can be difficult and time consuming. Promising results are offered by more advanced adaptive IE tools, such as Armadillo [6], which discovers new linguistic and structural patterns automatically, thus requiring limited bootstrapping.

Using ontologies to back up IE is hoped to support information integration [2][18] and increase domain portability [10][11]. Poibeau [16] investigated increasing domain independency by using clustering methods on text corpuses to help users construct primitive ontologies to represent the main corpus topics. Templates could then be generated from the ontology and guide the IE process. Ontologies produced by this approach are limited to the content of the corpus, rather than representing a specific domain. In some cases (such as in Artequakt) the corpus is very large and diverse (e.g. the Web). Creating ontologies from such a corpus is infeasible. Furthermore, these ontologies are likely to be rough, shallow, and include undesired concepts that happen to be in the text corpus. Consequently, the cost of bringing such ontologies into shape might exceed the benefit.

Instantiating ontologies with assertions from textual documents can be a very laborious task. A number of tools have been developed that instantiate ontologies semi automatically with user driven annotations [20]. IE learning tools, such as Amilcare [4], can be used to automate part of the annotation process and speed up ontology instantiation [7][21].

ARTEQUAKT
The Artequakt project has implemented a system that searches the Web and extracts knowledge about artists, based on an ontology describing that domain, and stores this knowledge in a KB to be used for automatically producing personalised biographies of artists. Artequakt draws from the expertise and experience of three separate projects; Sculpteur1, Equator2, and AKT3. The main components of Artequakt are described in the following sections.

1 http://www.sculpteurweb.org/
2 http://www.equator.ac.uk/
3 http://www.aktors.org/

System Overview
Figure 1 illustrates Artequakt’s architecture, which comprises three key areas. The first concerns the knowledge extraction tools used to extract factual information from documents and pass it to the ontology server. The second key area is information management and storage. The information is stored by the ontology server and consolidated into a KB which can be queried via an inference engine. The final area is the narrative generation. The Artequakt server takes requests from a reader via a simple Web interface. The request will include an artist and the style of biography to be generated (chronology, summary, fact sheet, etc.). The server uses story templates to render a narrative from the information stored in the KB using a combination of original text fragments and natural language generation.

Figure 1. The Artequakt Architecture

The architecture is designed to allow different approaches to information extraction to be incorporated with the ontology acting as a mediation layer between the IE and the KB. Currently we are using textual analysis tools to scrape web pages for knowledge, but with the increasing proliferation of the semantic web, additional tools could be added that take advantage of any semantically augmented pages passing the embedded knowledge through the KB.

As well as keeping open the interface between the KB and the extraction technology, a clear separation has been kept between the creation of a structured document from the knowledge base and the rendering of that document. In the current system, the information is rendered into an HTML page but alternative-rendering engines could be envisaged. For example, rather than presenting the biography as a linear textual document, the information might be rendered into a dynamic presentation system such as SMIL, converted into an audio stream using text to speech tools, or perhaps used to generate a dynamic hypertext with links referring back to queries to the KB on items such as artists’ names.

<kb:Person rdf:about="&kb;Person_1"
       kb:name="Pierre-Auguste Renoir"
       rdfs:label="Person_1">
       <kb:date_of_birth rdf:resource="&kb;Date_1"/>
       <kb:place_of_birth rdf:resource="&kb;Place_1"/>
       <kb:has_father rdf:resource="&kb;Person_2"/>
       <kb:has_information_text rdf:resource="&kb;Paragraph_1"/>
</kb:Person>
<kb:Date rdf:about="&kb;Date_1"
       kb:day="25"
       kb:month="2"
       kb:year="1841"
       rdfs:label="Date_1">
</kb:Date>
<kb:E53.Place rdf:about="&kb;Place_1"
       kb:name="Limoges"
       rdfs:label="Place_1"/>
<kb:Person rdf:about="&kb;Person_2"
       rdfs:label="Person_2">
       <kb:has_work_information rdf:resource="&kb;Work_information_1"/>
</kb:Person>
<kb:Work_information rdf:about="&kb;Work_information_1"
       kb:job_title="tailor"
       rdfs:label="Work_information_1">
</kb:Work_information>

Figure 2. RDF representation of knowledge extracted from the paragraph: “Pierre-Auguste Renoir was born in Limoges on February 25, 1841. His father was a tailor.”

Artequakt Ontology
For Artequakt the requirement was to build an ontology to represent the domain of artists and artefacts. The main part of this ontology was constructed from selected sections in the CIDOC Conceptual Reference Model (CRM4) ontology. The CRM ontology is designed to represent artefacts, their production, ownership, location, etc. This ontology was modified for Artequakt and enriched with additional classes and relationships to represent a variety of information related to artists, their personal information, family relations, relations with other artists, details of their work, etc. The Artequakt ontology and KB are accessible via an ontology server.

KNOWLEDGE EXTRACTION
The aim of our knowledge extraction tool is to identify and extract knowledge triples from text documents and to provide them as RDF files for entry into the KB [10]. Artequakt uses an ontology coupled with a general-purpose lexical database (WordNet) [14] and an entity-recogniser (GATE) [5] as guidance tools for identifying knowledge fragments.

Artequakt attempts to identify not just entities, but also their relationships following ontology relation declarations and lexical information.

Extraction Procedure
The extraction process is launched when the user requests a biography for a specific artist that is not in the KB. The query is passed to selected web search engines and the search results are analysed with respect to relevancy to the domain of artists.

Each selected document is then divided into paragraphs and sentences. Each sentence is analysed syntactically and semantically to identify any relevant knowledge to extract. Below is an example of an extracted paragraph:

    "Pierre-Auguste Renoir was born in Limoges on February 25, 1841. His father was a tailor and his mother a dressmaker."

Annotations provided by GATE and WordNet highlight that ‘Pierre-Auguste Renoir’ is a person’s name, ‘February 25, 1841’ is a date, and ‘Limoges’ is a location. Relation extraction is determined by the categorisation result of the verb ‘bear’, which matches two potential relations in the ontology; ‘date_of_birth’ and ‘place_of_birth’. Since the two relations are associated with ‘February 25, 1841’ and ‘Limoges’ respectively, this sentence generates the following knowledge triples about Renoir:
•    Pierre-Auguste Renoir date_of_birth 25/2/1841
•    Pierre-Auguste Renoir place_of_birth Limoges
The second sentence generates knowledge triples related to Renoir’s family:
•    Pierre-Auguste Renoir has_father Person_2
•    Person_2 job_title Tailor
•    Pierre-Auguste Renoir has_mother Person_3
•    Person_3 job_title Dressmaker
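
The way a verb and the entity annotations combine into knowledge triples can be illustrated with a minimal Python sketch. It is not Artequakt’s implementation: the VERB_RELATIONS table, the type names and the extract_triples function are hypothetical, and the annotations that GATE and WordNet would supply are written out by hand.

```python
# Hypothetical fragment of the ontology: each verb lemma maps to the
# relations it may express, with the entity type each relation expects.
VERB_RELATIONS = {
    "bear": [("date_of_birth", "Date"), ("place_of_birth", "Location")],
}


def extract_triples(subject, verb_lemma, entities):
    """Build (subject, relation, object) triples for one sentence.

    entities is a list of (text, type) annotations found in the
    sentence besides the subject; these are assumed to come from an
    entity recogniser.
    """
    triples = []
    for relation, expected_type in VERB_RELATIONS.get(verb_lemma, []):
        for text, entity_type in entities:
            # A relation is filled only by an entity of the expected type.
            if entity_type == expected_type:
                triples.append((subject, relation, text))
    return triples


# Hand-written annotations for the first sentence of the example paragraph.
entities = [("Limoges", "Location"), ("February 25, 1841", "Date")]
triples = extract_triples("Pierre-Auguste Renoir", "bear", entities)
for triple in triples:
    print(triple)
```

In this sketch the verb ‘bear’ proposes both relations, and the type of each annotated entity decides which relation it fills. A verb that is absent from the table yields no triples, which mirrors the low-risk stance of preferring precision over recall.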

4 http://cidoc.ics.forth.gr/index.html

Inaccurately extracted knowledge may reduce the quality of the system’s output. For this reason, our extraction
rules were designed to be of low risk levels to ensure higher extraction precision. Advanced consistency checks can help identify some extraction inaccuracies; e.g. a date of marriage is before the date of birth, or two unrelated places of birth for the same person!

The extraction process terminates by sending the extracted knowledge to the ontology server. Figure 2 is the RDF representation of the extracted knowledge. Artequakt’s IE process is out of the scope of this paper, and is fully described in [2] and [10].

BIOGRAPHY GENERATION
Once the information has been extracted, stored and consolidated, the Artequakt system repurposes it by automatically generating biographies of the artists. Figure 3 shows a biography of Renoir.

Figure 3. A Biography Generated Using Sentences.

The biographies are based on templates authored in the Fundamental Open Hypermedia Model (FOHM) and stored in the Auld Linky contextual structure server [13]. Each section of the template is instantiated with paragraphs or sentences generated from information in the KB. The KB informs the templates of the theme of the sentences and paragraphs (e.g. influences, family info, painting) and the generation tool selects the relevant ones and structures them in the desired form and order. Very little text generation is used in the current implementation (e.g. Figure 3, 1st and last sentences), but this will be the focus of the next phase.

By storing conflicting information rather than discarding it during the consolidation process, the opportunity exists to provide biographies that set out arguments as to the facts (with provenance, in the form of links to the original sources) by juxtaposing the conflicting information and allowing the reader to make up their own mind.

Different templates can be constructed for different types of biography. Two examples are the summary biography, which provides paragraphs about the artist arranged in a rough chronological order, and the fact sheet, which simply lists a number of facts about the artist, e.g. date of birth, place of study etc. The biographies also take advantage of the structure server’s ability to filter the template based on a user’s interest. If the reader is not interested in the family life of the artist the biography can be tailored to remove this information. More about Artequakt’s biography generation is available in [14].

AUTOMATIC INSTANTIATION
Storing knowledge extracted from text documents in KBs offers new possibilities for further analysis and reuse. Ontology instantiation refers to the insertion of information into the KB, as described by the ontology (sometimes referred to as ontology population). Instantiating ontologies with a high quantity and quality of knowledge is one of the main steps towards providing valuable and consistent ontology-based knowledge services. Manual ontology instantiation is very labour intensive and time consuming. Some semi-automatic approaches have investigated creating document annotations and storing the results as assertions [7][20][21]. [7] and [20] describe two frameworks for user-driven ontology-based annotations, reinforced with the IE learning tool Amilcare [3]. However, the two frameworks are manually driven and mainly focus on entity annotations. They lack the capability of identifying relationships reliably. In [20], relationships were added automatically between instances, but only if these instances already existed in the KB; otherwise user intervention was required.

In Artequakt we investigate the possibility of moving towards a fully automatic approach of feeding the ontology with knowledge extracted from unstructured text. Information is extracted in Artequakt with respect to a given ontology and provided as RDF or XML files using tags mapped directly from names of classes and relationships in that ontology. When the ontology server receives a new RDF file, a feeder tool is activated to parse the file and add its knowledge triples to the KB automatically. Once the feeding process terminates, the consolidation tool searches for and merges any duplication in the KB.
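
The feeding step described under Automatic Instantiation, in which an RDF file is parsed and its knowledge triples are added to the KB, might be sketched as follows. This is an illustration under stated assumptions rather than the Artequakt feeder: the KB is reduced to an in-memory set of triples, the kb namespace URI is a made-up stand-in for the &kb; entity of Figure 2, and the feed and local functions are hypothetical names.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
KB = "http://example.org/kb#"  # assumed namespace, stands in for &kb;

# A cut-down version of Figure 2 with the entity references expanded.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:kb="{KB}">
  <kb:Person rdf:about="{KB}Person_1" kb:name="Pierre-Auguste Renoir">
    <kb:date_of_birth rdf:resource="{KB}Date_1"/>
    <kb:place_of_birth rdf:resource="{KB}Place_1"/>
  </kb:Person>
  <kb:Date rdf:about="{KB}Date_1" kb:day="25" kb:month="2" kb:year="1841"/>
</rdf:RDF>"""


def local(name):
    """Drop the '{namespace}' prefix or the URI part before '#'."""
    return name.rsplit("}", 1)[-1].rsplit("#", 1)[-1]


def feed(rdf_xml, kb):
    """Add the triples of one RDF/XML document to kb (a set).

    Returns the number of triples that were not already present.
    """
    before = len(kb)
    for node in ET.fromstring(rdf_xml):
        subject = local(node.attrib[f"{{{RDF}}}about"])
        # The element name gives the ontology class of the instance.
        kb.add((subject, "type", local(node.tag)))
        # Non-RDF attributes carry literal values (name, day, ...).
        for attr, value in node.attrib.items():
            if not attr.startswith(f"{{{RDF}}}"):
                kb.add((subject, local(attr), value))
        # Child elements point to other instances via rdf:resource.
        for child in node:
            target = local(child.attrib[f"{{{RDF}}}resource"])
            kb.add((subject, local(child.tag), target))
    return len(kb) - before


kb = set()
added = feed(SAMPLE, kb)
print(added, "triples added")
```

Because the sketch stores triples in a set, feeding the same file twice adds nothing new. That only removes byte-identical triples; instances that describe the same artist under different identifiers remain, and these are the cases that the consolidation step has to handle.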
KNOWLEDGE BASE CONSOLIDATION
Automatically instantiating an ontology from diverse and distributed resources poses significant challenges. One persistent problem is that of the consolidation of duplicate information that arises when extracting similar or overlapping information from different sources. Tackling this problem is important to maintain the referential integrity and quality of results of any ontology-based knowledge service. [18] relied on manually assigned object identifiers to avoid duplication when extracting from different documents.

Little research has looked at the problem of information consolidation in the IE domain. This problem becomes more apparent when extracting from multiple documents. Comparing and merging extracted information is often based on domain dependent heuristics [17] [18] [23]. Our approach attempts to identify inconsistencies and consolidate duplications automatically using a set of heuristics and term expansion methods based on WordNet [22].

Duplicate Information
There exist two main types of duplication in our KB; duplicate instances (e.g. multiple instances representing the same artist), and duplicate attribute values (e.g. multiple dates of birth extracted for the same artist).

Artequakt’s IE tool treats each recognised entity (e.g. Rembrandt, Paris) as a new instance. This may result in creating instances with overlapping information (e.g. two Person instances with the same name and date of birth). The role of consolidation in Artequakt includes analysing and comparing attribute values of the instances of each type of concept in the KB (e.g. Person, Date) to identify inconsistencies and duplications.

The amount of overlap between the attribute values of any pair of instances could indicate their duplication potential. However, this overlap is not always measurable. IE tools are sometimes only able to extract fragments of information about a given entity (e.g. an artist), especially if the source document or paragraph is small or difficult to analyse. This leads to the creation of new instances with only one or two facts associated with each. An example is two artist instances with the name Rembrandt, where one instance has a location relationship to Holland, while the other has a date of birth of 1606. Comparing such shallow instances will not reveal their duplication potential. Furthermore, neither the source information nor the information extraction is always accurate. For example a Rembrandt instance can be extracted with the correct family attribute values, but with the wrong date of birth, in which case this instance will be mismatched with other Rembrandt instances in spite of referring to the same artist.

Unique Name Assumption
One basic heuristic applied in Artequakt is that artist names are unique, so artist instances with identical names are merged. According to this heuristic, all instances with the name Rembrandt are combined into one instance. This heuristic is obviously not foolproof, but it works well in the limited domain of artists.

Information Overlap
There are cases where the full name of an artist is not given in the source document or its extraction fails, in which case the instances will not be captured by the unique-name heuristic. For example, when we extracted information about Rembrandt and merged same-name artists, two instances remained for this artist; Rembrandt and Rembrandt Harmenszoon van Rijn. In such a case we compare certain attribute values, and merge the two instances if there is sufficient overlap. For the two Rembrandt instances, both had the same date and place of birth, and therefore were combined into one instance. The duplication would not have been caught if these attributes had different values.

Attribute Comparison
When the above heuristics are applied, merged instances might end up having multiple attribute values (e.g. multiple dates and places of birth), which in turn need to be analysed and consolidated. Note that some of these attributes might hold conflicting information that should be verified and held for future comparison and use.

Comparing the values of instance attributes is not always straightforward as these values are often extracted in different formats and specificity levels (e.g. synonymous place names, different date styles) making them harder to match. Artequakt applies a set of heuristics and expansion methods in an attempt to match these values. Consider the following sentences:
 1. Rembrandt was born in the 17th century in Leyden.
 2. Rembrandt was born in 1606 in Leiden, the Netherlands.
 3. Rembrandt was born on July 15 1606 in Holland.

These sentences provide the same information about an artist, written in different formats and specificity levels. Storing this information in the KB in such different formats is confusing for the biography generator, which can benefit from knowing which information is repetitive and which is contradictory. Matching the above sentences required enriching the original ontology with some temporal and geographical reasoning.

Geographical Consolidation
There has been much work on developing gazetteers of place names, such as the Thesaurus of Geographic Names (TGN) [8] and Alexandria Digital Library [9]. Ontologies can be integrated with such sources to provide the necessary knowledge about geographical hier-
archies, place name variations, and other spatial infor-        consistent, but the third date holds more information
mation [1]. Artequakt derives its geographical knowl-           than the other two. Therefore, the third date is used for
edge from WordNet [14]. WordNet contains information            the instance of Rembrandt. If any of the given facts is
about geopolitical place names and their hierarchies,           inconsistent then it will be stored for future verification
providing three useful relations for the context of Arte-       and use.
quakt; synonym, holonym (part of), and part_meronym
                                                                At the end of the consolidation process, the knowledge
(sub part). The Artequakt ontology is extended to add
                                                                extracted from the three sentences above will be stored
this information for each new instance of place added to
                                                                in the KB as the following two triples for the instance of
the KB.
                                                                Rembrandt:
                                                                •    Rembrandt date_of_birth 15 July 1606
Place Name Synonyms                                             •    Rembrandt place_of_birth Leiden
The synonym relationship is used to identify equivalent
place names. For example the three sentences above
mention several place names were Rembrandt was born.            Inconsistent Information
Using the synonym relationship in WordNet, Leyden can           Some of the extracted information can be inconsistent,
be identified as a variant spelling for Leiden, and that        for example an artist with different dates or places of
Holland and The Netherlands are synonymous.                     birth or death, or inconsistent temporal information,
                                                                such as a date of death that falls before the date of birth.
Place Specificity                                               The source of such inconsistency can be the original
The part-of and sub-part relationships in WordNet are           document itself, or an inaccurate extraction. Predicting
used to find any hierarchical links between the given           which knowledge is more reliable is not trivial. Cur-
places. WordNet shows that Leiden is part of the Neth-          rently we rely on the frequency in which a piece of
erlands, indicating that Leiden is the more precise in-         knowledge is extracted as an indicator of its accuracy;
formation about Rembrandt’s place of birth.                     the more a particular piece of information is extracted,
                                                                the more accurate it is considered to be. For example,
                                                                for Renoir, two unique dates of births emerged; 25 Feb
Shared Place Names                                              1841 and 5 Feb 1841. The former date has been ex-
It is common for places to share the same name. For exam-       tracted from several web sites, while the latter was
ple according to the TGN, there are 22 places worldwide         found in one site only, and therefore considered to be
named London. This problem is less apparent with Word-          less reliable.
Net due to its limited geographical coverage.                   A more advanced approach can be based on assigning
In Artequakt, disambiguation of place names is dependent        levels of trust for each extracted piece of knowledge,
on their specificity variations. For example after processing   which can be derived from the reliability of the source
the three sentences about Rembrandt, it becomes apparent        document, or the confidence level of the extraction of
                                                                that particular information. The knowledge consolida-
that he was born in a place named Leiden in the Nether-
                                                                tion process is not aimed at finding ‘the right answers’
lands. If the last two sentences were not available, it would
                                                                however. The facts extracted are stored for future use,
have not been possible to tell for sure which Leiden is being   with references to the original material.
referred to (assuming there is more than one). One possibil-
ity is to rely on other information, such as place of work,
place of death, to make a disambiguation decision. How-         PORTABILITY TO OTHER DOMAINS
ever, this is likely to produce unreliable results.             The use of an ontology to back up IE is meant to increase
                                                                the system’s portability to other domains. By swapping the
Temporal Consolidation                                          current artist ontology with another domain specific one,
Dates need to be analysed to identify any inconsistencies       the IE tool should still be able to function and extract some
and locate precise dates to use in the biographies. Sim-        relevant knowledge, especially if it is concerned with do-
ple temporal reasoning and heuristics can be used to            main independent relations expressed in the ontology, such
support this task.                                              as personal information (name, date and place of birth, fam-
                                                                ily relations, etc). However, some domain specific extrac-
Artequakt’s IE tool can identify and extract dates in dif-
                                                                tion rules, such as painting style, will eventually have to be
ferent formats, providing them as day, month, year, dec-
                                                                retuned to fit the new domain.
ade, etc. This requires consolidation with respect to pre-
cision and consistency. Going back to our previous ex-          Similarly, the generation templates are currently manually
ample, to consolidate the first date (17th century), the        set for biography construction. These templates may need to
process checks if the years of the other dates fall within      be modified if a different type of output is required. We aim
the given century. If this is true, then the process tries to   to investigate developing templates that can be dynamically
identify the more precise date. The date in the third sen-      instructed and modified by the ontology.
tence is favoured over the other two dates as they are all
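
To make the temporal consolidation and frequency heuristics described in the preceding sections concrete, the following is a minimal Python sketch, not Artequakt's actual implementation. The PartialDate type, the helper names, the input dates for Rembrandt (a century, a year, and a full date) and the count of three sites for Renoir are all assumptions made for illustration. The sketch prefers the most precise date, breaks ties by extraction frequency, and sets contradicting dates aside for later verification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PartialDate:
    """A date known only to some precision (hypothetical type, not Artequakt's)."""
    century: int
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None

    def precision(self) -> int:
        """Count the known components finer than the century."""
        return sum(part is not None for part in (self.year, self.month, self.day))


def from_year(year: int, month: Optional[int] = None,
              day: Optional[int] = None) -> PartialDate:
    """Build a date from a year, deriving its century (1606 falls in the 17th)."""
    return PartialDate((year - 1) // 100 + 1, year, month, day)


def consistent(a: PartialDate, b: PartialDate) -> bool:
    """Two dates agree if every component known to both is equal."""
    pairs = zip((a.century, a.year, a.month, a.day),
                (b.century, b.year, b.month, b.day))
    return all(x == y for x, y in pairs if x is not None and y is not None)


def consolidate_dates(dates: List[PartialDate]) -> Tuple[PartialDate, List[PartialDate]]:
    """Pick one date for an instance; return it with the dates that contradict it."""
    counts = Counter(dates)
    # Prefer the most precise date; among equally precise ones, the most frequent.
    chosen = max(counts, key=lambda d: (d.precision(), counts[d]))
    # Contradicting dates are kept aside for later verification, not discarded.
    set_aside = [d for d in counts if not consistent(d, chosen)]
    return chosen, set_aside


# Illustrative inputs only: the exact extracted values are assumptions.
rembrandt = [PartialDate(17), from_year(1606), from_year(1606, 7, 15)]
renoir = [from_year(1841, 2, 25)] * 3 + [from_year(1841, 2, 5)]

print(consolidate_dates(rembrandt))
print(consolidate_dates(renoir))
```

Because precision outranks frequency in this sketch, a single fully specified date is chosen even when vaguer dates are more frequent; swapping the two parts of the sort key would reverse that preference.
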
Consolidation is often based on domain dependent heuristics. However, some of the heuristics used in Artequakt can be suitable for other domains. For example, Artequakt’s approach for comparing and integrating place names using external gazetteers can be used in any domain. Similarly, heuristics concerning the comparison of specific facts to decide whether or not two instances of people are duplicates are also domain independent. Further work is planned to extend the scope of information integration.

Building a cross-domain system is one of the aims of this project, and will be fully investigated in the next stage of development.

EVALUATION
We used the system to instantiate the KB with information on five artists, extracted from around 50 web pages.

Extraction Performance
Precision and recall were calculated for a set of 10 artist relations (about birth, death, places where they worked or studied, who influenced them, professions of their parents, etc). Results showed that precision scored higher than recall with average values of 85 and 42 respectively. The experiment is described in more detail in [2].

Biography Evaluation
Although we have not conducted any formal evaluation of the biographies generated by the system, we are in a position to make a few observations. In general we found that the system is fairly successful in reproducing text for a given artist. We are currently looking at how best to perform a qualitative evaluation of the biographies, perhaps with a task-based user evaluation, comparing the Artequakt system with a traditional search engine.

Consolidation Rate
Table 1 shows the change in the number of instances and relations after consolidating the KB. Applying the heuristics described earlier in the paper led to a reduction in the number of instances of the Person and Date classes by 90% and 64% respectively. Before consolidation, 283 instances representing Rembrandt were stored. The unique-name consolidation heuristic was the most effective, with no identified mistakes.

When place instances are fed to the KB, they are expanded using WordNet and stored alongside their synonyms, holonyms (part of), and part_meronyms (sub parts). The number of Place instances created in the KB has therefore increased significantly, from 30 to 505 (listed as +94% in Table 1). This gave the consolidation the power to identify and consolidate relationships to places as described in the geographical consolidation section. Some instances (mainly dates) were not consolidated due to slight syntactical differences, e.g. “25 th/2/1841” versus “25/2/1841”. This highlights the need for an additional syntactic-checking process that could eliminate such noise.

Table 1. Consolidation rates

Class              Before consld.   After consld.   Rate%
Person instance    1475             152             -90
Date instance      83               30              -64
Place instance     30               505             +94
Person relations   4240             1562            -63

CONCLUSIONS
This paper describes a system that automatically extracts knowledge, instantiates an ontology with knowledge triples, and reassembles the knowledge in the form of biographies. Problems related to this task, such as the identification and consolidation of duplicated knowledge and the verification of inconsistent knowledge, are highlighted. Artequakt’s approaches to tackle these problems are described.

An initial experiment, using around 50 web pages and 5 artists, showed promising results, with nearly 3 thousand unique knowledge triples extracted (before consolidation). However, some of this knowledge was too sparse to be of any clear benefit. This indicates that more pages need to be processed, and further rules need to be constructed to cover additional ontology concepts and relations and expand the knowledge extraction scope.

The generated biographies were informative and brought together knowledge extracted from various sources. However, reusing original text to generate biographies highlighted several problems, including co-referencing and other textual deixis (such as 'Later', or 'Nevertheless'). This underlines the potential benefits of regenerating text directly from the extracted facts, which is part of our near future plans.

Our consolidation techniques significantly decreased the number of instances in the KB, by up to 90% for certain classes and 63% for attributes related to instances of Person. A few duplicate instances remained undetected, mainly due to a lack of the information required for the knowledge comparison.

Future work on Artequakt will continue to develop its modular architecture and refine the information extraction and consolidation processes. In addition we are beginning to look at how we might leverage the full power of the underlying ontology to aid extracting information from multiple domains and produce different types of reports.

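
The reduction figures quoted above follow from the counts in Table 1. The short Python check below is only a sketch of that arithmetic; the function names are hypothetical. It assumes the Rate% column is the change relative to the count before consolidation, which reproduces the three negative rates. The +94 for Place instances does not fit that formula; it matches the share of the final 505 instances that were added by the WordNet expansion. That reading is an inference from the numbers, not something the paper states.

```python
def change_rate(before: int, after: int) -> float:
    """Percentage change relative to the count before consolidation."""
    return 100.0 * (after - before) / before


def added_share(before: int, after: int) -> float:
    """Percentage of the final count that was added (assumed reading of +94)."""
    return 100.0 * (after - before) / after


# Counts copied from Table 1: (before consolidation, after consolidation).
reductions = {
    "Person instance": (1475, 152),
    "Date instance": (83, 30),
    "Person relations": (4240, 1562),
}

for name, (before, after) in reductions.items():
    print(f"{name}: {change_rate(before, after):+.0f}%")

# Place instances grow from 30 to 505 because of the WordNet expansion.
print(f"Place instance, relative to before: {change_rate(30, 505):+.0f}%")
print(f"Place instance, share of final count added: {added_share(30, 505):+.0f}%")
```
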
ACKNOWLEDGEMENTS
This research is funded in part by EU Framework 5 IST project “Sculpteur” IST-2001-35372, EPSRC IRC project “Equator” GR/N15986/01 and EPSRC IRC project “AKT” GR/N15764/01.

REFERENCES
[1] Alani, H., Jones, C., Tudhope, D.: Associative and Spatial Relationships in Thesaurus-Based Retrieval. Proc. 4th European Conf. on Digital Libraries, pages 45--58, Lisbon, Portugal, Sept. LNCS, 2000.
[2] Alani, H., Kim, S., Millard, D., Weal, M., Lewis, P., Hall, W., Shadbolt, N.: Automatic Extraction of Knowledge from Web Documents. Workshop on Human Language Technology for the Semantic Web and Web Services, 2nd Int. Semantic Web Conf., Sanibel Island, Florida, USA, 2003.
[3] Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American, 2001.
[4] Ciravegna, F.: Adaptive Information Extraction from Text by Rule Induction and Generalisation. Proc. 17th Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 1251--1256, Seattle, USA, 2001.
[5] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: a framework and graphical development environment for robust NLP tools and applications. Proc. 40th Anniversary Meeting of the Association for Computational Linguistics, Philadelphia, USA, 2002.
[6] Dingli, A., Ciravegna, F., Guthrie, D., Wilks, Y.: Mining Web Sites Using Unsupervised Adaptive Information Extraction. Proc. 10th Conf. of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 2003.
[7] Handschuh, S., Staab, S., Ciravegna, F.: S-CREAM – Semi Automatic Creation of Metadata. Semantic Authoring, Annotation and Markup Workshop, 15th European Conf. Artificial Intelligence, Lyon, France, 2002.
[8] Harpring, P.: Proper Words in Proper Places: The Thesaurus of Geographic Names. MDA Info. 2(3), 1997.
[9] Hill, L.L., Frew, J., Zheng, Q.: Geographic Names. The Implementation of a Gazetteer in a Georeferenced Digital Library. Digital Library Magazine, 5(1), 1999.
[10] Kim, S., Alani, H., Hall, W., Lewis, P.H., Millard, D.E., Shadbolt, N., Weal, M.J.: Artequakt: Generating Tailored Biographies with Automatically Annotated Fragments from the Web. Workshop on Semantic Authoring, Annotation & Knowledge Markup, 15th Europ. Conf. on Artificial Intelligence, pp 1--6, France, 2002.
[11] Maedche, A., Neumann, G., Staab, S.: Bootstrapping an Ontology-based Information Extraction System. Intelligent Exploration of the Web. P. Szczepaniak, et al., Heidelberg, Springer 2002.
[12] McKeown, K.R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J.L., Nenkova, A., Sable, C., Schiffman, B., Sigelman, S.: Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. Proc. Human Language Technology Conf., San Diego, CA, USA, 2002.
[13] Michaelides, D.T., Millard, D.E., Weal, M.J., DeRoure, D.: Auld Leaky: A Contextual Open Hypermedia Link Server. Proc. 7th Hypermedia: Openness, Structural Awareness, and Adaptivity, pages 59--70, Springer Verlag, Heidelberg, 2001.
[14] Millard, D.E., Alani, H., Kim, S., Weal, M.J., Lewis, P., Hall, W., DeRoure, D., Shadbolt, N.: Generating Adaptive Hypertext Content from the Semantic Web. 1st International Workshop on Hypermedia and the Semantic Web, HyperText'03, Nottingham, UK, 2003.
[15] Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An on-line lexical database. Int. J. Lexicography, 3(4):235--312, 1993.
[16] Poibeau, T.: Deriving a multi-domain information extraction system from a rough ontology. Proc. 17th Int. Conf. on Artificial Intelligence, Seattle, USA, 2001.
[17] Radev, D.R., McKeown, K.R.: Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469--500, 1998.
[18] Reidsma, D., Kuper, J., Declerck, T., Saggion, H., Cunningham, H.: Cross document annotation for multimedia retrieval. EACL Workshop on Language Technology and the Semantic Web, Budapest, 2003.
[19] Staab, S., Maedche, A., Handschuh, S.: An Annotation Framework for the Semantic Web. Proc. 1st Int. Workshop on MultiMedia Annotation, Tokyo, 2001.
[20] Vargas-Vera, M., Motta, E., Domingue, J., Buckingham Shum, S., Lanzoni, M.: Knowledge Extraction by using an Ontology-based Annotation Tool. Proc. Workshop on Knowledge Markup & Semantic Annotation, 1st Int. Conf. on Knowledge Capture, pp 5--12, Victoria, B.C., Canada, 2001.
[21] Vargas-Vera, M., Motta, E., Domingue, J., Lanzoni, M., Stutt, A., Ciravegna, F.: MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. 13th Int. Conf. on Knowledge Engineering and Management (EKAW), Spain, 2002.
[22] Voorhees, E.M.: Using WordNet for Text Retrieval. Fellbaum (ed.) WordNet: An Electronic Lexical Database, pages 285--303, MIT Press, 1998.
[23] White, M., Korelsky, T., Cardie, C., Ng, V., Pierce, D., Wagstaff, K.: Multidocument Summarization via Information Extraction. Proc. of Human Language Technology Conf. (HLT 2001), San Diego, CA, 2000.