<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Infrastructure for Building Semantic Web Portals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuangui Lei</string-name>
          <email>y.lei@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanessa Lopez</string-name>
          <email>v.lopez@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>e.motta@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge Media Institute (KMi), The Open University</institution>
          ,
          <addr-line>Milton Keynes</addr-line>
        </aff>
      </contrib-group>
      <fpage>1019</fpage>
      <lpage>1035</lpage>
      <abstract>
<p>In this paper, we present our KMi semantic web portal infrastructure, which supports two important tasks of semantic web portals, namely metadata extraction and data querying. Central to our infrastructure are three components: i) an automated metadata extraction tool, ASDI, which supports the extraction of high quality metadata from heterogeneous sources, ii) an ontology-driven question answering tool, AquaLog, which makes use of the domain specific ontology and the semantic metadata extracted by ASDI to answer questions posed in natural language, and iii) a semantic search engine, which enhances traditional text-based searching by making use of the underlying ontologies and the extracted metadata. A semantic web portal application has been built, which illustrates the usage of this infrastructure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>domain specific ontologies) is extracted from heterogeneous sources in an
automated manner, and ii) comprehensive querying facilities are provided, enabling
knowledge to be accessed as easily as possible.</p>
      <p>
        Our overview of current semantic web portals reveals that they offer limited
support for metadata extraction and data querying. In contrast with these, the
system we present here, the KMi semantic web portal infrastructure, focuses
on these issues. Central to our infrastructure are three components: i) ASDI,
an automated semantic data acquisition tool, which supports the acquisition of
high quality metadata from heterogeneous sources, ii) AquaLog [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a portable
ontology-driven question answering tool, which exploits available semantic
markups to answer questions in natural language format, and iii) a semantic search
engine, which enhances keyword searching by making use of the underlying
domain knowledge (i.e. the ontologies and the extracted metadata).
      </p>
      <p>The rest of the paper is organized as follows. We begin in section 2 by
investigating how current semantic web portals approach the issues of metadata
extraction and data querying. We then present an overview of the KMi semantic
web portal infrastructure in section 3. Thereafter, we explain the core
components of the infrastructure in sections 4, 5, and 6. In section 7, we present the
application of our infrastructure and describe the experimental evaluations
carried out in this application. Finally, in section 8, we conclude our paper with a
discussion of our results, the limitations, and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the art</title>
      <p>In this section, we investigate how current semantic web portals address the two
important issues described above, namely metadata extraction and data
querying. As we focus only on these two issues, other facilities (e.g., the generation of
user interfaces, the support for ontology management) will be left out. We survey
a representative sample of portals and tools without performing an exhaustive
study of this research strand.</p>
<p>MindSwap, OntoWeb, and Knowledge Web are examples of semantic web portals
built by research projects. They rely on back-end semantic data repositories
to describe the relevant data. Some portals (e.g., MindSwap and OntoWeb)
provide tools to support metadata extraction. Such tools are, however, not directly
deployed in these portals. Regarding data querying, most portals offer
ontology-based searching facilities, which augment traditional information retrieval
techniques with ontologies to support the querying of semantic entities.</p>
      <p>
        The CS AKTive Space [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the winner of the 2003 Semantic Web
Challenge competition, gathers data automatically on a continuous basis for the UK
Computer Science domain. Quality control related issues such as the problem of
duplicate entities are only weakly addressed (for example, by heuristics based
methods or using manual input). The portal offers several filtering facilities
(including geographical filtering) to support information access. Regarding data
querying, it provides a query-language-based interface, which allows users to
specify queries using supported languages (e.g., SPARQL7).
      </p>
      <p>
        MuseumFinland [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is the first semantic web portal that aggregates
heterogeneous museum collections. The underlying metadata is extracted from
distributed databases by means of mapping database schemas to the shared
museum ontologies. The portal provides a view-based multi-facet searching facility,
which makes use of museum category hierarchies to filter information. It also
provides a keyword searching facility, which attempts to match the keyword with
the available categories and then uses the category matches to filter information.
      </p>
      <p>
        Flink [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], winner of the 2004 Semantic Web Challenge competition, extracts
and aggregates online social networks in the research community of the semantic
web. The data is aggregated in an RDF(S) repository and a set of domain-specific
inference rules are used to ensure its quality. In particular, identity reasoning is
performed to determine if different resources refer to the same individual (i.e.,
co-relation). Flink makes use of a range of visualization techniques to support
information browsing. Data querying is, however, not supported.
      </p>
      <p>In summary, most portals mentioned above support metadata extraction.
While they do provide useful support in their specific problem domain, their
support for quality control is relatively weak. Even though some co-relation and
disambiguation mechanisms have been exploited (e.g., in CS AKTive Space and
Flink), quality control has not been fully addressed. For example, the problems of
duplicate or erroneous entities have not been addressed in any of the approaches
mentioned above. Such problems may significantly decrease the quality of the
acquired semantic data.</p>
<p>Regarding data querying, some portals (e.g., CS AKTive Space,
MuseumFinland, and Flink) provide comprehensive support for information clustering and
filtering. Some (e.g., MindSwap and MuseumFinland) provide ontology-based
searching facilities. While these facilities do help users access the
underlying information, they typically suffer from the problem of knowledge
overhead: users must be equipped with extensive knowledge of the
underlying ontologies or of the specified query language in order to i)
formulate queries, whether by filling out forms in ontology jargon (e.g., in
OntoWeb) or by writing queries directly (e.g., in CS AKTive Space), ii) understand
the query results (e.g., in MindSwap), or iii) reach the data they are
interested in (e.g., in MuseumFinland and Flink). Users are not able to pose questions
in their own terms.</p>
      <p>Another issue associated with data querying is that the keyword-based
searching support is relatively weak. The benefits of the availability of semantic markups
have not yet been fully exploited. Indeed, with the partial exception of
MuseumFinland (which matches keywords against museum categories), no efforts have been
made to understand the meaning of the queries. Furthermore, the
techniques used are primarily focused on enabling search for semantic entities
(e.g., MindSwap, OntoWeb). The underlying semantic relations of the metadata have
not been used to support the finding of other relevant information. For example,</p>
      <sec id="sec-2-1">
        <title>7 http://www.w3.org/TR/rdf-sparql-query/</title>
<p>when searching for “PhD student”, we often get a list of instances of the class
phd-student. We may, however, be more interested in the news stories, academic
projects, and publications that PhD students are involved in, rather than in the
instances themselves. The search engines fail if such information is not
directly defined in the class, because the complex relations of the semantic
metadata have not been exploited.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>An overview of the KMi semantic web portal infrastructure</title>
      <p>Central to this layer are two components. One is AquaLog, which takes questions
in natural language format and an ontology as input and presents precise answers
to users. As will be explained in section 5, AquaLog makes use of natural
language processing technologies and the semantic representation of the extracted
knowledge to achieve the task of question answering. The other component is the
semantic search engine, which enhances the performance of traditional keyword
search by augmenting current semantic search techniques with complex semantic
relations extracted from the underlying heterogeneous sources. This component
will be described in section 6.</p>
      <p>
        The presentation layer supports the generation of user interfaces for
semantic web portals. The KMi semantic web portal infrastructure relies on
OntoWeaver-S [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to support i) the aggregation of data from the semantic
service layer and the semantic data layer and ii) the generation of dynamic web
pages. OntoWeaver-S offers high level support for the design of data aggregation
templates, which describe how to retrieve data and organize data content. As
the generation of user interfaces is outside the scope of this paper, we will not go
through the details of OntoWeaver-S.
      </p>
    </sec>
    <sec id="sec-4">
      <title>The extraction of high quality metadata</title>
      <p>To ensure high quality, we identify three generic tasks that are related to
metadata extraction and which should be supported in semantic web portals. They
include i) extracting information in an automatic and adaptive manner, so that
on one hand the process can be easily repeated periodically in order to keep the
knowledge updated and on the other hand different meanings of a given term can
be captured in different contexts; ii) ensuring that the derived metadata is free
of common errors; and iii) updating the semantic metadata as new information
becomes available.</p>
<p>In ASDI we provide support for all these quality assurance related tasks.
Figure 2 shows its architecture. ASDI relies on an automatic and adaptive
information extraction tool, which marks up textual sources, a semantic
transformation engine, which converts data from source representations into the
specified domain ontology according to the transformation instructions specified in
a mapping ontology, and a verification engine, which checks the quality of the
previously generated semantic data entries. These components will be explained
in the following subsections.
</p>
      <sec id="sec-4-1">
        <title>Information extraction</title>
        <p>
          To address the issue of adaptive information extraction, we use ESpotter [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
a named entity recognition (NER) system that provides an adaptive service.
ESpotter accepts the URL of a textual document as input and produces a list
of the named entities mentioned in that text. The adaptability is realized by
means of domain ontologies and a repository of lexicon entries. For example, in
the context of the KMi domain, ESpotter is able to mark the term “Magpie” as
a project, while in other domains it marks it as a bird.
        </p>
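<p>The adaptive behaviour described above can be sketched as a lexicon dispatch keyed by domain; the domain names, patterns, and types below are invented for illustration and are not ESpotter’s actual lexicon format.</p>

```python
import re

# Toy domain lexicons: the same surface string maps to different types
# depending on which domain is active (illustrative, not ESpotter's data).
DOMAIN_LEXICONS = {
    "kmi": {r"\bMagpie\b": "Project", r"\bAquaLog\b": "Tool"},
    "ornithology": {r"\bMagpie\b": "Bird"},
}

def recognize_entities(text, domain):
    """Return (surface form, type) pairs recognized in `text` for `domain`."""
    entities = []
    for pattern, etype in DOMAIN_LEXICONS.get(domain, {}).items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(0), etype))
    return entities
```

<p>Under the KMi lexicon the string “Magpie” is typed as a project; under the ornithology lexicon the same string is typed as a bird.</p>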
        <p>
          For the purpose of converting the extracted data to the specified domain
ontology (i.e., the ontology that should be used by the final applications), an
instance mapping ontology (see details in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) has been developed, which supports
i) the generation of rich semantic relations along with semantic data entries, and
ii) the specification of domain specific knowledge (i.e. lexicons). The lexicons
are later used by the verification process. A semantic transformation engine has been
prototyped, which accepts structured sources and transformation instructions as
input and produces semantic data entries.
        </p>
        <p>To ensure that the acquired data stays up to date, a set of monitoring services
detect and capture changes made in the underlying data sources and initiate the
whole extraction process again. This ensures a sustainable and maintenance-free
operation of the overall architecture.
</p>
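<p>A minimal sketch of such a monitoring service, assuming a polling model with a content-hash change test; the re_extract callback and source identifiers are placeholders, since the paper does not specify ASDI’s monitoring interface.</p>

```python
import hashlib

# Sketch of a monitoring service that re-triggers extraction when a source
# changes, detected by comparing content hashes between polls (assumption:
# a polling model; ASDI's actual interface is not described here).
class SourceMonitor:
    def __init__(self, re_extract):
        self.re_extract = re_extract  # called with the changed source id
        self.last_seen = {}           # source id -> content hash

    def poll(self, source_id, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self.last_seen.get(source_id) != digest:
            self.last_seen[source_id] = digest
            self.re_extract(source_id)
            return True               # extraction was re-initiated
        return False                  # no change detected
```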
      </sec>
      <sec id="sec-4-2">
        <title>Information verification</title>
        <p>The goal of the verification engine is to check that each entity has been extracted
correctly by the extraction components. The verification process consists of three
increasingly complex steps as depicted in Figure 3. These steps employ several
semantic web tools and a set of resources to complete their tasks.</p>
      </sec>
      <sec id="sec-4-3">
<title>Step 1: Checking the internal lexicon library</title>
<p>The lexicon library maintains domain specific lexicons (e.g., abbreviations) and records the mappings
between strings and instance names. One lexicon mapping example in the KMi
semantic web portal is that the string “ou” corresponds to the instance
the-open-university, which has been defined in one of the domain specific ontologies.
The verification engine will consider any appearance of this abbreviation as
referring to the corresponding entity.</p>
        <p>The lexicon library is initialized by lexicons specified through the mapping
instruction and expands as the verification process goes on. By using the lexicon
library, the verification engine is able to i) exploit domain specific lexicons to
avoid domain specific noisy data and ii) avoid repeating the verification of the
same entity thus making the process more efficient.</p>
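<p>The lexicon library’s behaviour can be sketched as a simple mapping that is seeded from the mapping instructions and grows as entities are verified; the class and method names are illustrative, not ASDI’s API.</p>

```python
# Sketch of the Step 1 lexicon library: a mapping from surface strings to
# already-verified instance names, seeded from the mapping instructions and
# extended as verification proceeds (names are illustrative assumptions).
class LexiconLibrary:
    def __init__(self, seed=None):
        self.mappings = dict(seed or {})

    def lookup(self, surface):
        """Return the verified instance name for `surface`, or None."""
        return self.mappings.get(surface.lower())

    def learn(self, surface, instance):
        """Record a newly verified mapping so later checks can skip work."""
        self.mappings[surface.lower()] = instance

lexicons = LexiconLibrary({"ou": "the-open-university"})
```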
      </sec>
      <sec id="sec-4-4">
<title>Step 2: Querying the semantic web data repository</title>
<p>This step uses an ontology-based data querying engine to query the already acquired semantic
web data (which is assumed to be correct, i.e. trusted) and to resolve obvious typos
and minor errors in the data. This step contains a disambiguation mechanism,
whose role is to resolve ambiguous entities (e.g., whether the term “star
wars” refers to Lucas’ movie or President Reagan’s military programme).</p>
        <p>The data querying engine employs a number of string matching algorithms
to deal with obvious typos and minor errors in the data. For example, in a news
story a student called Dnyanesh Rajapathak is mentioned. The student name is
however misspelled as it should be Dnyanesh Rajpathak. While the student name
is successfully marked up and integrated, the misspelling problem is carried into
the portal as well. With support from the data querying engine, this problem
is corrected by the verification engine. It queries the knowledge base for all
entities of type Student and discovers that the difference between the name of
the verified instance (i.e., Dnyanesh Rajapathak) and that of one of the students
(i.e., Dnyanesh Rajpathak) is minimal (they differ by only one letter). Therefore,
the engine returns the correct name of the student as a result of the verification.
Note that this mechanism breaks down when similarly named entities denote
different real-life objects.</p>
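<p>The typo-correction step can be sketched with Python’s difflib standing in for the unspecified string matching algorithms; the cutoff value is an assumption.</p>

```python
import difflib

# Sketch of the Step 2 typo check: query all instances of the expected type
# and accept a near-match when the string distance is small. difflib is a
# stand-in for the string metrics the paper leaves unspecified.
def correct_name(extracted, known_instances, cutoff=0.9):
    """Return the closest known instance name, or None if nothing is close."""
    matches = difflib.get_close_matches(extracted, known_instances,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

students = ["Dnyanesh Rajpathak", "Vanessa Lopez"]
```

<p>With the example above, the misspelled “Dnyanesh Rajapathak” is mapped back to the known student name, while an unrelated string yields no match.</p>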
        <p>If there is a single match, the verification process ends. However, when more
matches exist, contextual information is exploited to address the ambiguities.
The verification engine exploits the semantic relations between other entities
appearing in the same piece of text (e.g. the news story) and the matches as the
contextual information. For example, when verifying the person entity Victoria,
two matches are found: Victoria-Uren and Victoria-Wilson. To decide which
one is the appropriate match, the verification engine looks up other entities
referenced in the same story and checks whether they have any relation with
any of the matches in the knowledge base. In this example, the AKT project is
mentioned in the same story, and the match Victoria-Uren has a relation (i.e.,
has-project-member ) with the project. Hence, the appropriate match is more
likely to be Victoria-Uren than Victoria-Wilson.</p>
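<p>The contextual disambiguation heuristic can be sketched as follows over a toy triple store; the relation follows the example above, everything else is illustrative.</p>

```python
# Sketch of Step 2's contextual disambiguation: among candidate matches for
# an ambiguous mention, prefer the one with the most knowledge-base
# relations to entities mentioned in the same story.
def related(a, b, triples):
    """True if a and b are directly related by some triple, either way."""
    return any((s, o) in {(a, b), (b, a)} for s, _, o in triples)

def disambiguate(candidates, context_entities, triples):
    """Return the candidate related to the most context entities."""
    return max(candidates,
               key=lambda c: sum(related(c, e, triples)
                                 for e in context_entities))

triples = [("akt", "has-project-member", "Victoria-Uren")]
```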
      </sec>
      <sec id="sec-4-5">
<title>Step 3: Investigating external resources</title>
<p>
          If the second step fails, external resources such as the Web are investigated to identify whether
the entity is erroneous (and should be removed) or correct but new to the system. For
this purpose, an instance classification tool is developed, which makes use of
PANKOW [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and WordNet [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], to determine the appropriate classification of
the verified entity. Now let us explain the mechanism by using the process of
verifying the entity IBM as an example.
        </p>
<p>Step 3.1. The PANKOW service is used to classify the string IBM. PANKOW
employs an unsupervised, pattern-based approach on Web data to categorize the
string and produces a set of possible classifications along with ranking values. If
PANKOW cannot produce any result, the term is treated as potentially erroneous,
though it may still be partially correct; its variants are therefore investigated one
by one until classifications can be drawn. For example, the variants of the term
“BBC news” are the term “BBC” and the term “news”. If PANKOW returns any
results, the classifications with the highest ranking are picked. In this example, the term
“company” has the highest ranking.</p>
        <p>Step 3.2. Next the algorithm uses WordNet to compare the similarity
between the type of the verified entity as proposed (i.e., “organization”) and an
alternative type for the entity as returned by PANKOW (i.e.,“company”). The
algorithm here only checks whether they are synonyms. If they are (which is
the case of the example), it is concluded that the verified entity is classified
correctly. Thus, a new instance (IBM of type Organization) needs to be created
and added to the repository. Otherwise, other major concepts of the domain
ontology are compared to the Web-endorsed type (i.e.,“company”) in an effort
to find a proper classification for the entity in the domain ontology. If such
classification is found, it is concluded that the verified entity was wrongly classified.
Otherwise, it can be safely concluded that the verified entity is erroneous.
</p>
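<p>Steps 3.1 and 3.2 can be sketched as below, with a small synonym table standing in for WordNet and the Web-endorsed type passed in directly rather than obtained from PANKOW; both substitutions are assumptions for illustration.</p>

```python
# Sketch of Step 3: compare the extracted type with the Web-endorsed type,
# first as synonyms, then against the other major domain concepts.
# SYNONYMS is a toy stand-in for WordNet, not its real interface.
SYNONYMS = {"organization": {"company", "organisation", "institution"}}

def synonymous(a, b):
    return a == b or b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())

def verify_classification(extracted_type, web_type, domain_concepts):
    """Return 'correct', ('reclassify', concept), or 'erroneous'."""
    if synonymous(extracted_type, web_type):
        return "correct"            # entity is classified correctly
    for concept in domain_concepts:
        if synonymous(concept, web_type):
            return ("reclassify", concept)  # wrongly classified; fix it
    return "erroneous"              # no domain concept fits
```

<p>For the IBM example, the extracted type “organization” and the Web-endorsed type “company” are synonymous, so the entity is accepted and a new instance is created.</p>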
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Question answering as data querying</title>
      <p>In the context of semantic web portals, for a querying facility to be really useful,
users have to be able to pose questions in their own terms, without having to
know about the vocabulary or structure of the ontology or having to master a
special query language. Building the AquaLog system has allowed us to address
this problem.</p>
<p>AquaLog exploits the power of ontologies as a model of knowledge and the
availability of up-to-date semantic markups extracted by ASDI to give precise,
focused answers rather than retrieving possible documents or pre-written
paragraphs of text. In particular, these semantic markups facilitate queries where
multiple pieces of information (which may come from different sources) need to
be inferred and combined together. For instance, when we ask a query such as
“what is the homepage of Peter who has an interest in the Semantic Web?”, we
get a precise answer. Behind the scenes, AquaLog is not only able to correctly
understand the question but is also able to disambiguate multiple matches
of the term Peter and give back the correct match by consulting the ontology
and the available metadata.</p>
      <p>An important feature of AquaLog is its portability with respect to the
ontologies. Figure 4 shows an overview of the AquaLog architecture. It relies on
a linguistic component, which parses the natural language (NL) queries into a
set of linguistic triples, a relation similarity service (RSS), which makes use of
the underlying ontology and the available metadata to interpret the linguistic
triples as ontological triples, and an answer engine, which infers answers from
the derived ontological triples. To illustrate the question answering mechanism,
we use the question “what are the planet news in KMi related to akt?” as an
example. The process takes the following steps:</p>
      <sec id="sec-5-1">
<title>Step 1: Parsing a NL query into linguistic querying triples</title>
<p>
          The linguistic component uses the GATE [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] infrastructure to annotate the input query
and parses the query into a set of linguistic querying triples. At this stage the
analysis is domain independent. It is completely based upon linguistic
technologies. The example described above is parsed into two linguistic triples (which is,
planet news, KMi) and (which is, related to, akt).
        </p>
      </sec>
      <sec id="sec-5-2">
<title>Step 2: Interpreting linguistic querying triples as ontology-compliant triples</title>
<p>In this step, the relation similarity service (RSS) disambiguates and
interprets the linguistic querying triples, by making use of i) string metric
algorithms, ii) lexical resources such as WordNet, and iii) the domain specific
ontology. It produces ontology-compliant triples as results.</p>
        <p>For the first linguistic triple (i.e. (which is, planet news, KMi)) derived from
the example query, the RSS component matches the term planet news to the
concept kmi-planet-news-item in the available metadata repository. It also identifies
the term KMi as the instance Knowledge-Media-Institute-at-the-Open-University
which is actually an instance of the concept research-institute.</p>
<p>Now, the problem becomes finding a relation which links the class
kmi-planet-news-item to the class research-institute or vice versa. In order to find a relation,
all the subclasses of the non-ground generic term, kmi-planet-news-item, need
to be considered. Thanks to ontology inference, all the relations of the
superclasses of the subject concept of the relation (e.g., “kmi-planet-news-item” for
direct relations or “research-institute” for inverse relations) are also its
relations. The relations found include “owned-by”, “has-author”, and
“mentions-organization”. The interpreted ontology triples thus include
(kmi-planet-news-item owned-by research-institute), (kmi-planet-news-item has-author
research-institute) and (kmi-planet-news-item mentions-organization research-institute).
Likewise, the second linguistic triple (i.e., (which is, related to, akt)) is linked
to the first triple through the non-ground term kmi-planet-news-item and
processed as the ontology triples (kmi-planet-news-item, mentions-project, akt) and
(kmi-planet-news-item, has-publications, akt).</p>
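<p>The relation search can be sketched as follows: relations are collected whose domain is the subject class or one of its superclasses and whose range is the object class, so inherited relations are found too. The toy ontology fragment mirrors the example but is otherwise an illustrative assumption.</p>

```python
# Sketch of the RSS relation search: find relations linking a subject class
# (or any of its superclasses) to an object class. Toy ontology fragment.
SUPERCLASSES = {"kmi-planet-news-item": ["news-item"]}
RELATIONS = [  # (domain, relation, range)
    ("news-item", "owned-by", "research-institute"),
    ("kmi-planet-news-item", "mentions-organization", "research-institute"),
]

def ancestors(cls):
    """The class itself plus all its (transitive) superclasses."""
    seen, stack = [], [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.append(c)
            stack.extend(SUPERCLASSES.get(c, []))
    return seen

def linking_relations(subject_cls, object_cls):
    """Relations whose domain covers subject_cls and whose range is object_cls."""
    subject_classes = set(ancestors(subject_cls))
    return [r for d, r, rng in RELATIONS
            if d in subject_classes and rng == object_cls]
```

<p>Thanks to the superclass walk, the inherited relation owned-by is found alongside the directly defined mentions-organization.</p>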
        <p>Since the universe of discourse is determined by the particular ontology used,
there will be a number of discrepancies between the NL questions and the set
of relations recognized in the ontology. External resources like WordNet help in
mapping unknown terms by giving a set of synonyms. However, in quite a few
cases, such as in the two triples of our example, such resources are not enough
to disambiguate between possible ontology relations due to the user or ontology
specific jargon. To overcome this problem, AquaLog includes a learning
mechanism, which ensures that, for a given ontology and a specific user community, its
performance improves over time, as the users can easily give feedback and allow
AquaLog to learn novel associations between the relations used by users, which
are expressed in natural language, and the internal structure of the ontology.</p>
<p>Step 3: Inferring answers from ontology-compliant triples. The
answer engine looks through the available metadata to find data entries which
satisfy the derived ontology triples and produces answers accordingly. As shown
in Figure 5, the answer to our example query is a list of instances of the class
kmi-planet-news-item that mention the project akt, in the context of the KMi
semantic web portal, an application of our infrastructure, which will be explained
in section 7. The user can navigate through each news entry extracted by the
ASDI tool. Figure 5 also illustrates the derived linguistic triples and ontology
triples.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Semantic search</title>
      <p>
        Keyword search is often an appealing facility, as it provides a simple way of
querying information. The role of semantic search is to make use of the
underlying ontologies and metadata available in semantic web portals to provide
better performance for keyword searching. In particular, the semantic search
facility developed in our infrastructure extends current semantic search
technologies [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] (which are primarily centered around enabling search for semantic
entities) by augmenting the search mechanisms with complex semantic relations
of data resources which have been made available either by manual annotation
or automatic/semi-automatic extraction in the context of semantic web portals.
      </p>
<p>To illustrate the semantic searching facility, we use the search for news
stories about PhD students as an example. With traditional keyword searching
technologies, we often get news entries in which the string “phd students”
appears. Those entries which mention the names of the involved PhD students but
do not use the term “phd students” directly will be missed. Such news items,
however, are often the ones that we are really interested in. In the context of semantic
web portals, where semantics of the domain knowledge are available, the
semantic meaning of the keyword (which is a general concept in the example of PhD
students) can be figured out. Furthermore, the underlying semantic relations of the
metadata can be exploited to support the retrieval of news entries which are
closely related to the semantic meaning of the keyword. Thus, the search
performance can be significantly improved by expanding the query with instances
and relations.</p>
      <p>The search engine accepts a keyword and a search subject (e.g., news, projects,
publications) as input and gives back search results which are ranked according
to their relations to the keyword. The search subject is often domain dependent
and can be predefined in a specific problem domain. In customized portals, the
search subjects can be extracted from user profiles. In the example described
above, the search subject is news stories. The search engine follows four major
steps to achieve its task:</p>
<p>Step 1: Making sense of the keyword. From the point of view of semantic
meaning, the keyword may denote i) a general concept (e.g., the keyword “phd
students”, which matches the concept phd-student), ii) an instance entity (e.g.,
the keyword “enrico”, which matches the instance enrico-motta), or iii) literals of
certain instances (e.g., the keyword “chief scientist”, which matches a value
of the instance marc-eisenstadt). The task of this step is to find out into which
category the keyword falls, by employing string matching mechanisms to match
the keyword against ontology concepts, instances, and literals (i.e., non-semantic
entities, e.g., string values) step by step. The output of this step is the entity
matches and the weight of the string similarity.</p>
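<p>Step 1 can be sketched as an ordered match against concepts, instances, and literals; plain lowercase matching stands in for the unspecified string matching mechanisms.</p>

```python
# Sketch of Step 1: decide whether a keyword names a concept, an instance,
# or a literal value, checking in that order (matching rules are assumed).
def classify_keyword(keyword, concepts, instances, literals):
    """Return (category, match) for a keyword, or None.

    concepts: concept names like "phd-student"; instances: instance names;
    literals: mapping from instance name to its literal property values.
    """
    k = keyword.lower()
    for c in concepts:
        if k == c.replace("-", " "):
            return ("concept", c)
    for i in instances:
        if k in i.lower():
            return ("instance", i)
    for inst, values in literals.items():
        if any(k == v.lower() for v in values):
            return ("literal", inst)
    return None
```

<p>The three example keywords from the text fall into the three categories in turn.</p>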
<p>Step 2: Getting related instances. When the keyword falls into the first
category described above, the related instances are the instances of the matched
concepts. In the second case, the related instances are the matched
instances. In the final case, the related instances are the instances
which have the matched literals as values. For example, when querying for the
keyword “chief scientist”, the matched instance is marc-eisenstadt, whose value
of the property job-title matches the keyword.</p>
<p>Step 3: Getting search results. This step finds those instances of
the specified subject (e.g., news stories) which have explicit or (close) implicit
relations with the matched instances. By explicit relations we mean that the instances
are directly associated in explicitly specified triples. For example,
for instances i1 and i2, their explicit relations are specified in triples (i1 p i2) or
(i2 p i1), where the entity p represents relations between them.</p>
<p>Implicit relations are the ones which can be derived from the explicit triples.
Mediators exist in between, which bridge the relations. The more mediators there
are in between, the weaker the derived relation is. For the sake of simplicity, we only
consider the closest implicit relations, in which there is only one mediator in
between. For instances i1 and i2, such implicit relations can be represented
as the following triples: i) (i1 p1 m), (m p2 i2); ii) (i2 p2 m), (m p1 i1); iii) (i1 p1
m), (i2 p2 m); and iv) (m p1 i1), (m p2 i2). In these relations, the semantic entity
m acts as the mediated instance; the predicates p1 and p2 act as the mediated
relations.</p>
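<p>The explicit and one-mediator implicit relation tests can be sketched over a set of (subject, predicate, object) triples; treating triples as undirected edges covers all four mediated patterns i)–iv) at once. The triple store below is illustrative.</p>

```python
# Sketch of Step 3's relation tests over (subject, predicate, object) triples.
def explicit(a, b, triples):
    """True if a and b appear together in one explicitly specified triple."""
    return any((s, o) in {(a, b), (b, a)} for s, _, o in triples)

def neighbours(x, triples):
    """All instances directly related to x, in either direction."""
    return ({s for s, _, o in triples if o == x}
            | {o for s, _, o in triples if s == x})

def implicit_one_mediator(a, b, triples):
    """True if some mediator m bridges a and b, i.e. a and b lie on an
    undirected two-hop path (covers patterns i)-iv) above)."""
    return bool(neighbours(a, triples) & neighbours(b, triples))
```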
<p>Step 4: Ranking results. In this step, the search results are ranked
according to their closeness to the keyword. We take into account three factors:
string similarity, domain context, and semantic relation closeness. The
domain context weight applies to non-exact matches, and helps decide the
closeness of the instance matches to the keyword from the specific domain point
of view. For example, with the keyword “enric”, is the user more likely to mean
the person enrico-motta or the project enrich? The domain context weight of
a matched instance mx is calculated as Pmx / (Pm1 + Pm2 + ... + Pmn), where Pmx denotes the count
of the matched instance mx serving as a value of other instances in the metadata
repository, and m1, m2, ..., mn represent all of the non-exact matches.</p>
      <p>The semantic relation closeness describes how close the semantic relations
are between a search result and the matched instances. It is calculated by counting
all the relations a search result has with all of the matched instances;
for this purpose, we give the explicit relations a weight of 1.0 and the derived
ones a weight of 0.5.</p>
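      <p>Under this weighting, the closeness of a search result reduces to a weighted count of its relations with the matches (a hypothetical helper, not ASDI code):</p>

```python
# Sketch: semantic relation closeness of a search result, counting its
# explicit relations (weight 1.0) and derived/implicit ones (weight 0.5)
# with the matched instances.
EXPLICIT_WEIGHT = 1.0
IMPLICIT_WEIGHT = 0.5

def relation_closeness(n_explicit, n_implicit):
    """Weighted count of relations between a result and all matches."""
    return n_explicit * EXPLICIT_WEIGHT + n_implicit * IMPLICIT_WEIGHT

# a news story with 3 explicit and 4 implicit relations to the matches:
print(relation_closeness(3, 4))  # 5.0
```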
      <p>For the sake of experiments, we give each of the three factors described
above (namely string similarity, domain context and semantic relation closeness)
the same confidence. The confidence can, however, easily be changed to reflect
the importance of certain factors. We applied the semantic search facility in
the application of the KMi semantic web portal, which is presented in the
following section.</p>
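      <p>With equal confidences, the final score is simply the average of the three factors. The sketch below (hypothetical code; the factor values are assumed to be already normalized to [0, 1]) shows how the confidences could be made configurable:</p>

```python
# Sketch: combining string similarity, domain context and semantic
# relation closeness with configurable confidences (equal by default).
def rank_score(string_sim, domain_ctx, relation_close,
               confidences=(1.0, 1.0, 1.0)):
    factors = (string_sim, domain_ctx, relation_close)
    weighted = sum(f * c for f, c in zip(factors, confidences))
    return weighted / sum(confidences)

# equal confidence: a plain average of the three factors
print(rank_score(0.9, 0.75, 0.6))  # approximately 0.75
```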
    </sec>
    <sec id="sec-7">
      <title>The KMi semantic web portal: an application</title>
      <p>We have designed and tested our infrastructure in the context of building a
semantic Web portal for our lab, the Knowledge Media Institute (KMi) at the
UK’s Open University, which provides integrated access to various aspects of the
academic life of our lab8.</p>
      <sec id="sec-7-1">
        <title>Metadata extraction</title>
        <p>The KMi semantic web portal has been built and has been running for several months,
generating and maintaining semantic data from the underlying sources in an
automated way. The relevant data is spread across several different data sources such
as departmental databases, knowledge bases and HTML pages. In particular,
KMi has an electronic newsletter9, which now contains an archive of several
hundred news items describing events of significance to the KMi members.
Furthermore, related events are continuously reported in the newsletter and
added to the news archive.</p>
        <p>An experimental evaluation has been carried out to assess the performance
of the metadata extraction in the context of the KMi semantic web portal, by
comparing the automatically extracted data to manual annotations in terms
of recall, precision and f-measure. Recall is the proportion of all
correct annotations found by the system with respect to those that
can in principle be extracted from the source text; precision is the proportion of
the extracted annotations that were found to be correct; and f-measure assesses
the overall performance by treating recall and precision equally. The task is to
recognize entities (including people, projects and organizations) mentioned in
randomly chosen news stories.</p>
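        <p>As a quick reference, the three measures reduce to simple ratios (generic code, not the evaluation harness used in this study):</p>

```python
# Standard recall, precision and balanced f-measure from counts of
# correct system annotations, total system annotations (extracted), and
# total annotations that could in principle be extracted (possible).
def recall(correct, possible):
    return correct / possible

def precision(correct, extracted):
    return correct / extracted

def f_measure(p, r):
    # balanced F1: treats recall and precision equally
    return 2 * p * r / (p + r)

# e.g. the precision/recall levels reported for ASDI in this paper
p, r = 0.91, 0.77
print(round(f_measure(p, r), 3))  # 0.834
```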
        <p>To illustrate the important role of the verification engine, the performance of
ESpotter (the information extraction tool used in ASDI) is included in
the comparison, showing the quality of the extracted data before and after</p>
        <sec id="sec-7-1-1">
          <title>8 http://semanticweb.kmi.open.ac.uk 9 http://kmi.open.ac.uk/news</title>
          <p>the verification process. Table 1 (recall, precision and f-measure for
People, Organizations, Projects and Total) shows the evaluation results. Although the
recall rate is slightly lower than that of ESpotter (because PANKOW sometimes
gives back empty classifications, thus losing some correct values),
the precision rate has been improved. The f-measure values show that ASDI
performs better than ESpotter in terms of overall performance. This means
that the quality of the extracted data is improved by our verification engine.</p>
          <p>To assess the performance of the question answering facility, we carried out an
initial study in the context of the KMi semantic web portal. We collected 69
different questions, of which 40 were handled correctly; 19 more could
be handled correctly if reformatted by the end user. This is a good result,
considering that no linguistic restrictions were imposed on the questions (note
that we asked users not to pose questions requiring temporal
reasoning, as the underlying ontology does not cover it).</p>
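          <p>From these figures the success rates are easy to derive (a quick arithmetic check, not part of the study itself):</p>

```python
# Quick check of the question answering study figures: 69 questions,
# 40 handled correctly, 19 more recoverable after user reformulation.
total, correct, recoverable = 69, 40, 19

direct_rate = correct / total
with_reformulation = (correct + recoverable) / total
print(f"{direct_rate:.1%}, {with_reformulation:.1%}")  # 58.0%, 85.5%
```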
          <p>Among the failures, linguistic handling accounts for 16 of the 20
failures (23.18% of the queries). This is because the natural language processing
component is unable to classify the query and generate appropriate intermediate
representations (e.g., the question “what are the KMi publications in the area of
the semantic web”). Such questions can, however, usually be easily reformulated
by the end users and thus answered correctly, for example by changing the
question above to “what are the KMi publications in the semantic
web area”, which avoids the use of nominal compounds (e.g., terms that are a
combination of two classes, such as “akt researchers”). Another important observation is
that some failures (10 out of 69) are caused by the lack of appropriate services
over the ontology, e.g. queries about “top researchers”. Finally, the
limited coverage of the available metadata also plays an important role in the
failures. For example, when dealing with the question “who funds the Magpie
project”, the answer engine fails because no such information is available in
the metadata repository.</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>Semantic search</title>
        <p>Although the semantic search engine is still in its infancy, as the ranking
algorithm has not yet been fully investigated, it produces encouraging results. Figure
6 shows the results of the search example described in section
6. Behind the scenes, the news stories are annotated by the metadata
extraction component ASDI and are thus associated with semantic mark-up. The
metadata of phd students is extracted from the departmental databases.</p>
        <p>As shown in the figure, the news entry that mentions the most phd students
appears at the top, as it receives the largest relation weight. News entries that
mention other semantic entities with which many phd students have relations
(e.g., the news story ranked second mentions the project AKT) also get
good rankings. This is because, in the experiment, the ranking algorithm gives
implicit relations half the weight of the explicit ones; this needs further
investigation. The semantic search mechanism is nonetheless powerful, as it makes use
of the available semantic relations between different resources to bring forward
the most relevant information.</p>
        <p>The core observation that underlies this paper is that, in the case of semantic
web portals that offer both end users and applications integrated access to
information in specific user communities, it is crucial to ensure that i) high
quality metadata is acquired and combined from several data sources, and ii)
simple but effective querying facilities are provided, enabling knowledge to be
accessed as easily as possible. By quality here we mean that the semantic data
contains no duplicates and no errors, and that the semantic descriptions correctly
reflect the nature of the described entities. By simple but effective querying we
mean that users are able to pose questions in their own terms and get precise
results.</p>
        <p>Our survey of a set of semantic web portals shows that, on the one hand,
little or no attention is paid to ensuring the quality of the extracted data and, on
the other hand, the support for data querying is limited. In contrast with these
efforts, our semantic web portal infrastructure focuses on ensuring the quality
of the extracted metadata and on the facilities for data querying.</p>
        <p>Our evaluation of the quality verification module shows that it improves the
performance of the bare extraction layer: the metadata extraction component
ASDI outperforms ESpotter, achieving 91% precision and 77% recall. In the
context of semantic web portals, precision is more important than recall, as
erroneous results annoy users more than missing information does. We plan to improve the
recall rate by introducing additional information extraction engines to work in
parallel with ESpotter; such redundancy is expected to substantially improve
recall.</p>
        <p>Our initial study of the question answering tool AquaLog shows that it
provides reasonably good performance when allowing users to ask questions in their
own terms. We are currently addressing the failures identified in the study. We
plan to use more than one ontology, so that limitations in one ontology can be
overcome by the others.</p>
        <p>The semantic search facility developed within the framework of our
infrastructure takes advantage of the semantic representation of information to facilitate
keyword searching, and it produces encouraging results. We plan to further
investigate i) a more fine-grained ranking algorithm that gives appropriate
consideration to all the factors affecting the search results, and ii) the effect of implicit
relations on search results.</p>
        <p>We are, however, aware of a number of limitations of this
semantic web portal infrastructure. For example, the manual specification of mappings
when setting up the metadata extraction component makes the
approach heavyweight to launch. We are currently addressing this issue by investigating
automatic or semi-automatic mapping algorithms. Semi-automatic
mapping would allow our tool to be portable across several different application
domains.</p>
        <p>Another limitation of the infrastructure is its lack of support for trust management.
As anyone can contribute semantic data, trust is an important issue which
needs to be addressed. We plan to study this issue in the near future by looking
at i) how trust factors can be associated with the metadata of semantic web
portals, ii) how they can be combined when one piece of markup comes
from different sources, and iii) what roles the processing tools play from the trust
point of view.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We wish to thank Dr. Marta Sabou and Dr. Victoria Uren for their valuable
comments on earlier drafts of this paper. This work was funded by the
Advanced Knowledge Technologies Interdisciplinary Research Collaboration (IRC),
the Knowledge Sharing and Reuse across Media (X-Media) project, and the
OpenKnowledge project. AKT is sponsored by the UK Engineering and
Physical Sciences Research Council under grant number GR/N15764/01. X-Media
and OpenKnowledge are sponsored by the European Commission as part of the
Information Society Technologies (IST) programme under EC Grants
IST-FP6-026978 and IST-FP6-027253.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          .
          <article-title>The Semantic Web</article-title>
          .
          <source>Scientific American</source>
          ,
          <volume>284</volume>
          (
          <issue>5</issue>
          ):
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          , May
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>Towards the Self-Annotating Web</article-title>
          . In S. Feldman,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uretsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          , and C. Wills, editors,
          <source>Proceedings of the 13th International World Wide Web Conference</source>
          , pages
          <fpage>462</fpage>
          -
          <lpage>471</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Tablan</surname>
          </string-name>
          .
          <article-title>GATE: A framework and graphical development environment for robust NLP tools and applications</article-title>
          .
          <source>In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02)</source>
          , Philadelphia,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          .
          <article-title>WORDNET: An Electronic Lexical Database</article-title>
          . MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McCool</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>Semantic search</article-title>
          .
          <source>In Proceedings of WWW2003</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          .
          <article-title>Searching the web with shoe</article-title>
          .
          <source>In Proceedings of the AAAI Workshop on AI for Web Search</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>40</lpage>
          . AAAI Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Makela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salminen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Viljanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saarela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Junnila</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kettula</surname>
          </string-name>
          .
          <article-title>MuseumFinland - Finnish Museums on the Semantic Web</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          .
          <article-title>An Instance Mapping Ontology for the Semantic Web</article-title>
          .
          <source>In Proceedings of the Third International Conference on Knowledge Capture</source>
          , Banff, Canada,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Domingue</surname>
          </string-name>
          .
          <article-title>Ontoweaver-s: Supporting the design of knowledge portals</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2004)</source>
          . Springer, October
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>V.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>AquaLog: An Ontology-portable Question Answering System for the Semantic Web</article-title>
          .
          <source>In Proceedings of ESWC</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          .
          <article-title>Flink: Semantic Web Technology for the Extraction and Analysis of Social Networks</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.C.</given-names>
            <surname>Schraefel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.R.</given-names>
            <surname>Shadbolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gibbins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glaser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Harris</surname>
          </string-name>
          .
          <article-title>CS AKTive Space: Representing Computer Science in the Semantic Web</article-title>
          .
          <source>In Proceedings of the 13th International World Wide Web Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Uren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>ESpotter: Adaptive Named Entity Recognition for Web Browsing</article-title>
          .
          <source>In Proceedings of the Professional Knowledge Management Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>