=Paper=
{{Paper
|id=Vol-101/paper-16
|storemode=property
|title=A Client Side Approach to Building the Semantic Web
|pdfUrl=https://ceur-ws.org/Vol-101/Erik_Larson.pdf
|volume=Vol-101
}}
==A Client Side Approach to Building the Semantic Web==
Erik Larson
Digital Media Collaboratory, IC2 Institute
The University of Texas at Austin
1301 West 25th Street, Suite 300
Austin, TX 78705 USA
512 474 6312 Ext. 265
elarson@icc.utexas.edu
ABSTRACT
In this paper, I describe an alternative approach to building a semantic web that addresses some known challenges to existing attempts. In particular, powerful information extraction techniques are used to identify concepts of interest in Web pages. Identified concepts are then used to semi-automatically construct assertions in a computer-readable markup, reducing manual annotation requirements. It is also envisioned that these semantic assertions will be constructed specifically by communities of users with common interests. The structured knowledge bases created will then contain content that reflects the uses they were designed for, thereby facilitating effective automated reasoning and inference for real-world problems.

Keywords
Semantic web, information extraction, ontology

INTRODUCTION
The World Wide Web is a vast repository of information annotated in a human-readable format. Unfortunately, annotation that is understood by humans is typically poorly understood by machines. Because the Web was designed for human rather than machine understanding, enhancements to the Web such as better search and retrieval, question answering (Q&A), and automated services through intelligent Web agents are difficult to develop and in many cases not yet practically feasible. Not surprisingly, work on a "next-generation" Web more friendly to machines is already underway, most visibly in the Semantic Web activity championed by Tim Berners-Lee, inventor of the current Web and director of the W3C (http://www.w3c.org), the Web standards and development committee.

The vision of the Semantic Web activity is an "evolution" of the existing Web into one that contains machine-readable markup [Berners-Lee, 2001]. As seen by people, the Semantic Web remains indistinguishable from the current one. Yet machines using the Semantic Web can read Web pages that contain semantic information encoded in a logic-based markup describing their content. This increased power that semantic markup gives to machines also benefits humans: if someone instructs their agent (i.e., their intelligent agent software) to find a bird watching society nearby and to schedule a visit in the next few days, the agent will know that bird watching is a type of outdoor activity and therefore that the weather is a relevant factor, checking the online local forecast (knowing that "nearby" means "local") for signs of thunderstorms, rain, or other kinds of weather incompatible with outdoor activities. The agent can then inform the person of the location of a bird watching society, visiting hours, and good days during the week to go. The evolution of the Web into the Semantic Web, in other words, creates more opportunities for exploiting the rich content on the Web to create value and provide services to everyone.

As laudable as this vision is, there are a number of problems with its practical implementation. First, much of the focus in the current Semantic Web activity is on transforming the HTML content sitting on Web servers to include semantic information in a machine-readable markup language, such as the Resource Description Framework (RDF), the DARPA Agent Markup Language with Ontology Inference Layer (DAML+OIL), or the updated version of DAML, the Web Ontology Language (OWL) [RDF, 2003], [DAML, 2003], [OWL, 2003]. This transformation requires, in effect, a rewriting of the billions of pages of content comprising the current World Wide Web—no small feat, particularly since the Semantic Web languages are much less user friendly than simple HTML. True, the Semantic Web markup languages were designed for greater ease of use than traditional knowledge representation (KR) languages based on first-order logic (they are also not as expressive; see [Stevens, 2003]), but for non-experts unversed in logic systems, annotating Web pages with RDF, DAML, or OWL represents a whole new layer of effort, particularly in comparison with the WYSIWYG software for HTML annotation that is now commonplace.

Second, developers cannot effectively mark up Web documents with semantic content unless they clearly understand the context: what is the purpose of adding the new information? What function is it serving? What questions will it answer, or services will it provide, that represent a clear benefit in some well-defined context? Without this context, the average Web page developer won't likely see a clear point to creating logical markup.
Such an effort would represent, in other words, a purely technical exercise.

Yet a semantic web that facilitates better machine reasoning is indeed desirable and (it is hoped) practically feasible as well. The position taken here is that the "server-side" transformation of Web content in the current Semantic Web activity, while perhaps helpful in certain cases, is nonetheless not a panacea and may even be a hindrance to the task of enhancing the capabilities of the Web for many users. An alternative, "client-side" approach will be presented that enables users to effectively transform the existing (HTML) content of the Web into more usable, structured representations that facilitate reasoning within a context of interest.

THE CLIENT SIDE VISION OF THE SEMANTIC WEB
In a "client-side" semantic web approach, the HTML content of the Web is used "as is." Instead of adding additional markup, a suite of tools and applications is envisioned that extracts concepts from Web pages for uploading into a structured knowledge base (KB). The KB is then used for advanced querying, inference, and problem solving.

The client-side approach has a significant advantage over the standard server-side semantic web (hereafter SSSW) because it reduces the content development bottleneck. The client-side semantic web (hereafter CSSW) enables the semi-automatic construction of a "virtual" web on the user's machine (or, in a multi-user environment, on a server that is available to a number of users) that retains hypertext links back to the original Web content but adds a set of logical assertions that captures the meanings germane to the user's or users' interests. It therefore helps solve both problems with the SSSW approach explained above: manual annotation effort is reduced by semi-automatic extraction techniques, and because the KB is constructed with a particular interest in mind, there is a clear context for the creation of logical assertions (i.e., the user is creating a KB for a particular purpose that, ex hypothesi, is known to the user in advance).

KNOWLEDGE ACQUISITION
Because Web content is left as HTML, the CSSW approach must solve a knowledge acquisition problem: how does one transform semi-structured content into structured representations? The short, technical answer to this question is: with an information extraction (IE) system. There are in fact a number of both commercial and open-source IE systems available that can extract concepts and even simple relations from text sources, outputting them into XML or other structured languages (e.g., RDF, DAML). Lockheed Martin's AeroText™ IE system, for instance, can extract key phrases and elements from text documents, as well as perform sophisticated analysis of document structure (identifying tables, lists, and other elements), in addition to complex event extraction and some identification of binary relations [Lockheed Martin Management and Data Systems, 2001-2003].

There are a number of challenges to using information extraction systems for the CSSW. First, no matter how effective an IE system, one cannot yet expect 100% accuracy and recall on arbitrary source documents. This means that false negatives and positives are unavoidable (at least in unconstrained domains). A CSSW system must have functionality in the user interface (UI) to permit selection and editing of extracted results by a human user.

Second, there is a problem of specificity: IE systems suitable for handling arbitrary source content (such as, for instance, different Web pages) will not easily support pattern matching for numerous specific concepts. The base functionality of AeroText, for instance, identifies distinctions between 'organizations' and 'people', but not (in the general case) between types of organizations such as the Red Cross (non-profit organization), the University of Texas at Austin (higher education institution), Dell Computer (corporation), and the Smithsonian Institution (art and science institution). A user working on a research project on types of organizations in the United States would get all these types of organizations extracted merely as "Organization"—hardly helpful in this context.

Lastly, there is a problem of identifying proper relations: while IE systems excel at identifying patterns for particular things (e.g., proper nouns), they are less effective with relations between things (e.g., binary relations). The reason, to put it bluntly, is that natural language understanding by machines is in too rudimentary a state to handle the grammatical variations in free-text occurrences of relations. Extraction rules that do match multi-word patterns and can consistently resolve the semantics of relations embedded in natural language assertions are either domain specific or difficult to construct, or both. (For example, extracting the relation "managed" in "John Doe managed numerous food chains in California before becoming vice president of operations" as an instance of the predicate "managerOf" in an ontology would require distinguishing this sense of 'managed' from the following: "Mary managed the sale of half of her stocks before the market took a downturn.")

A solution to the first and second problems above (and to some extent the third) is to customize an information extraction system's rule base to perform well on documents containing certain targeted content, such as the specific concepts of interest in those documents. This type of solution, however, would not appear to be a complete answer to engineering a CSSW approach, since the original impetus of such an approach was to reduce technical time and effort, but unfortunately customization of IE rule bases is, like manual annotation for the semantic web, a non-trivial technical effort.
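The ambiguity of relation words such as 'managed' discussed above can be made concrete with a small sketch. The following toy Python fragment is a deliberately simplified illustration (not how AeroText or any production IE system works; the gazetteer, pattern, and names are invented for this example): it emits a managerOf triple only when the candidate object resembles an organization, so the stock-sale sense of 'managed' is rejected.

```python
import re

# Toy gazetteer standing in for an ontology's Organization-like classes.
# (Invented for this sketch; a real IE system draws on far richer resources.)
ORG_TERMS = {"food chains", "restaurants", "stores", "companies"}

# Matches "<Firstname Lastname> managed <object>" up to a stop token.
PATTERN = re.compile(
    r"(?P<subj>[A-Z][a-z]+ [A-Z][a-z]+) managed (?P<obj>[a-z ]+?)(?: in | before |\.|,)"
)

def extract_manager_of(text):
    """Emit (subject, 'managerOf', object) triples, but only when the
    object looks organizational; other senses of 'managed' are rejected."""
    triples = []
    for m in PATTERN.finditer(text):
        obj = m.group("obj").strip()
        if any(term in obj for term in ORG_TERMS):
            triples.append((m.group("subj"), "managerOf", obj))
    return triples

# The managerial sense passes the filter...
print(extract_manager_of(
    "John Doe managed numerous food chains in California before "
    "becoming vice president of operations."))
# → [('John Doe', 'managerOf', 'numerous food chains')]

# ...while the stock-sale sense does not (a two-word name is used
# here only to suit the toy subject pattern).
print(extract_manager_of(
    "Mary Smith managed the sale of half of her stocks before the market took a downturn."))
# → []
```

Even this crude filter shows why such rules end up domain specific: each new sense of a relation word demands another hand-built constraint.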
However, the position advanced here is that the knowledge acquisition challenges specific to the CSSW approach are nonetheless solvable, either completely or in large degree. It is more difficult to draw the same conclusion for the SSSW. In other words, both approaches have bottlenecks, but the CSSW approach structures the task in such a way that workable remedies seem possible. This suggests that the CSSW approach holds promise for more dynamic progress in the mid, long, and even short term.

THE IMPORTANCE OF CONTEXT
A key difference between the two semantic web approaches is in considerations of context. On the one hand, the SSSW requires developers to describe the content of their Web pages in logic, so that the content is understandable (processable) by other software agents with a large range of different goals when visiting Web sites. The problem here is that the developer can't be sure what type of information will be most helpful, and so can't make effective decisions about what to encode. For instance, someone might host a travel site with content on different cities, places, transportation options, fares, special offers, monuments, and places of interest. Well, what should they represent logically? Of course, it depends on what types of queries and inferences they can expect. It will probably make sense to provide a taxonomy of types:

Car is a type of Vehicle.
Airplane is a type of Vehicle.
Taxi is a type of Car.
Boeing737 is a type of Airplane.

But it is less clear what types of inference rules to spend time supporting: does one anticipate agents and queries that want to check:

((If Place is a Destination and
Customer arrivesAt Destination on Day and
WeatherForecast for Day is Severe) then
Suggest Cancellation or a NewDay)?

Not unreasonable, to be sure. But creating vocabulary for "WeatherForecast", as well as attributes like "Severe", will be pointless if an agent visiting the site doesn't use such a rule. Given that there might be tens, hundreds, or even thousands of software agents reading travel sites for various reasons (to continue this example), and that it is quite likely there won't be perfect matches between inference rules and the logical concepts and assertions on source pages—in which case nothing will be gained by writing the concepts and assertions—it is hard to make a case for doing the knowledge representation at all.

Now consider the CSSW approach. In this case, we begin with the assumption that a user has a particular interest in creating structured content. For instance, a user may want to construct a KB containing assertions about artificial intelligence (AI) research labs in academia and industry, and perform research on whether there are new markets emerging for AI-based techniques. The user can then a) specify the concepts of interest (e.g., research lab, university, corporation, AI techniques, products using AI techniques), b) extract these concepts and upload them into a KB, c) write inference rules that specifically conclude more information of interest from existing information in the KB, such as:

((If ResearchLab hasResearchArea InformationExtraction and
ResearchLab hasDirector JohnDoe) then
JohnDoe is a ContactInArea-AI),

and finally d) use the KB to ask and answer questions within the context of the research, having now a persistent knowledge source that is focused on a particular domain of interest.

Creating structured content in a context of inquiry also helps reduce information extraction customization requirements. For instance, in a particular context there will typically be a relatively small set of high-value concepts that constitute the main conceptual "framework" of the domain of interest. In the "new market identification" context described above, one might choose, say, the concepts "person", "organization", and "project." An information extraction rule base identifying instances of these generic concepts will require less development time and effort than a corresponding rule base that attempts to match patterns for all subclasses of the generic classes (e.g., the subclasses research lab, institution of higher education, and C-corporation for the superclass 'Organization'). When the user has an interest in classifying, say, the AI Lab at the University of Texas at Austin as an instance of ResearchLab in the ontology—not just an instance of Organization—this functionality can be handled in the application UI, by providing a means for the user to view, navigate, and modify the ontology and the contents of the KB. The minimal set of 'focused' terms—person, organization, project—provides the pattern matching parameters to the IE system, while any finer-grained classification is handled by the user in the UI.

An alternative approach to "offloading" development effort from IE rule base customization for each specific term of interest to UI-based KB classification efforts is to utilize machine learning (ML) techniques to semi-automatically construct extraction rules for concepts (entities). This approach presents a number of exciting possibilities, most notably the possibility of training an IE system to identify concepts of interest as a user "surfs" the Web.
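A minimal sketch of what such semi-automatic rule learning might look like (a toy illustration over invented data; real relational learners are far more sophisticated): from a few user-highlighted examples, the system induces a simple lexical cue and reuses it to propose new candidate entities for the user to confirm.

```python
import re

def induce_rule(labeled_examples):
    """From user-highlighted (sentence, entity) pairs, induce a naive
    extraction rule: the set of words seen immediately before an
    entity. (A toy stand-in for relational rule-learning algorithms.)"""
    cues = set()
    for sentence, entity in labeled_examples:
        m = re.search(r"(\w+)\s+" + re.escape(entity), sentence)
        if m:
            cues.add(m.group(1).lower())
    return cues

def apply_rule(cues, sentence):
    """Propose capitalized phrases that follow any learned cue word,
    as candidates for the user to confirm or reject in the UI."""
    candidates = []
    for m in re.finditer(r"(\w+)\s+((?:[A-Z]\w*\s?)+)", sentence):
        if m.group(1).lower() in cues:
            candidates.append(m.group(2).strip())
    return candidates

# Two hypothetical user-labeled snippets...
examples = [
    ("She directs the AI Lab at UT.", "AI Lab"),
    ("He founded the Robotics Lab in 1999.", "Robotics Lab"),
]
cues = induce_rule(examples)   # learns the (over-general) cue 'the'
print(apply_rule(cues, "They visited the Vision Lab yesterday."))  # → ['Vision Lab']
```

With only two examples the induced cue ('the') is obviously too permissive, which is exactly the training-data limitation just noted.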
However, because ML approaches typically require many training examples before accuracy can be achieved (and again, 100% accuracy in unconstrained domains is not likely), such an approach is not a panacea.

For a large KB, training IE rules to find instances of each particular class in an ontology is likely still to be time- and effort-intensive. However, the approach favored here is to investigate the use of ML techniques for improving the identification of instances of a smaller set of focused terms, such as those explained above, that capture the context of a particular research project. This application of machine learning seems highly promising. For instance, ML techniques could be used to customize an IE rule base to identify research labs as instances of Organization. Users wishing to re-classify research labs as instances of the subclass ResearchLab in the ontology could then perform re-classification by simple specialization of the term in the KB.

SCOPE OF THE CLIENT SIDE APPROACH
The approach outlined above specifically addresses limitations apparent in the SSSW approach. By using IE techniques to semi-automatically extract relevant concepts, and by focusing on a particular research context when undertaking more complicated annotation strategies (e.g., making assertions for automated inference), a usable KB can be constructed that facilitates more advanced Q&A and reasoning in a particular domain.

However, there are a number of considerations that should be addressed here. One, because there is still a significant amount of work required to transform free text or HTML markup into a structured, usable KB (some IE rule base customization will be required, as well as manual effort in making relational assertions and classifying concepts in the KB), the CSSW approach will not be suitable for non-persistent "quick" projects that can be answered by performing a few keyword searches on the Web. Such projects are still best handled by existing technologies, such as the Google™ search engine.

Construction of a KB makes the most sense when projects are complex, require the combining of many different types of information, and are relatively long-term and require persistent repositories. In other words, research that spans multiple days, weeks, or even months, and that can't easily be handled via conventional browser techniques (saving links into "Favorites" in Internet Explorer) without losing track of the knowledge added and the knowledge still needed, is suited to a more structured approach such as that outlined here. Also, the assumption is that the time involved in creating a KB to facilitate reasoning about a particular problem will be offset by the amount of sustained use of the KB a user can expect. Ideally, the KB becomes a semi-permanent repository for a user (or users) that can be referenced, modified, and added to as needed.

Hence, the vision that emerges in the CSSW is a "hybrid" notion of the next-generation web, where structured KBs that serve particular purposes co-exist with standard presentational markup, and the choice of whether to enhance the Web is made by particular users within a context of interest.

IMPLEMENTATION
A proof of concept for the CSSW approach is currently under development at the Digital Media Collaboratory (DMC), IC2 Institute, the University of Texas at Austin (http://dmc.ic2.org). The Focused Knowledge Base (FKB) project implements a client-server architecture that allows multiple users to log in to the system, perform research on the Web, and save facts and knowledge from the Web into a KB. The FKB system uses the AeroText™ information extraction engine to tag 'focused' terms, which are presented on a separate "knowledge page" in the UI together with a list of relations (taken from the ontology) that can be easily connected to subject and object terms to form a "triple" subject-verb-object assertion in the DAML+OIL language.

Assertions, together with contextual information (e.g., login ID, project name, date, time, area of knowledge), are uploaded into an ontology server. The KAON Ontology server is used to store knowledge in DAML+OIL format [KAON, 2003]. Users can thus browse the Web to identify pages relevant to a research project, enhance a page using AeroText™, add important information not supplied by the IE system (binary relations are presented in drop-down boxes based on the concepts in the subject and object locations), and easily update the KB with the new facts. (Domain-specific facts that are uploaded into the KB are subsumed by a top-level ("upper") ontology layer provided by the Standard Upper Merged Ontology (SUMO) [SUMO, 2001].)

In addition to this functionality, DMC is investigating two advanced enhancements to the system. One is the use of an embedded theorem prover. Although DAML+OIL supports standard set-theoretic operations, it provides no facility for constructing rules in the form of logical implications. Such implications, together with a suitable theorem prover such as the JTP theorem prover of the Stanford Knowledge Systems Laboratory (http://www.ksl.stanford.edu), make possible the automatic addition of new knowledge (consequences) to the KB from existing knowledge [JTP, 2003]. Rule bases that are focused to add desired information that may be implicit but still not noticed in the KB can add significant value. Two, DMC is investigating machine learning approaches to speed the construction of IE rule bases suitable for matching instances of focused terms. In particular, relational inductive algorithms for learning information extraction rules, such as those designed by Ray Mooney at the University of Texas at Austin (http://www.cs.utexas.edu/users/ml/), show promise especially for Web-based source data [Mooney, 1999].
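The kind of implication-based inference an embedded theorem prover would supply can be illustrated with a minimal forward-chaining sketch. This is plain Python standing in for JTP and DAML+OIL, and the triples and rule reuse the hypothetical ResearchLab example from earlier, so none of the names here reflect the actual FKB ontology.

```python
# Minimal forward chaining over a KB of (subject, predicate, object)
# triples, mirroring the earlier hypothetical rule: if a lab has
# research area InformationExtraction and a director, assert that the
# director is a ContactInArea-AI. (A toy sketch, not JTP or DAML+OIL.)
KB = {
    ("AILab", "hasResearchArea", "InformationExtraction"),
    ("AILab", "hasDirector", "JohnDoe"),
}

def apply_contact_rule(kb):
    """Return the new triples implied by the ContactInArea-AI rule."""
    derived = set()
    for lab, pred, area in kb:
        if pred == "hasResearchArea" and area == "InformationExtraction":
            for subj, p, director in kb:
                if subj == lab and p == "hasDirector":
                    derived.add((director, "isA", "ContactInArea-AI"))
    return derived - kb

KB |= apply_contact_rule(KB)
print(("JohnDoe", "isA", "ContactInArea-AI") in KB)  # True
```

Running such rules to a fixed point is what makes implicit but unnoticed knowledge explicit in the KB.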
CONCLUSION
The CSSW is an intriguing alternative to the SSSW vision and ameliorates a number of recognized problems. The high performance of information extraction systems such as AeroText™, coupled with a clearly defined context for Web-based research, makes the construction of a client-side "virtual" Web with structured repositories of knowledge servicing users and communities of users not just a viable, but an intriguing, option. Further research will include the use of KIF-like rules with DAML+OIL (or OWL) and an embedded theorem prover to generate additional knowledge from existing knowledge. Also, machine learning techniques that work well with Web-based information and can help speed the customization of IE systems are an active area of research that promises to make the CSSW approach even more appealing and feasible as the "next-generation" Web takes shape.

ACKNOWLEDGMENTS
I thank Melinda Jackson and the rest of the DMC staff for providing helpful comments on previous versions of this paper.

REFERENCES
1. [Berners-Lee, 2001] Berners-Lee, T., Hendler, J., Lassila, O. "The Semantic Web." Scientific American, May 2001.
2. [DAML, 2003] http://www.daml.org
3. [JTP, 2003] http://www.ksl.stanford.edu/software/JTP/
4. [KAON, 2003] http://kaon.semanticweb.org
5. [Lockheed Martin Management and Data Systems, 2001-2003] http://mds.external.lmco.com/mds/products/gims/aero/docs/AeroText-V2.5-Whitepaper-April-2003.pdf
6. [Mooney, 1999] Mooney, R., Califf, M. "Relational Learning of Pattern-Match Rules for Information Extraction." In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334, July 1999.
7. [OWL, 2003] http://www.w3.org/2001/sw/WebOnt/
8. [RDF, 2003] http://www.w3.org/RDF/
9. [SUMO, 2001] Niles, I., and Pease, A. "Towards a Standard Upper Ontology." In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds., Ogunquit, Maine, October 17-19, 2001.
10. [Stevens, 2003] Stevens, R., Wroe, C., Bechhofer, S., Lord, P., Rector, A., Goble, C. "Building ontologies in DAML+OIL." Comparative and Functional Genomics, Vol. 4, Issue 1, January/February 2003, pp. 133-141.