A Client Side Approach to Building the Semantic Web

Erik Larson
Digital Media Collaboratory, IC2 Institute
The University of Texas at Austin
1301 West 25th Street, Suite 300
Austin, TX 78705 USA
512 474 6312 Ext. 265
elarson@icc.utexas.edu

ABSTRACT
In this paper, I describe an alternative approach to building a semantic web that addresses some known challenges to existing attempts. In particular, powerful information extraction techniques are used to identify concepts of interest in Web pages. Identified concepts are then used to semi-automatically construct assertions in a computer-readable markup, reducing manual annotation requirements. It is also envisioned that these semantic assertions will be constructed specifically by communities of users with common interests. The structured knowledge bases created will then contain content that reflects the uses they were designed for, thereby facilitating effective automated reasoning and inference for real-world problems.

Keywords
Semantic web, information extraction, ontology

INTRODUCTION
The World Wide Web is a vast repository of information annotated in a human-readable format. Unfortunately, annotation that is understood by humans is typically poorly understood by machines. Because the Web was designed for human and not machine understanding, facilitating the development of enhancements to the Web such as better search and retrieval, question answering (Q&A), and automated services through intelligent Web agents is difficult and in many cases not yet practically feasible. Not surprisingly, work on a "next-generation" Web more friendly to machines is already underway, most visibly in the Semantic Web activity championed by Tim Berners-Lee, inventor of the current Web and director of the W3C (http://www.w3c.org), the Web standards and development committee.
The vision of the Semantic Web activity is an "evolution" of the existing Web into one that contains machine-readable markup [Berners-Lee, 2001]. As seen by people, the Semantic Web remains indistinguishable from the current one. Yet machines using the Semantic Web can read Web pages that contain semantic information encoded in a logic-based markup describing their content. This increased power that semantic markup gives to machines also benefits humans: if someone instructs their agent (i.e., their intelligent agent software) to find a bird watching society nearby and to schedule a visit in the next few days, the agent will know that bird watching is a type of outdoor activity and therefore that the weather is a relevant factor, checking the online local forecast (knowing that "nearby" means "local") for signs of thunderstorms, rain, or other kinds of weather incompatible with outdoor activities. The agent can then inform the person of the location of a bird watching society, visiting hours, and good days during the week to go. The evolution of the Web into the Semantic Web, in other words, creates more opportunities for exploiting the rich content on the Web to create value and provide services to everyone.

As laudable as this vision is, there are a number of problems with its practical implementation. First, much of the focus in the current Semantic Web activity is on transforming the HTML content sitting on Web servers to include semantic information in a machine-readable markup language, such as the Resource Description Framework (RDF), the DARPA Agent Markup Language with Ontology Inference Layer (DAML+OIL), or the updated version of DAML, the Web Ontology Language (OWL) [RDF, 2003], [DAML, 2003], [OWL, 2003]. This transformation requires, in effect, a re-writing of the billions of pages of content comprising the current World Wide Web—no small feat, particularly since the Semantic Web languages are much less user friendly than simple HTML. True, the Semantic Web markup languages were designed for greater ease of use than traditional knowledge representation (KR) languages based on first-order logic (they are also not as expressive; see [Stevens, 2003]), but for non-experts unversed in logic systems, annotating Web pages with RDF, DAML, or OWL represents a whole new layer of effort, particularly in relation to the WYSIWYG software for HTML annotation that is now commonplace.

Second, developers cannot effectively mark up Web documents with semantic content unless they clearly understand the context: what is the purpose of adding the new information? What function is it serving? What questions will it answer, or services will it provide, that represent a clear benefit in some well-defined context? Without this context, the average Web page developer won't likely see a clear point to creating logical markup. Such an effort would represent, in other words, a purely technical exercise.
Yet a semantic web that facilitates better machine reasoning is indeed desirable and (it is hoped) practically feasible as well. The position taken here is that the "server-side" transformation of Web content in the current Semantic Web activity is, while perhaps helpful in certain cases, nonetheless not a panacea and may even be a hindrance to the task of enhancing the capabilities of the Web for many users. An alternative, "client-side" approach that enables users to effectively transform the existing (HTML) content of the Web into more usable, structured representations that facilitate reasoning within a context of interest will be presented.

THE CLIENT SIDE VISION OF THE SEMANTIC WEB
In a "client-side" semantic web approach, the HTML content of the Web is used "as is." Instead of adding additional markup, a suite of tools and applications is envisioned that extracts concepts from Web pages for uploading into a structured knowledge base (KB). The KB is then used for advanced querying, inference, and problem solving.

The client-side approach has a significant advantage over the standard server-side semantic web (hereafter SSSW) because it reduces the content development bottleneck. The client-side semantic web (hereafter CSSW) enables the semi-automatic construction of a "virtual" web on the user's machine (or, in a multi-user environment, on a server that is available to a number of users) that retains hypertext links back to the original Web content but adds a set of logical assertions that captures the meanings germane to the user or users' interests. It therefore helps solve both problems with the SSSW approach explained above: manual annotation effort is reduced by semi-automatic extraction techniques, and because the KB is constructed with a particular interest in mind, there is a clear context for the creation of logical assertions (i.e., the user is creating a KB for a particular purpose that, ex hypothesi, is known to the user in advance).
KNOWLEDGE ACQUISITION
Because Web content is left as HTML, the CSSW approach must solve a knowledge acquisition problem: how does one transform semi-structured content into structured representations? The short, technical answer to this question is: with an information extraction (IE) system. There are in fact a number of both commercial and open-source IE systems available that can extract concepts and even simple relations from text sources, outputting them into XML or other structured languages (e.g., RDF, DAML). Lockheed Martin's AeroText™ IE system, for instance, can extract key phrases and elements from text documents, as well as perform sophisticated analysis of document structure (identifying tables, lists, and other elements), in addition to complex event extraction and some identification of binary relations [Lockheed Martin Management and Data Systems, 2001-2003].

There are a number of challenges to using information extraction systems for the CSSW. First, no matter how effective an IE system, one cannot yet expect 100% accuracy and recall on arbitrary source documents. This means that false negatives and positives are unavoidable (at least in unconstrained domains). A CSSW system must therefore have functionality in the user interface (UI) to permit selection and editing of extracted results by a human user.

Second, there is a problem of specificity: IE systems suitable for handling arbitrary source content (such as, for instance, different Web pages) will not easily support pattern matching for numerous specific concepts. The base functionality of AeroText, for instance, identifies distinctions between 'organizations' and 'people', but not (in the general case) between types of organizations such as the Red Cross (non-profit organization), the University of Texas at Austin (higher education institution), Dell Computer (corporation), and the Smithsonian Institution (art and science institution). A user working on a research project on types of organizations in the United States would get all these types of organizations extracted merely as "Organization"—hardly helpful in this context.
Lastly, there is a problem of identifying proper relations: while IE systems excel at identifying patterns for particular things (e.g., proper nouns), they are less effective with relations between things (e.g., binary relations). The reason, to put it bluntly, is that natural language understanding by machines is in too rudimentary a state to handle the grammatical variations in free-text occurrences of relations. Extraction rules that do match multi-word patterns and can consistently resolve the semantics of relations embedded in natural language assertions are either domain specific or difficult to construct, or both. (For example, extracting the relation "managed" in "John Doe managed numerous food chains in California before becoming vice president of operations" as an instance of the predicate "managerOf" in an ontology would require distinguishing between this sense of 'managed' and the following: "Mary managed the sale of half of her stocks before the market took a downturn.")
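The ambiguity can be made concrete with a toy extraction rule. The sketch below uses a single hypothetical regular expression for the surface pattern "X managed Y" (nothing like AeroText's actual rule language); it fires on both example sentences, even though only the first expresses the managerOf predicate:

```python
import re

# A toy, hypothetical extraction rule for "<Subject> managed <Object>".
# (Illustrative only -- a real IE system uses a much richer rule language.)
PATTERN = re.compile(
    r"(?P<subj>[A-Z][a-z]+(?: [A-Z][a-z]+)?) managed "
    r"(?P<obj>[A-Za-z ]+?)(?:,| before| after|\.)"
)

def extract_managed(sentence):
    """Return a (subject, predicate, object) triple, or None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("subj"), "managerOf", m.group("obj").strip())

s1 = ("John Doe managed numerous food chains in California "
      "before becoming vice president of operations.")
s2 = ("Mary managed the sale of half of her stocks "
      "before the market took a downturn.")

# The surface pattern fires on BOTH sentences, although only the first
# expresses the managerOf predicate; the second "managed" means "handled".
print(extract_managed(s1))
print(extract_managed(s2))
```

The false positive on the second sentence is exactly why resolving relation semantics requires either domain-specific rules or a human in the loop.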
A solution to the first and second problems above (and to some extent the third) is to customize an information extraction system's rule base to perform well on documents containing certain targeted content, such as the specific concepts of interest in those documents. This type of solution, however, would not appear to be a complete answer to engineering a CSSW approach, since the original impetus of such an approach was to reduce technical time and effort, and customization of IE rule bases is, like manual annotation for the semantic web, a non-trivial technical effort.

However, the position advanced here is that the knowledge acquisition challenges specific to the CSSW approach are nonetheless solvable, either completely or in large degree. It is more difficult to draw the same conclusion for the SSSW. In other words, both approaches have bottlenecks, but the CSSW approach structures the task in such a way that workable remedies seem possible. This suggests that the CSSW approach holds promise for more dynamic progress in the mid, long, and even short term.

THE IMPORTANCE OF CONTEXT
A key difference between the two semantic web approaches is in considerations of context. On the one hand, the SSSW requires developers to describe the content of their Web pages in logic, so that the content is understandable (processable) by other software agents with a large range of different goals when visiting Web sites. The problem here is that the developer can't be sure what type of information will be most helpful, and so can't make effective decisions on what to encode. For instance, someone might host a travel site with content on different cities, places, transportation options, fares, special offers, monuments, and places of interest. Well, what should they represent logically? Of course it depends on what types of queries and inferences they can expect. It will probably make sense to provide a taxonomy of types:

Car is a type of Vehicle.
Airplane is a type of Vehicle.
Taxi is a type of Car.
Boeing737 is a type of Airplane.
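A taxonomy like this amounts to a set of subclass assertions that a machine can traverse transitively. A minimal sketch (plain Python; the class names come from the travel example, everything else is hypothetical scaffolding):

```python
# The travel-site taxonomy encoded as "is a type of" (subclass) assertions.
SUBCLASS_OF = {
    "Car": "Vehicle",
    "Airplane": "Vehicle",
    "Taxi": "Car",
    "Boeing737": "Airplane",
}

def is_a(cls, ancestor):
    """Follow the subclass chain transitively (Taxi -> Car -> Vehicle)."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

print(is_a("Taxi", "Vehicle"))   # True: a Taxi is a Car, and a Car is a Vehicle
print(is_a("Boeing737", "Car"))  # False: a Boeing737 is not a kind of Car
```

Languages like RDF Schema and DAML+OIL provide exactly this subclass machinery; the point of the example is only that the taxonomy itself is the easy part.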
But it is less clear what types of inference rules to spend time supporting: does one anticipate agents and queries that want to check:

((If Place is a Destination and
Customer arrivesAt Destination on Day and
WeatherForecast for Day is Severe) then
Suggest Cancellation or a NewDay)?

Not unreasonable, to be sure. But now, creating vocabulary for "WeatherForecast", as well as attributes like "Severe", will be pointless if an agent visiting the site doesn't use such a rule. Given that there might be tens, hundreds, or even thousands of software agents reading travel sites for various reasons (to continue this example), and that it is quite likely there won't be perfect matches between inference rules and the logical concepts and assertions on source pages—in which case nothing will be gained by writing the concepts and assertions—it is hard to make a case for doing the knowledge representation at all.

Now consider the CSSW approach. In this case, we begin with the assumption that a user has a particular interest in creating structured content. For instance, a user may want to construct a KB containing assertions about artificial intelligence (AI) research labs in academia and industry, and perform research on whether there are new markets emerging for AI-based techniques. The user can then a) specify the concepts of interest (e.g., research lab, university, corporation, AI techniques, products using AI techniques), b) extract these concepts and upload them into a KB, c) write inference rules that specifically conclude more information of interest from existing information in the KB, such as:

((If ResearchLab hasResearchArea InformationExtraction and
ResearchLab hasDirector JohnDoe)
then JohnDoe is a ContactInArea-AI),

and finally d) use the KB to ask and answer questions within the context of the research, having now a persistent knowledge source that is focused on a particular domain of interest.
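The rule in step c) is essentially a forward-chaining implication over triples in the KB. A minimal sketch of how such a rule could fire (the lab name and instance data are invented for illustration; a real system would express this in DAML+OIL and hand it to a theorem prover rather than hand-written Python):

```python
# A miniature triple store with hypothetical instance data.
kb = {
    ("UTAILab", "instanceOf", "ResearchLab"),
    ("UTAILab", "hasResearchArea", "InformationExtraction"),
    ("UTAILab", "hasDirector", "JohnDoe"),
}

def apply_contact_rule(triples):
    """If a ResearchLab hasResearchArea InformationExtraction and
    hasDirector X, conclude that X is a ContactInArea-AI."""
    derived = set()
    labs = {s for (s, p, o) in triples if (p, o) == ("instanceOf", "ResearchLab")}
    for lab in labs:
        if (lab, "hasResearchArea", "InformationExtraction") in triples:
            for (s, p, o) in triples:
                if s == lab and p == "hasDirector":
                    derived.add((o, "instanceOf", "ContactInArea-AI"))
    return derived

print(apply_contact_rule(kb))
```

The crucial difference from the SSSW travel-site case is that here the rule's author and the KB's user are the same party, so the rule is guaranteed an audience.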
Creating structured content in a context of inquiry also helps reduce information extraction customization requirements. For instance, in a particular context there will typically be a relatively small set of high-value concepts that constitute the main conceptual "framework" of the domain of interest. In the "new market identification" context described above, one might choose, say, the concepts "person", "organization", and "project." An information extraction rule base identifying instances of these generic concepts will require less development time and effort than a corresponding rule base that attempts to match patterns for all subclasses of the generic classes (e.g., the subclasses research lab, institution of higher education, and C-corporation for the superclass 'Organization'). When the user has an interest in classifying, say, the AI Lab at the University of Texas at Austin as an instance of ResearchLab in the ontology—not just an instance of Organization—this functionality can be handled in the application UI, by providing a means for the user to view, navigate, and modify the ontology and the contents of the KB. The minimal set of 'focused' terms—person, organization, project—provides the pattern matching parameters to the IE system, while any finer-grained classification is handled by the user in the UI.

An alternative approach to "offloading" development effort from IE rule base customization for each specific term of interest to UI-based KB classification efforts is to utilize machine learning (ML) techniques to semi-automatically construct extraction rules for concepts (entities). This approach presents a number of exciting possibilities, most notably the possibility of training an IE system to identify concepts of interest as a user "surfs" the Web. However, because ML approaches typically require many training examples before accuracy can be achieved (and again, 100% accuracy in unconstrained domains is not likely), such an approach is not a panacea.
For a large KB, training IE rules to find instances for each particular class in an ontology is likely still to be time and effort intensive. However, the approach favored here is to investigate the use of ML techniques for improving identification of instances of a smaller set of focused terms, such as those explained above, that capture the context of a particular research project. This application of machine learning seems highly promising. For instance, ML techniques could be used to customize an IE rule base to identify research labs as instances of Organization. Users wishing to re-classify research labs as instances of the subclass ResearchLab in the ontology could then perform re-classification by simple specialization of the term in the KB.
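Re-classification by specialization can be sketched as follows: the user replaces the generic type assigned by the IE system with a subclass drawn from the ontology, and the subclass axioms keep the broader classification derivable (all names here are hypothetical):

```python
# Types assigned by the IE system using only the generic focused terms.
instance_types = {"UT AI Lab": "Organization"}

# Subclass axioms from the ontology.
subclass_of = {"ResearchLab": "Organization"}

def specialize(instance, subclass):
    """User-driven re-classification in the UI: replace the generic
    type with one of its subclasses."""
    current = instance_types[instance]
    assert subclass_of.get(subclass) == current, \
        "can only specialize to a subclass of the current type"
    instance_types[instance] = subclass

def has_type(instance, cls):
    """Superclass types remain derivable through the subclass chain."""
    t = instance_types.get(instance)
    while t is not None:
        if t == cls:
            return True
        t = subclass_of.get(t)
    return False

specialize("UT AI Lab", "ResearchLab")
print(has_type("UT AI Lab", "ResearchLab"))   # the new, finer type
print(has_type("UT AI Lab", "Organization"))  # still derivable
```

Nothing the IE system extracted is lost by the specialization; queries against the generic term continue to succeed.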
SCOPE OF THE CLIENT SIDE APPROACH
The approach outlined above specifically addresses limitations apparent in the SSSW approach. By using IE techniques to semi-automatically extract relevant concepts, and by focusing on a particular research context when undertaking more complicated annotation strategies (e.g., making assertions for automated inference), a usable KB can be constructed that facilitates more advanced Q&A and reasoning in a particular domain.

However, there are a number of considerations that should be addressed here. One, because there is still a significant amount of work required to transform free text or HTML markup into a structured, usable KB (some IE rule base customization will be required, as well as manual effort in making relational assertions and classifying concepts in the KB), the CSSW approach will not be suitable for non-persistent "quick" projects that can be answered by performing a few keyword searches on the Web. Such projects are still best handled by existing technologies, such as the Google™ search engine. Construction of a KB makes the most sense when projects are complex, require the combining of many different types of information, and are relatively long-term, requiring persistent repositories. In other words, research that spans multiple days, weeks, or even months, and that can't easily be handled via conventional browser techniques (saving links into "Favorites" in Internet Explorer) without losing track of the knowledge added and the knowledge still needed, is suited for a more structured approach such as that outlined here.
Also, the assumption is that the time involved in creating a KB to facilitate reasoning about a particular problem will be offset by the amount of sustained use of the KB a user can expect. Ideally, the KB becomes a semi-permanent repository for a user (or users) that can be referenced, modified, and added to as needed.

Hence, the vision that emerges in the CSSW is a "hybrid" notion of the next-generation web, where structured KBs that serve particular purposes co-exist with standard presentational markup, and the choice of whether to enhance the Web is made by particular users within a context of interest.

IMPLEMENTATION
A proof of concept for the CSSW approach is currently under development at the Digital Media Collaboratory (DMC), IC2 Institute, the University of Texas at Austin (http://dmc.ic2.org). The Focused Knowledge Base (FKB) project implements a client-server architecture that allows multiple users to log in to the system, perform research on the Web, and save facts and knowledge from the Web into a KB. The FKB system uses the AeroText™ information extraction engine to tag 'focused' terms, which are then presented on a separate "knowledge page" in the UI together with a list of relations (taken from the ontology) that can be easily connected to subject and object terms to form a "triple" subject-verb-object assertion in the DAML+OIL language.

Assertions, together with contextual information (e.g., login ID, project name, date, time, area of knowledge), are uploaded into an ontology server. The KAON ontology server is used to store knowledge in DAML+OIL format [KAON, 2003]. Users can thus browse the Web to identify pages relevant to a research project, enhance the page using AeroText™, add important information not supplied by the IE system (binary relations are presented in drop-down boxes based on the concepts in the subject and object locations), and easily update the KB with the new facts. (Domain-specific facts that are uploaded into the KB are subsumed by a top-level ("upper") ontology layer provided by the Suggested Upper Merged Ontology (SUMO) [SUMO, 2001].)
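The shape of what gets uploaded—a subject-verb-object triple bundled with its contextual information—might be sketched like this (an illustrative structure only; the field names and values are hypothetical, not the FKB system's or KAON's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    """A subject-verb-object triple plus the contextual information
    uploaded alongside it. Field names are illustrative, not the FKB
    system's actual schema."""
    subject: str
    predicate: str
    obj: str
    login_id: str
    project: str
    timestamp: str

    def as_triple(self):
        return (self.subject, self.predicate, self.obj)

a = Assertion(
    subject="UTAILab",
    predicate="hasDirector",
    obj="JohnDoe",
    login_id="elarson",
    project="AI market survey",
    timestamp="2003-10-01T12:00:00",
)
print(a.as_triple())
```

Keeping the provenance fields alongside the triple is what lets the KB remain a per-project, per-user repository rather than an anonymous pool of facts.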
In addition to this functionality, DMC is investigating two advanced enhancements to the system. One is the use of an embedded theorem prover. Although DAML+OIL supports standard set-theoretic operations, it provides no facility for constructing rules in the form of logical implications. Such implications, together with a suitable theorem prover such as the JTP theorem prover of the Stanford Knowledge Systems Laboratory (http://www.ksl.stanford.edu), make possible the automatic addition of new knowledge (consequences) to the KB from existing knowledge [JTP, 2003]. Rule bases that are focused to add desired information that may be implicit but not yet noticed in the KB can add significant value. Two, DMC is investigating machine learning approaches to speed the construction of IE rule bases suitable for matching instances of focused terms. In particular, relational inductive algorithms for learning information extraction rules, such as those designed by Ray Mooney at the University of Texas at Austin (http://www.cs.utexas.edu/users/ml/), show promise especially for Web-based source data [Mooney, 1999].

CONCLUSION
The CSSW is an intriguing alternative to the SSSW vision and ameliorates a number of recognized problems. The high performance of information extraction systems such as AeroText™, coupled with a clearly defined context for Web-based research, makes the construction of a client-side "virtual" Web with structured repositories of knowledge servicing users and communities of users not just a viable, but an intriguing, option. Further research will include the use of KIF-like rules with DAML+OIL (or OWL) and an embedded theorem prover to generate additional knowledge from existing knowledge. Also, machine learning techniques that work well with Web-based information and can help speed the customization of IE systems are an active area of research that promises to make the CSSW approach even more appealing and feasible as the "next-generation" Web takes shape.

ACKNOWLEDGMENTS
I thank Melinda Jackson and the rest of the DMC staff for providing helpful comments on previous versions of this paper.

REFERENCES
1. [Berners-Lee, 2001] Berners-Lee, T., Hendler, J., Lassila, O. "The Semantic Web." Scientific American, May 2001.
2. [DAML, 2003] http://www.daml.org
3. [JTP, 2003] http://www.ksl.stanford.edu/software/JTP/
4. [KAON, 2003] http://kaon.semanticweb.org
5. [Lockheed Martin Management and Data Systems, 2001-2003] http://mds.external.lmco.com/mds/products/gims/aero/docs/AeroText-V2.5-Whitepaper-April-2003.pdf
6. [Mooney, 1999] Mooney, R., Califf, M. "Relational Learning of Pattern-Match Rules for Information Extraction." In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334, July 1999.
7. [OWL, 2003] http://www.w3.org/2001/sw/WebOnt/
8. [RDF, 2003] http://www.w3.org/RDF/
9. [SUMO, 2001] Niles, I., and Pease, A. "Towards a Standard Upper Ontology." In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds., Ogunquit, Maine, October 17-19, 2001.
10. [Stevens, 2003] Stevens, R., Wroe, C., Bechhofer, S., Lord, P., Rector, A., Goble, C. "Building ontologies in DAML+OIL." Comparative and Functional Genomics 4(1), January/February 2003, pp. 133-141.