A Client Side Approach to Building the Semantic Web

Erik Larson
Digital Media Collaboratory, IC2 Institute
The University of Texas at Austin
1301 West 25th Street, Suite 300
Austin, TX 78705 USA
512 474 6312 Ext. 265
elarson@icc.utexas.edu

ABSTRACT
In this paper, I describe an alternative approach to building a semantic web that addresses some known challenges to existing attempts. In particular, powerful information extraction techniques are used to identify concepts of interest in Web pages. Identified concepts are then used to semi-automatically construct assertions in a computer-readable markup, reducing manual annotation requirements. It is also envisioned that these semantic assertions will be constructed specifically by communities of users with common interests. The structured knowledge bases created will then contain content that reflects the uses they were designed for, thereby facilitating effective automated reasoning and inference for real-world problems.

Keywords
Semantic web, information extraction, ontology

INTRODUCTION
The World Wide Web is a vast repository of information annotated in a human-readable format. Unfortunately, annotation that is understood by humans is typically poorly understood by machines. Because the Web was designed for human and not machine understanding, facilitating the development of enhancements to the Web such as better search and retrieval, question answering (Q&A), and automated services through intelligent Web agents is difficult and in many cases not yet practically feasible. Not surprisingly, work on a "next-generation" Web more friendly to machines is already underway, most visibly in the Semantic Web activity championed by Tim Berners-Lee, inventor of the current Web and director of the W3C (http://www.w3c.org), the Web standards and development committee.
The vision of the Semantic Web activity is an "evolution" of the existing Web into one that contains machine-readable markup [Berners-Lee, 2001]. As seen by people, the Semantic Web remains indistinguishable from the current one. Yet machines using the Semantic Web can read Web pages that contain semantic information encoded in a logic-based markup describing their content. This increased power that semantic markup gives to machines also benefits humans: if someone instructs their agent (i.e., their intelligent agent software) to find a bird watching society nearby and to schedule a visit in the next few days, the agent will know that bird watching is a type of outdoor activity and therefore that the weather is a relevant factor, checking the online local forecast (knowing that "nearby" means "local") for signs of thunderstorms, rain, or other kinds of weather incompatible with outdoor activities. The agent can then inform the person of the location of a bird watching society, visiting hours, and good days during the week to go. The evolution of the Web into the Semantic Web, in other words, creates more opportunities for exploiting the rich content on the Web to create value and provide services to everyone.

As laudable as this vision is, there are a number of problems with its practical implementation. First, much of the focus in the current Semantic Web activity is on transforming the HTML content sitting on Web servers to include semantic information in a machine-readable markup language, such as the Resource Description Framework (RDF), the DARPA Agent Markup Language with Ontology Inference Layer (DAML+OIL), or the updated version of DAML, the Web Ontology Language (OWL) [RDF, 2003], [DAML, 2003], [OWL, 2003]. This transformation requires, in effect, a re-writing of the billions of pages of content comprising the current World Wide Web—no small feat, particularly since the Semantic Web languages are much less user friendly than simple HTML. True, the Semantic Web markup languages were designed for greater ease of use than traditional knowledge representation (KR) languages based on first-order logic (they are also not as expressive; see [Stevens, 2003]), but for non-experts unversed in logic systems, annotating Web pages with RDF, DAML, or OWL represents a whole new layer of effort, particularly in relation to the WYSIWYG software for HTML annotation that is now commonplace.

Second, developers cannot effectively mark up Web documents with semantic content unless they clearly understand the context: what is the purpose of adding the new information? What function is it serving? What questions will it answer, or services will it provide, that represent a clear benefit in some well-defined context? Without this context, the average Web page developer won't likely see a clear point to creating logical markup. Such an effort would represent, in other words, a purely technical exercise.
Yet a semantic web that facilitates better machine reasoning is indeed desirable and (it is hoped) practically feasible as well. The position taken here is that the "server-side" transformation of Web content in the current Semantic Web activity is, while perhaps helpful in certain cases, nonetheless not a panacea and may even be a hindrance to the task of enhancing the capabilities of the Web for many users. An alternative, "client-side" approach that enables users to effectively transform the existing (HTML) content of the Web into more usable, structured representations that facilitate reasoning within a context of interest will be presented.

THE CLIENT SIDE VISION OF THE SEMANTIC WEB
In a "client-side" semantic web approach, the HTML content of the Web is used "as is." Instead of adding additional markup, a suite of tools and applications is envisioned that extracts concepts from Web pages for uploading into a structured knowledge base (KB). The KB is then used for advanced querying, inference, and problem solving.

The client-side approach has a significant advantage over the standard server-side semantic web (hereafter SSSW) because it reduces the content development bottleneck. The client-side semantic web (hereafter CSSW) enables the semi-automatic construction of a "virtual" web on the user's machine (or, in a multi-user environment, on a server that is available to a number of users) that retains hypertext links back to the original Web content but adds a set of logical assertions that captures the meanings germane to the user or users' interests. It therefore helps solve both problems with the SSSW approach explained above: manual annotation effort is reduced by semi-automatic extraction techniques, and because the KB is constructed with a particular interest in mind, there is a clear context for the creation of logical assertions (i.e., the user is creating a KB for a particular purpose that, ex hypothesi, is known to the user in advance).
KNOWLEDGE ACQUISITION
Because Web content is left as HTML, the CSSW approach must solve a knowledge acquisition problem: how does one transform semi-structured content into structured representations? The short, technical answer to this question is: with an information extraction (IE) system. There are in fact a number of both commercial and open-source IE systems available that can extract concepts and even simple relations from text sources, outputting them into XML or other structured languages (e.g., RDF, DAML). Lockheed Martin's AeroText™ IE system, for instance, can extract key phrases and elements from text documents, as well as perform sophisticated analysis of document structure (identifying tables, lists, and other elements), in addition to complex event extraction and some identification of binary relations [Lockheed Martin Management and Data Systems, 2001-2003].

There are a number of challenges to using information extraction systems for the CSSW. First, no matter how effective an IE system, one cannot yet expect 100% accuracy and recall on arbitrary source documents. This means that false negatives and positives are unavoidable (at least in unconstrained domains). A CSSW system must therefore have functionality in the user interface (UI) to permit selection and editing of extracted results by a human user.

Second, there is a problem of specificity: IE systems suitable for handling arbitrary source content (such as, for instance, different Web pages) will not easily support pattern matching for numerous specific concepts. The base functionality of AeroText, for instance, identifies distinctions between 'organizations' and 'people', but not (in the general case) between types of organizations such as the Red Cross (non-profit organization), the University of Texas at Austin (higher education institution), Dell Computer (corporation), and the Smithsonian Institution (art and science institution). A user working on a research project on types of organizations in the United States would get all these types of organizations extracted merely as "Organization"—hardly helpful in this context.
Lastly, there is a problem of identifying proper relations: while IE systems excel at identifying patterns for particular things (e.g., proper nouns), they are less effective with relations between things (e.g., binary relations). The reason, to put it bluntly, is that natural language understanding by machines is in too rudimentary a state to handle the grammatical variations in free-text occurrences of relations. Extraction rules that do match multi-word patterns and can consistently resolve the semantics of relations embedded in natural language assertions are either domain specific or difficult to construct, or both. (For example, extracting the relation "managed" in "John Doe managed numerous food chains in California before becoming vice president of operations" as an instance of the predicate "managerOf" in an ontology would require distinguishing between this sense of 'managed' and the following: "Mary managed the sale of half of her stocks before the market took a downturn.")
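The ambiguity can be made concrete with a toy extraction rule. The sketch below uses a single hypothetical regular expression for the surface pattern "X managed Y" (nothing like AeroText's actual rule language); it fires on both example sentences, even though only the first expresses the managerOf predicate:

```python
import re

# A toy, hypothetical extraction rule for "<Subject> managed <Object>".
# (Illustrative only -- a real IE system uses a much richer rule language.)
PATTERN = re.compile(
    r"(?P<subj>[A-Z][a-z]+(?: [A-Z][a-z]+)?) managed "
    r"(?P<obj>[A-Za-z ]+?)(?:,| before| after|\.)"
)

def extract_managed(sentence):
    """Return a (subject, predicate, object) triple, or None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("subj"), "managerOf", m.group("obj").strip())

s1 = ("John Doe managed numerous food chains in California "
      "before becoming vice president of operations.")
s2 = ("Mary managed the sale of half of her stocks "
      "before the market took a downturn.")

# The surface pattern fires on BOTH sentences, although only the first
# expresses the managerOf predicate; the second "managed" means "handled".
print(extract_managed(s1))
print(extract_managed(s2))
```

The false positive on the second sentence is exactly why resolving relation semantics requires either domain-specific rules or a human in the loop.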
A solution to the first and second problems above (and to some extent the third) is to customize an information extraction system's rule base to perform well on documents containing certain targeted content, such as the specific concepts of interest in those documents. This type of solution, however, would not appear to be a complete answer to engineering a CSSW approach, since the original impetus of such an approach was to reduce technical time and effort, and customization of IE rule bases is, like manual annotation for the semantic web, a non-trivial technical effort.

However, the position advanced here is that the knowledge acquisition challenges specific to the CSSW approach are nonetheless solvable, either completely or in large degree. It is more difficult to draw the same conclusion for the SSSW. In other words, both approaches have bottlenecks, but the CSSW approach structures the task in such a way that workable remedies seem possible. This suggests that the CSSW approach holds promise for more dynamic progress in the mid, long, and even short term.

THE IMPORTANCE OF CONTEXT
A key difference between the two semantic web approaches is in considerations of context. On the one hand, the SSSW requires developers to describe the content of their Web pages in logic, so that the content is understandable (processable) by other software agents with a large range of different goals when visiting Web sites. The problem here is that the developer can't be sure what type of information will be most helpful, and so can't make effective decisions on what to encode. For instance, someone might host a travel site with content on different cities, places, transportation options, fares, special offers, monuments, and places of interest. Well, what should they represent logically? Of course it depends on what types of queries and inferences they can expect. It will probably make sense to provide a taxonomy of types:

Car is a type of Vehicle.
Airplane is a type of Vehicle.
Taxi is a type of Car.
Boeing737 is a type of Airplane.
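A taxonomy like this amounts to a set of subclass assertions that a machine can traverse transitively. A minimal sketch (plain Python; the class names come from the travel example, everything else is hypothetical scaffolding):

```python
# The travel-site taxonomy encoded as "is a type of" (subclass) assertions.
SUBCLASS_OF = {
    "Car": "Vehicle",
    "Airplane": "Vehicle",
    "Taxi": "Car",
    "Boeing737": "Airplane",
}

def is_a(cls, ancestor):
    """Follow the subclass chain transitively (Taxi -> Car -> Vehicle)."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

print(is_a("Taxi", "Vehicle"))   # True: a Taxi is a Car, and a Car is a Vehicle
print(is_a("Boeing737", "Car"))  # False: a Boeing737 is not a kind of Car
```

Languages like RDF Schema and DAML+OIL provide exactly this subclass machinery; the point of the example is only that the taxonomy itself is the easy part.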
But it is less clear what types of inference rules to spend time supporting: does one anticipate agents and queries that want to check:

((If Place is a Destination and
Customer arrivesAt Destination on Day and
WeatherForecast for Day is Severe) then
Suggest Cancellation or a NewDay)?

Not unreasonable, to be sure. But now, creating vocabulary for "WeatherForecast", as well as attributes like "Severe", will be pointless if an agent visiting the site doesn't use such a rule. Given that there might be tens, hundreds, or even thousands of software agents reading travel sites for various reasons (to continue this example), and that it is quite likely there won't be perfect matches between inference rules and the logical concepts and assertions on source pages—in which case nothing will be gained by writing the concepts and assertions—it is hard to make a case for doing the knowledge representation at all.

Now consider the CSSW approach. In this case, we begin with the assumption that a user has a particular interest in creating structured content. For instance, a user may want to construct a KB containing assertions about artificial intelligence (AI) research labs in academia and industry, and perform research on whether there are new markets emerging for AI-based techniques. The user can then a) specify the concepts of interest (e.g., research lab, university, corporation, AI techniques, products using AI techniques), b) extract these concepts and upload them into a KB, c) write inference rules that specifically conclude more information of interest from existing information in the KB, such as:

((If ResearchLab hasResearchArea InformationExtraction and
ResearchLab hasDirector JohnDoe)
then JohnDoe is a ContactInArea-AI),

and finally d) use the KB to ask and answer questions within the context of the research, having now a persistent knowledge source that is focused on a particular domain of interest.
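The rule in step c) is essentially a forward-chaining implication over triples in the KB. A minimal sketch of how such a rule could fire (the lab name and instance data are invented for illustration; a real system would express this in DAML+OIL and hand it to a theorem prover rather than hand-written Python):

```python
# A miniature triple store with hypothetical instance data.
kb = {
    ("UTAILab", "instanceOf", "ResearchLab"),
    ("UTAILab", "hasResearchArea", "InformationExtraction"),
    ("UTAILab", "hasDirector", "JohnDoe"),
}

def apply_contact_rule(triples):
    """If a ResearchLab hasResearchArea InformationExtraction and
    hasDirector X, conclude that X is a ContactInArea-AI."""
    derived = set()
    labs = {s for (s, p, o) in triples if (p, o) == ("instanceOf", "ResearchLab")}
    for lab in labs:
        if (lab, "hasResearchArea", "InformationExtraction") in triples:
            for (s, p, o) in triples:
                if s == lab and p == "hasDirector":
                    derived.add((o, "instanceOf", "ContactInArea-AI"))
    return derived

print(apply_contact_rule(kb))
```

The crucial difference from the SSSW travel-site case is that here the rule's author and the KB's user are the same party, so the rule is guaranteed an audience.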
Creating structured content in a context of inquiry also helps reduce information extraction customization requirements. For instance, in a particular context there will typically be a relatively small set of high-value concepts that constitute the main conceptual "framework" of the domain of interest. In the "new market identification" context described above, one might choose, say, the concepts "person", "organization", and "project." An information extraction rule base identifying instances of these generic concepts will require less development time and effort than a corresponding rule base that attempts to match patterns for all subclasses of the generic classes (e.g., the subclasses research lab, institution of higher education, and C-corporation for the superclass 'Organization'). When the user has an interest in classifying, say, the AI Lab at the University of Texas at Austin as an instance of ResearchLab in the ontology—not just an instance of Organization—this functionality can be handled in the application UI, by providing a means for the user to view, navigate, and modify the ontology and the contents of the KB. The minimal set of 'focused' terms—person, organization, project—provides the pattern matching parameters to the IE system, while any finer-grained classification is handled by the user in the UI.

An alternative approach to "offloading" development effort from IE rule base customization for each specific term of interest to UI-based KB classification efforts is to utilize machine learning (ML) techniques to semi-automatically construct extraction rules for concepts (entities). This approach presents a number of exciting possibilities, most notably the possibility of training an IE system to identify concepts of interest as a user "surfs" the Web. However, because ML approaches typically require many training examples before accuracy can be achieved (and again, 100% accuracy in unconstrained domains is not likely), such an approach is not a panacea.
For a large KB, training IE rules to find instances for each particular class in an ontology is likely still to be time and effort intensive. However, the approach favored here is to investigate the use of ML techniques for improving identification of instances of a smaller set of focused terms, such as those explained above, that capture the context of a particular research project. This application of machine learning seems highly promising. For instance, ML techniques could be used to customize an IE rule base to identify research labs as instances of Organization. Users wishing to re-classify research labs as instances of the subclass ResearchLab in the ontology could then perform re-classification by simple specialization of the term in the KB.
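Re-classification by specialization can be sketched as follows: the user replaces the generic type assigned by the IE system with a subclass drawn from the ontology, and the subclass axioms keep the broader classification derivable (all names here are hypothetical):

```python
# Types assigned by the IE system using only the generic focused terms.
instance_types = {"UT AI Lab": "Organization"}

# Subclass axioms from the ontology.
subclass_of = {"ResearchLab": "Organization"}

def specialize(instance, subclass):
    """User-driven re-classification in the UI: replace the generic
    type with one of its subclasses."""
    current = instance_types[instance]
    assert subclass_of.get(subclass) == current, \
        "can only specialize to a subclass of the current type"
    instance_types[instance] = subclass

def has_type(instance, cls):
    """Superclass types remain derivable through the subclass chain."""
    t = instance_types.get(instance)
    while t is not None:
        if t == cls:
            return True
        t = subclass_of.get(t)
    return False

specialize("UT AI Lab", "ResearchLab")
print(has_type("UT AI Lab", "ResearchLab"))   # the new, finer type
print(has_type("UT AI Lab", "Organization"))  # still derivable
```

Nothing the IE system extracted is lost by the specialization; queries against the generic term continue to succeed.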
SCOPE OF THE CLIENT SIDE APPROACH
The approach outlined above specifically addresses limitations apparent in the SSSW approach. By using IE techniques to semi-automatically extract relevant concepts, and by focusing on a particular research context when undertaking more complicated annotation strategies (e.g., making assertions for automated inference), a usable KB can be constructed that facilitates more advanced Q&A and reasoning in a particular domain.

However, there are a number of considerations that should be addressed here. One, because there is still a significant amount of work required to transform free text or HTML markup into a structured, usable KB (some IE rule base customization will be required, as well as manual effort in making relational assertions and classifying concepts in the KB), the CSSW approach will not be suitable for non-persistent "quick" projects that can be answered by performing a few keyword searches on the Web. Such projects are still best handled by existing technologies, such as the Google™ search engine. Construction of a KB makes the most sense when projects are complex, require the combining of many different types of information, and are relatively long-term, requiring persistent repositories. In other words, research that spans multiple days, weeks, or even months, and that can't easily be handled via conventional browser techniques (saving links into "Favorites" in Internet Explorer) without losing track of the knowledge added and the knowledge still needed, is suited for a more structured approach such as that outlined here.
Also, the assumption is that the time involved in creating a KB to facilitate reasoning about a particular problem will be offset by the amount of sustained use of the KB a user can expect. Ideally, the KB becomes a semi-permanent repository for a user (or users) that can be referenced, modified, and added to as needed.

Hence, the vision that emerges in the CSSW is a "hybrid" notion of the next-generation web, where structured KBs that serve particular purposes co-exist with standard presentational markup, and the choice of whether to enhance the Web is made by particular users within a context of interest.

IMPLEMENTATION
A proof of concept for the CSSW approach is currently under development at the Digital Media Collaboratory (DMC), IC2 Institute, the University of Texas at Austin (http://dmc.ic2.org). The Focused Knowledge Base (FKB) project implements a client-server architecture that allows multiple users to log in to the system, perform research on the Web, and save facts and knowledge from the Web into a KB. The FKB system uses the AeroText™ information extraction engine to tag 'focused' terms, which are then presented on a separate "knowledge page" in the UI together with a list of relations (taken from the ontology) that can be easily connected to subject and object terms to form a "triple" subject-verb-object assertion in the DAML+OIL language.

Assertions, together with contextual information (e.g., login ID, project name, date, time, area of knowledge), are uploaded into an ontology server. The KAON ontology server is used to store knowledge in DAML+OIL format [KAON, 2003]. Users can thus browse the Web to identify pages relevant to a research project, enhance the page using AeroText™, add important information not supplied by the IE system (binary relations are presented in drop-down boxes based on the concepts in the subject and object locations), and easily update the KB with the new facts. (Domain-specific facts that are uploaded into the KB are subsumed by a top-level ("upper") ontology layer provided by the Suggested Upper Merged Ontology (SUMO) [SUMO, 2001].)
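The shape of what gets uploaded—a subject-verb-object triple bundled with its contextual information—might be sketched like this (an illustrative structure only; the field names and values are hypothetical, not the FKB system's or KAON's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    """A subject-verb-object triple plus the contextual information
    uploaded alongside it. Field names are illustrative, not the FKB
    system's actual schema."""
    subject: str
    predicate: str
    obj: str
    login_id: str
    project: str
    timestamp: str

    def as_triple(self):
        return (self.subject, self.predicate, self.obj)

a = Assertion(
    subject="UTAILab",
    predicate="hasDirector",
    obj="JohnDoe",
    login_id="elarson",
    project="AI market survey",
    timestamp="2003-10-01T12:00:00",
)
print(a.as_triple())
```

Keeping the provenance fields alongside the triple is what lets the KB remain a per-project, per-user repository rather than an anonymous pool of facts.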
In addition to this functionality, DMC is investigating two advanced enhancements to the system. One is the use of an embedded theorem prover. Although DAML+OIL supports standard set-theoretic operations, it provides no facility for constructing rules in the form of logical implications. Such implications, together with a suitable theorem prover such as the JTP theorem prover of the Stanford Knowledge Systems Laboratory (http://www.ksl.stanford.edu), make possible the automatic addition of new knowledge (consequences) to the KB from existing knowledge [JTP, 2003]. Rule bases that are focused to add desired information that may be implicit but not yet noticed in the KB can add significant value. Two, DMC is investigating machine learning approaches to speed the construction of IE rule bases suitable for matching instances of focused terms. In particular, relational inductive algorithms for learning information extraction rules, such as those designed by Ray Mooney at the University of Texas at Austin (http://www.cs.utexas.edu/users/ml/), show promise especially for Web-based source data [Mooney, 1999].

CONCLUSION
The CSSW is an intriguing alternative to the SSSW vision and ameliorates a number of recognized problems. The high performance of information extraction systems such as AeroText™, coupled with a clearly defined context for Web-based research, makes the construction of a client-side "virtual" Web with structured repositories of knowledge servicing users and communities of users not just a viable, but an intriguing, option. Further research will include the use of KIF-like rules with DAML+OIL (or OWL) and an embedded theorem prover to generate additional knowledge from existing knowledge. Also, machine learning techniques that work well with Web-based information and can help speed the customization of IE systems are an active area of research that promises to make the CSSW approach even more appealing and feasible as the "next-generation" Web takes shape.

ACKNOWLEDGMENTS
I thank Melinda Jackson and the rest of the DMC staff for providing helpful comments on previous versions of this paper.

REFERENCES
1. [Berners-Lee, 2001] Berners-Lee, T., Hendler, J., Lassila, O. "The Semantic Web." Scientific American, May 2001.
2. [DAML, 2003] http://www.daml.org
3. [JTP, 2003] http://www.ksl.stanford.edu/software/JTP/
4. [KAON, 2003] http://kaon.semanticweb.org
5. [Lockheed Martin Management and Data Systems, 2001-2003] http://mds.external.lmco.com/mds/products/gims/aero/docs/AeroText-V2.5-Whitepaper-April-2003.pdf
6. [Mooney, 1999] Mooney, R., Califf, M. "Relational Learning of Pattern-Match Rules for Information Extraction." In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, FL, pp. 328-334, July 1999.
7. [OWL, 2003] http://www.w3.org/2001/sw/WebOnt/
8. [RDF, 2003] http://www.w3.org/RDF/
9. [SUMO, 2001] Niles, I., and Pease, A. "Towards a Standard Upper Ontology." In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Chris Welty and Barry Smith, eds., Ogunquit, Maine, October 17-19, 2001.
10. [Stevens, 2003] Stevens, R., Wroe, C., Bechhofer, S., Lord, P., Rector, A., Goble, C. "Building ontologies in DAML+OIL." Comparative and Functional Genomics 4(1), January/February 2003, pp. 133-141.