Automated OWL Annotation Assisted by a Large Knowledge Base

              Michael Witbrock, Kathy Panton, Stephen L. Reed, Dave Schneider, Bjørn Aldag,
                                  Mike Reimers and Stefano Bertolo

                         {witbrock, panton, sreed, daves, aldag, mreimers, bertolo}@cyc.com


       Abstract. Widespread adoption of the semantic web depends critically on lowering the “barriers to
       entry” facing document producers. We describe a system that applies automatic partial parsing of web
       pages into the representations of the large ResearchCyc ontology, combines this with convenient mixed
       initiative knowledge capture, and produces an OWL annotated document as output. Semantic web
       publishers can then use this document as a starting point for more elaborate, manual annotation.


Introduction

The rapid adoption of the World Wide Web, in its initial form, was driven in part by the ease with which
content could be produced; although specialized tools and techniques quickly evolved, web pages could be
produced, reasonably conveniently, by anyone with a text editor and an hour to read a description of the
available HTML tags. Semantic markup in languages like OWL has the potential to vastly increase the
utility of web content, but describing the logical content of a document is far from straightforward, even
without the requirement that that description be done in an XML-based markup language.
   In addition to the simple tools and syntax required for HTML authoring, the ready availability of
example pages with mark-up produced by others further flattened the already shallow learning curve for
Web authoring. Providing such examples for the semantic web would have similar utility but is not as
obviously straightforward. While the syntax of OWL is consistent, the conceptual tag set to be used is
highly dependent on the domain of the document, and, even within a domain, is set only by convention.
Rather than require prospective authors to identify the appropriate vocabulary, complex XML syntax, and
relevant set of example documents before semantic annotation can begin, it seems worthwhile to provide a
tool that, while imperfect, can make an initial, automatic pass at annotating a document. From that rough
annotation, it should be more straightforward for human content providers to incrementally improve the
representation of page content as they increase their understanding of relatively narrow components of the
relevant ontology and OWL syntax.
   In this paper, a system, based on Cyc, is described that can automatically produce initial OWL
annotations of arbitrary text documents. This is done in the vocabulary of the OpenCyc scaffolding
ontology, which is freely available1 and freely usable. The annotation process takes advantage of existing
Cyc system components for automated text analysis and guided knowledge entry, as well as newly-created
components for interactive disambiguation using natural language and reduction of internal CycL
representations to the OWL languages. Interactive components of the process are optional, and annotation
can proceed wholly automatically.


Document Analysis

The Cyc OWL annotation system operates in two phases. First, the page is read and as much of the content
as possible is represented in the CycL language. Second, the OWL export component of Cyc, developed as
part of the DARPA DAML project, is used to generate the appropriate annotation file.


1 http://www.cyc.com/2004/06/04/cyc


                                                             71
   Figure 1: The Cyc Document Annotator assists organizations and individuals interested in
   adapting their document production processes to the Semantic Web. By providing an
   approximate OWL annotation of an existing document, the system simplifies the initial learning
   curve, allowing editing to improve the annotation to replace the complex task of manually
   annotating a document from scratch. Interoperability is supported by annotation using the more
   than 60,000 freely usable terms in the OpenCyc scaffolding ontology.


   The OWL export component of the system is described in more detail later, but the core of the
annotation system depends on Cyc’s imperfect but growing ability to interpret free text into a detailed
logical representation in CycL. This is provided by combined application of Cyc’s natural language
processing subsystem, disambiguation dialogue, and the Factivore, a highly usable knowledge-driven
knowledge acquisition interface.


                                                      72
Parsing into the CycL Logical Language
CycL is a fully higher order and modal knowledge representation formalism2, which makes it suitable for
representing a wide range of natural language constructions. Cyc also allows the partition of knowledge
into separate ‘microtheories' arranged in a subsumption hierarchy which enables the consistent
management of contradictory information and the representation of context (e.g. statement of background
assumptions). The strategy followed by our annotation systems is to parse input documents, rendering as
much as currently possible into a CycL representation, to provide users with the opportunity, but not the
necessity, to interactively disambiguate and elaborate the CycL representation, and then to project the
resulting assertions onto the subset of representations allowed by the OWL language, yielding an XML
annotation file.

Extracting the Text Content of target web pages
We use two packages from the Apache Project (CyberNeko,3 and Xerces4) to convert an HTML document
into a Document Object Model (DOM) as a Java Object. The application traverses the DOM tree,
extracting the web page title, meta-description, and text leaf nodes. This will provide us with the ability, in
future versions of the annotator, to tailor its focus onto salient content and cause it to ignore distractions
(e.g. sidebars, menu items, advertisements, navigation links, and so forth often found with news articles).
This will be a substantial improvement over simple web page text extractors, which apply the simple
algorithm of stripping out HTML tags, thereby omitting most cues to salience and noise.

Chunking Input into Sentences, Phrases and Words
The second stage of the parsing pipeline populates a “TextDocument” object with sentences, phrases and
words obtained from the web page’s DOM. Currently, we use the LINGUA sentence splitting module5 to
extract whole sentences from text strings, and the remaining text fragments are then organized as phases
and words. All our web page annotation experiments to date have been conducted on English language
documents, but, since the character set used for parsing is UTF-8, it should in principle be straightforward
to apply this step of processing to other languages. Full processing of other languages will depend on
extending the Cyc Lexicon beyond its rudimentary coverage outside English, and extending the
segmentation and syntactic parsing infrastructure to handle a wider range of syntactic phenomena.

Natural Language Knowledge and English Parsing
Natural language processing in Cyc is supported by the Cyc Lexicon, an increasingly comprehensive
collection of syntactic and semantic knowledge about English, and a framework in which knowledge about
other languages can be embedded. The table below gives some indication of the current coverage.

                                                               Noun           Verb     Adjective
                  CycL terms representing Lexemes             15450           4454         4716
                  Denotations                                 14442           1838         1640
                  Semantic Translation Patterns                 464           3178         1787

     CycL terms representing lexemes include Burger-TheWord and Of-TheWord, representing the
English words “burger” and “of”, respectively; denotations connect word senses to KB concepts. For
example,
   (denotation Burger-TheWord CountNoun 0 HamburgerSandwich)
      means that “burger”, when used in its first CountNoun sense, refers to a hamburger sandwich;


2The Cyc inference engine however currently only supports the first order fragment and some of the second order and

  modal extensions.
3 http://www.apache.org/~andyc/neko/doc/html/
4 http://xml.apache.org/xerces-j/
5 http://people.brandeis.edu/~matthewg/cpan-lingua.html


                                                             73
    (verbSemTrans Venerate-TheWord 0 TransitiveNPCompFrame
              (feelsTowardsObject :SUBJECT :OBJECT Reverence highAmountOf)),
means that the word “venerate”, when used as the verb in a transitive verb frame taking an NP complement,
should be understood in the Cyc logical language, CycL, as meaning that the agent denoted by the subject
of the sentence feels a high degree of reverence towards the thing denoted by the object of the sentence.
Similarly,
(nounSemTrans Bride-TheWord 0 GenitiveFrame
           (and
                  (isa :NOUN FemaleHuman)
                  (isa ?W WeddingEvent-Entire)
                  (eventHonors ?W :NOUN)
                  (eventHonors ?W :POSSESSOR)))
tells Cyc that, for example, “Frankenstein’s Bride” or “the bride of Frankenstein” should be interpreted as
meaning that the bride is a female person, and that some wedding happened that honored both the bride and
Frankenstein.
      The third stage of the document annotation pipeline iterates over the sentences and phrases in the
TextDocument object. Phrases are treated as whole sentences on the first pass. Each sentence is parsed by
Cyc’s natural language parsing system, resulting in a list of CycL logical sentences. If the list is empty,
then Cyc could not determine a semantic interpretation that covered the entire sentence, and if more than
one CycL sentence is returned, then Cyc found one or more ambiguous concepts in the input natural
language sentence. Typical performance for a parsing run on a news article is:
                Total number of phrase parses attempted                                  210
                Number of phrases for which a CycL translation was found                 79
                Average time to translate                                                5 seconds
On the second pass over the TextDocument object, Cyc’s word denotation parser processes the
uninterpreted sentences, returning Cyc terms for lexically mapped words and phrases.

Parsing into Semantic Representations
Although a great deal of progress has been made over the past decade in the development of efficient
syntactic parsers for natural languages, semantic parsers, which attempt to reach a detailed understanding
of the NL input, have been less well studied and less successful. This may be due in part to the lack of a
suitable target representation, for which the existence of PropBank6 [Gildea and Palmer 2002], and, more
recently, the availably of OpenCyc and ResearchCyc 7 may offer some relief. The lack may also be due to
the difficulty of the process, since unlike syntactic parsing, semantic interpretation depends critically on
solutions to difficult linguistic problems, including anaphor resolution, disambiguation, interpretation of
metaphors, preposition interpretation, and quantification. It is therefore worth spending a little time to
explain the progress we have made during our research and how we have deployed it within this
application.
      Suppose one is faced with a sentence like “Bill Clinton bought a house in New York”. The first step
in interpretation is to perform a syntactic parse targeting the TreeBank tag set. For this prototype we made
use of the parser developed by Eugene Charniak at Brown University [Charniak 2000]8. This parser yields:
   [S [NP [NNP “Bill”] [NNP “Clinton”]]
        [VP [VBD “bought”]
               [NP [NP [DT “a”] [NN “house”]]
                      [PP [IN “in”]
                             [NP [NNP “New”] [NNP “York”]]]]]

6 http://www.cis.upenn.edu/~ace/
7 Open Cyc is a completely unrestricted subset of the Cyc KB and inference system, and includes a scaffolding

   taxonomy of approximately 60,000 terms that ensure interoperability with other Cyc KB versions. Research Cyc
   includes all of OpenCyc together with a large number of assertions and rules concerning the scaffolding terms; this
   high utility version of Cyc is currently in beta and will be available under a research purposes license.
8 The system, however, is not dependent on the use of this parser; in a current research project our team is collaborating

   with Stanford University in an effort to achieve semantic parses of English and Chinese using the Stanford Parser
   (Klein and Manning 2003). We are also exploring the use of the CMU Link parser [Sleator and Temperley 1993].


                                                                74
From this parse, the system identifies the main verb, “bought” in this case, and finds its denotation in the
KB (#$Buying) and the appropriate semantic translation pattern (SemTrans):
   (and (isa :ACTION Buying)
           (buyer :ACTION :SUBJECT)
           (objectPaidFor :ACTION :OBJECT))
This is used, in turn to understand the argument structure of the syntactic parse. The syntactic subject,
    [NP [NNP “Bill”] [NNP “Clinton”]],
and the syntactic object,
   [NP [NP [DT “a”] [NN “house”]]
                      [PP [IN “in”]
                             [NP [NNP “New”] [NNP “York”]]]
are isolated for the purposes of completing the retrieved SemTrans, and interpreted using the Cycorp-
developed recursive noun phrase parser, for the base NPs (“Bill Clinton”, “house”, “New York”9 in this
case10), combined and compositional parsing of modifiers (“in New York”, in this case), producing the
CycL interpretations #$BillClinton and
   (and
           (isa ?HOUSE House-Modern)
           (in-Underspecified ?HOUSE                     NewYork-State)).
Substituting these into the SemTrans, and replacing the remaining role key ‘:ACTION’ with an
existentially qualified variable, yields the final CycL interpretation:
   (thereExists ?ACTION
     (thereExists ?HOUSE
       (and (isa ?ACTION Buying)
           (buyer ?ACTION BillClinton)
           (objectPaidFor ?ACTION ?HOUSE)
           (isa ?HOUSE House-Modern)
           (in-Underspecified ?HOUSE NewYork-State))))
     The rendering of the prepositional phrase as “in-Underspecified” represents a residual ambiguity
which future versions of the system will attempt to resolve using background knowledge and discourse
context11. The current system typically produces translations that render much of the sense of input
sentences, but that omit some of the information they contain.

User Interaction in Annotating Partially Translated Documents.
To help ameliorate some of the imperfections in the semantic translation process, the system provides the
opportunity, but not the necessity, for users to interact with the current interpretation of a document,
resolving ambiguities and adding additional information. Analyzed documents can be displayed in an
interface that maintains correspondences between the text of the original document and the current logical
interpretation. Fully interpreted terms in the document are highlighted in green; clicking on them takes the
user to an appropriate “Factivore” knowledge acquisition form, allowing rapid knowledge entry in natural
language. While some of the most commonly used forms have had their representation in the KB hand-
crafted by knowledge engineers, the vast majority of forms are produced automatically by the system, using
background knowledge and inductive inference over known cases. In experiments performed in the course


9 Another possible interpretation is New York City.   For this example, we assume a user has correctly disambiguated.
10 In addition to being able to map single and multi-word tokens into CycL terms – e.g. "Bill Clinton" to #$BillClinton

   – the NP parser can interpret a wide variety of compound NPs, e.g. "Bronze age farmers" are farmers that were active
   during the Bronze age and "black leather jackets" are jackets made of leather and black in color.
11 To the predicate #$objectFoundInLocation, in this case.


                                                                75
of entering knowledge about terrorists and their activities, lightly trained domain experts have achieved
knowledge entry rates exceeding thirty facts12 per hour using this interactive interface.
The other interactions available to users are selection from amongst interpretation alternatives (via menus
rendered by the Natural Language Generation system) for terms highlighted in orange, and obtaining a
complete English paraphrase of the current logical interpretation of a sentence, before it is asserted.


  Fig. 2: After the system has analyzed a document, it can be made available to the user for further
  annotation. Terms recognized within sentences are marked in green, if fully interpreted, and orange, if
  ambiguous to the system. Users can chose to resolve ambiguities in pull down menus, forcing
  reinterpretation of the affected sentence, or can leave the ambiguity intact. The current interpretation can
  be disclosed to the user by automatically paraphrasing it back into English, as shown in the pop up. More
  information can be provided about terms in the document, at the users whim, by accessing “Factivore”
  knowledge entry forms, which provide a rapid, NL mechanism for assertion into the knowledge base.


Asserting CycL Sentences into a Unique Cyc Microtheory
The fourth stage of the parsing pipeline asserts the CycL sentences and Cyc terms into a unique Cyc
microtheory (context) within the knowledge base. The microtheory represents the propositional content of
the target web page, and it is placed within the Cyc microtheory inheritance lattice so that commonsense
assumptions about the target web page document are made explicit within Cyc. For example, a current

12 A fact is a single assertion made into the Cyc KB.
                                                    Facts can express simple concepts (such as “George W. Bush is a
  person”) or more complicated concepts (such as “something is consumed during every eating event”).


                                                             76
news article microtheory inherits rules and facts from Cyc’s CurrentWorldDataCollectorMt. Existential
variables are replaced by concrete terms during the CycL sentence assertion. Below is an assertion as
parsed from the text “Bill Clinton bought a house in New York”:
  (thereExists ?ACTION
    (thereExists ?HOUSE
      (and (isa ?ACTION Buying)
           (buyer ?ACTION BillClinton)
           (objectPaidFor ?ACTION ?HOUSE)
           (isa ?HOUSE House-Modern)
           (in-Underspecified ?HOUSE NewYork-State))))
Replacing the existentially quantified variables with their skolem equivalents in the formula yields:
   (and (isa Buying21 Buying)
       (buyer Buying21 BillClinton)
       (objectPaidFor Buying21 House-Modern22)
       (isa House-Modern22 House-Modern)
       (in-Underspecified House-Modern22 NewYork-State))))

     “Government officials believe the men were planning an
     attack in the lead-up to Spain 's general election.”
     PATH:HTML[2]/BODY[1]/TABLE[3]/TR[1]/TD[3]/TABLE[2]/
            TR[2]/TD[1]/FONT[1]/P[2]/
        (thereExists :INF-COMP, ?PLANNING0397, ?MEN0411, ?ATTACK0413,
                      ?LEADUP0415, ?ELECTION0407, ?SPAIN0416,
                      ?GOVERNMENT-OFFICIALS040
           (and
              (isa ?GOVERNMENT-OFFICIALS0409 PublicOfficial)
               (beliefs ?GOVERNMENT-OFFICIALS0409
                     (and
                       (and
                        (equals ?SPAIN0416 Spain)
                        (isa ?ELECTION0407 Election)
                        (to-UnderspecifiedLocation ?LEADUP0415 ?ELECTION0407
                        (in-UnderspecifiedContainer ?ATTACK0413 ?LEADUP0415)
                        (isa ?ATTACK0413 AttackOnObject)
                        (isa ?MEN0411 AdultMaleHuman)
                        (and
                          (isa ?PLANNING MakingAPlan)
                          (performedBy ?PLANNING0397 ?MEN0411)
                          (isa ?PLAN PlanSpecificationMicrotheory
                          (scheduledEvents ?PLAN :INF-COMP)
      Paraphrase: there(resultMt
                            is some?PLANNING0397
                                    :INF-COMP such?PLAN))))))
                                                      that
       some public official believes some other individual ?ELECTION3835 is an election,
       some purposeful composite physical and mental activity is an attack,
       someone ?MEN3839 is a man, Spain has ?ELECTION3835, in some sense,
       ?ELECTION3835 is the location of some other individual ?LEADUP3843,
       that purposeful composite physical and mental activity is in ?LEADUP3843,
       and some other action ?PLANNING3825 is a planning, some plan is a plan,
       ?MEN3839 deliberately performs ?PLANNING3825, that plan for :INF-COMP,
       and the plan is the result of ?PLANNING3825


  Figure 3: The result of translating one sentence of a document into CycL. These translations are
  often quite complex, and, as in this case, imperfect, but provide a good basis for editing the OWL
  representation into an accurate reflection of document semantics. The paraphrase is the result of
  automatic conversion of the CycL translation back into English, and is given as an aid to reading.
  Paraphrase into English is not present in the Cyc Annotator output.


                                                          77
Exporting CycL into OWL
The fifth and final stage of the web page annotation pipeline exports the document microtheory contents
into an OWL XML document. All the built-in OWL Classes and properties have CycL equivalents. Here
are sample rules for exporting some CycL predicates that happen to have built-in OWL definitions:
   #$disjointWith --> owl:disjointWith
   #$equals --> owl:sameAs
   #$genlPreds --> rdfs:subPropertyOf
   #$genls --> rdfs:subClassOf
   #$isa --> rdf:type
   #$TransitiveBinaryPredicate --> owl:TransitiveProperty
The sample CycL formula results in the following OWL RDF triples, with boldface to indicate the
transformation of CycL predicates that are defined in Cyc’s OWL ontology:
   <Buying rdf:ID=”Buying21”>
       <buyer rdf:resource=”#BillClinton”>
       <objectPaidFor rdf:resource=”#House-Modern22”>
   </Buying21>
   <House-Modern>
       <in-Underspecified rdf:resource=”#NewYork-State”>
   </House-Modern>
A portion of the OWL output for a particular news story is included in Figure 1, above. The primary
difficulty in the OWL export process was the expressiveness limitation of OWL with respect to CycL. We
overcame this by ensuring that the CycL assertions were ground atomic formulae, without functional terms
and using only binary predicates. For cases such as rules, where the representation is not amenable to
OWL export, we omit them from the OWL markup.

Conclusions and Future Work
The Cyc OWL annotator seeks to lower the barriers to the acceptance and growth of the semantic web by
using the Cyc system to produce fully automatic, partial OWL markup for unrestricted text documents.
This is done by applying lexical information and background knowledge from the Cyc knowledge base,
subsystems for text analysis, optional interactive knowledge acquisition and disambiguation, isolation of
incomplete knowledge within a microtheory structure, and down-projection of CycL logical representations
into OWL.
      One of the central thrusts of our research is improving the process of translation from unrestricted
natural language text into full logical representations; over the next year we expect substantial
improvements in the quality of English interpretation, and initial results for Chinese interpretation; these
improvements should directly improve the resulting OWL annotations.
      An independent research direction involves adding the ability for the system to optionally produce
OWL extended with RuleML and other proposed extensions to the language of the semantic web,
improving the quality of the output produced by down-projection from CycL. These extensions should be
straightforward to produce once the relevant standards are adopted.
      This work was supported by DARPA’s DAML program, and used additional technology supported by
ARDA’s AQUAINT program and a Phase I SBIR grant.

References
Burns, Kathy J. and Anthony R. Davis. 1999. “Building and Maintaining a Semantically Adequate Lexicon Using
   CYC” in Evelyne Viegas, Breadth and Depth of Semantic Lexicons. Kluwer: Dordrecht.
Charniak, Eugene. 2000. “A Maximum-Entropy-Inspired Parser”. Proceedings of the First Annual Meeting of the
   North American Chapter of the Association for Computational Linguistics (NAACL'2000), Seattle, Washington.
Gildea, Daniel and Martha Palmer. 2002. “The Necessity of Parsing for Predicate Argument Recognition” In
   Proceedings of ACL 2002, Philadelphia, PA.
Klein, Dan and Christopher D Manning. 2003. “A* Parsing: Fast Exact Viterbi Parse Selection.” HLT-NAACL 2003,
   Edmonton, Canada.
Sleator, Daniel and Davy Temperley. 1993. “Parsing English with a Link Grammar”. Third International Workshop on
   Parsing Technologies, Tilburg, The Netherlands and Durbuy, Belgium.


                                                           78