Learning to Generate Semantic Annotation for Domain Specific Sentences

Jianming Li, Lei Zhang, Yong Yu
Department of Computer Science and Engineering, Shanghai JiaoTong University, Shanghai, 200030, P.R.China
Jianming119@sina.com, tozhanglei@hotmail.com, yyu@mail.sjtu.edu.cn

ABSTRACT

Seas of web pages on the Internet contain free text in natural language that can only be read by human beings. To be understandable for machines, these pages should be annotated with semantic markup. Manually annotating large numbers of pages is arduous work, and this has made automatic semantic annotation an urgent challenge. In this paper we propose a machine-learning based automatic annotation approach. The approach can be trained for different domains and requires nearly no manual rules. The annotation is at the sentence level and is in RDF format. We adopt a dependency grammar, Link Grammar [2], for this purpose. The ALPHA system, a prototype of this approach, has been developed with IBM China Research Lab. We expect that many improvements to the approach are possible and that our work may be selectively adopted or enhanced.

1 Introduction

There are seas of web pages on the Internet, and nearly all of them contain free text in natural language that can only be read by human beings. Annotating these pages with semantic markup is one promising way to make them understandable for machines. Unfortunately, automatic semantic annotation of the natural language sentences in these pages is a daunting task, and we are often forced to do it manually or semi-automatically using handwritten rules. In this paper we propose a machine-learning (ML) based automatic semantic annotation approach that can be trained for different domains and requires almost no manual rules. The annotation resulting from this approach lies at the sentence level, i.e., we annotate each sentence, or the prime sentences, in a web page. The approach stems from our previous research on semantic analysis of natural language sentences using Conceptual Graphs (CG).

Free texts on the Internet carry various information from diverse domains. The method proposed in this paper is for domain specific sentences, that is, sentences that occur in a specific application domain. Though the sentences are limited to one domain, the method itself is domain independent and the system can be trained for various domains. Domain specific sentences are usually very stylized in the words, phrases, grammar and semantics they employ, which leads to strongly patterned text for which a machine-learning based approach is effective. Our approach is independent of any particular ML algorithm; in the prototype ALPHA system we employed instance-based learning. Link Grammar is first used to obtain the syntactic structures of sentences, and the learning process then learns to map the syntactic structures to semantic structures, namely RDF graphs. WordNet [7] and the domain relation hierarchy are used as the domain ontology in the whole semantic analysis and representation process. Preliminary results gained from the ALPHA system demonstrate the feasibility of the approach.

The paper is organized as follows. Section 1.1 explains the concept of "domain specific sentences" used in this paper. Section 1.2 briefly shows what the resulting RDF looks like. Section 1.3 explains the reasons for adopting Link Grammar. Section 2 gives an overview of the whole approach. Section 3 presents the detailed process that generates an RDF graph from domain specific sentences. Section 4 discusses the results of the ALPHA system. Section 5 concludes our work by comparing it with related work.

1.1 Domain Specific Sentences

Domain specific sentences are sentences that occur frequently in the text of one certain application domain but scarcely in others. They are assumed to have the following characteristics:

I. the vocabulary set is limited
II. word usage has patterns
III. semantic ambiguities are rare
IV. terms and jargon of the domain appear frequently

The notion of sublanguage [3,4] has been well discussed in the last decade, and domain specific sentences can be seen as sentences in a domain sublanguage.
As previous studies have shown, a common vocabulary set and specific patterns of word usage can be identified in a domain sublanguage. These results provide the grounds for assuming the above characteristics of domain specific sentences. In the rest of this paper we show how characteristics I to III are exploited in our work; terms and jargon (characteristic IV) are dealt with in the following section by adding them to the Link Grammar dictionary.

1.2 RDF Graph

After the annotation, sentences from web pages are marked up with RDF statements. We illustrate the representation with the example sentence "I go to Shanghai". The corresponding RDF statement will be like the following (the namespace declarations are elided and the serialization is schematic):

    <ont:Concept rdf:ID="I">
      <ont:WordNetSenseIndex>WN16-2-012345</ont:WordNetSenseIndex>
    </ont:Concept>
    <ont:Concept rdf:ID="go">
      <ont:WordNetSenseIndex>WN16-2-012345</ont:WordNetSenseIndex>
      <ont:AGNT rdf:resource="#I"/>
      <ont:DEST rdf:resource="#shanghai"/>
    </ont:Concept>
    <ont:Concept rdf:ID="shanghai">
      ...
    </ont:Concept>

Class "Concept" represents a concept in the sentence. In the current implementation we use WordNet [7] as the experimental concept ontology, and the property "WordNetSenseIndex" uniquely identifies a word sense (concept) in the WordNet database. Properties such as "AGNT" (agent) and "DEST" (destination) are sub-properties derived from a general property "Relation"; all the sub-properties of "Relation" are organized as a hierarchy and thus form the relation ontology [18].

The RDF statement can also be diagrammed as a directed labeled graph with nodes and arcs, as depicted in Fig.1.

    [I] <-AGNT- [GO] -DEST-> [SHANGHAI]

    Fig. 1. RDF graph for the example sentence "I go to Shanghai"

Since the diagram is simpler and easier to understand, we will use this diagram form, which we call an RDF graph, to represent RDF statements in the rest of the paper instead of writing out long RDF statements.
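Since the rest of the paper manipulates RDF graphs as sets of concept nodes and labeled relation arcs, it may help to fix this structure concretely. The following Python sketch is only illustrative: the ALPHA system itself is written in C, and all class and method names here are ours, not ALPHA's.

    from dataclasses import dataclass, field

    @dataclass
    class ConceptNode:
        """A node of the RDF graph: a concept from the concept ontology."""
        label: str               # e.g. "GO"
        sense_index: str = ""    # a WordNetSenseIndex key, e.g. "WN16-2-012345"

    @dataclass
    class RDFGraph:
        """A directed labeled graph: concept nodes plus 'Relation' arcs."""
        concepts: list = field(default_factory=list)
        relations: list = field(default_factory=list)   # (src, label, dst) triples

        def add_concept(self, node):
            self.concepts.append(node)
            return node

        def relate(self, src, label, dst):
            """Add an arc labeled with a sub-property of 'Relation'."""
            self.relations.append((src, label, dst))

    # The RDF graph of Fig. 1:
    g = RDFGraph()
    i = g.add_concept(ConceptNode("I", "WN16-2-012345"))
    go = g.add_concept(ConceptNode("GO", "WN16-2-012345"))
    shanghai = g.add_concept(ConceptNode("SHANGHAI"))
    g.relate(go, "AGNT", i)         # the agent of the going
    g.relate(go, "DEST", shanghai)  # the destination of the going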
1.3 Link Grammar

Link Grammar is the dependency grammar system we employ in our work. For the same sentence "I go to Shanghai", the Link Grammar parse result is shown at the top of Fig.2. The labeled arcs between words are called links, and the labels are the types of the links. For example, the "Sp*I" between "I" and "go" represents the type of link between "I" and a plural verb form, "MVp" connects a verb to its modifying prepositional phrase, and "Js" connects a preposition to its object. In Link Grammar there is a finite set of such link types.

Another important feature of Link Grammar is that the grammar is distributed among the words [2]: there can be a separate grammar definition (linking requirement) for each word, stating what types of links the word can attach and how they are attached. The CMU link parser [5] has a dictionary of about 60,000 words together with their linking requirements. Thus, expressing the grammar of new words, or of words that have irregular usage, is relatively easy: we simply define the grammar for these words and add them to the dictionary. This is how we deal with the terms and jargon of a domain in our approach. Because the vocabulary set of a domain is limited (see section 1.1), we can add all unknown words (including terms and jargon) to the current dictionary of Link Grammar with an affordable amount of work.

Although the CMU link parser still has difficulties in parsing complex syntactic structures in real commercial environments, it is now ready for use in relatively large prototypes. Applying Link Grammar to languages other than English (e.g., Chinese [19]) is also possible.

The most important reason for adopting Link Grammar in our work is the structural similarity between Link Grammar parse results and RDF graphs. Fig.2 shows this similarity by comparing the Link Grammar parse result, the typical parse tree of a constituent grammar, and the RDF graph for the same example sentence:

    The link structure:  I --Sp*I-- go --MVp-- to --Js-- Shanghai
    The grammar tree:    S -> NP VP; NP -> PRO "I"; VP -> V "go" PP; PP -> PREP "to" N "Shanghai"
    The RDF graph:       [I] <-AGNT- [GO] -DEST-> [SHANGHAI]

    Fig. 2. A link structure is more like an RDF graph

In fact, this similarity comes from the common foundation of RDF graphs and Link Grammar. An RDF graph consists of concepts and relations, where the relations denote the semantic associations between the concepts. Similarly, a link structure consists of words and links, where the links directly connect syntactically and semantically related words [2]. Open words [17] (such as nouns, adjectives and verbs) access concepts from the catalog of conceptual types, while closed words [17] (such as prepositions) and links help clarify the semantic relationships between the concepts. Based on this similarity, and restricted to a specific domain, we propose to automatically generate annotation by learning the mapping from link structures to RDF graphs. Since semantic ambiguities are rare in domain specific sentences (see section 1.1), it is relatively easy to perform these mapping operations (the process of semantic analysis).

2 Overview of the approach

Our approach to automatic page annotation is a process consisting of two phases, the training phase and the generating phase, as shown in Fig.3.

    Fig. 3. Overview of the approach. In the training phase, sentences from the domain specific sentence corpora are parsed by the link parser into link structures, which a domain knowledge expert transfers into RDF graphs through mapping operations in the training interface, consulting the domain ontology; the vector generator encodes these operations as training vectors for the machine learning engine. In the generating phase, the link parser produces the link structure of a new sentence and the RDF generator maps it to RDF, sending context vectors to the ML engine for classification.

The first step of both phases is to invoke Link Grammar and parse the sentence into its link structure, which is then mapped to RDF by different means in the two phases.

In the training phase, domain experts go through a three-operation process to transfer the link structure into an RDF graph manually, based on their domain knowledge. Each operation maps a certain part of the syntactic structure to its corresponding semantic representation according to the syntactic and semantic context. The concepts, schemata (a schema is a set of RDF graphs describing background information in a domain) and relations contained in the semantic representation are selected from the domain ontology.

What the training phase does is preparation: before the system can learn to do the mapping in the generating phase, we cast the mapping as a machine learning problem. Most studied tasks in machine learning are to infer a function that classifies a feature vector into one of a finite set of categories [6]. We therefore translate each mapping operation into a classification operation by encoding the operation as a category and encoding the context in which the operation is performed as a feature vector. We call this feature vector a context vector, since it encodes the context information of an operation. The vector generator in the lower left corner of Fig.3 is the component that executes this task.
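To make the encoding concrete, the sketch below shows how one manual mapping operation could become a (context vector, category) training pair. The function and the exact vector layout are ours, chosen for illustration; the layout follows the example given in section 3.1.

    # Illustrative sketch of the vector generator's task: each mapping
    # operation performed by the expert becomes a feature ("context")
    # vector describing the syntactic context of the operation, paired
    # with a category encoding its result (a concept, schema or relation ID).

    def encode_operation(context_features, result_id):
        """Encode one expert operation as a (context vector, category) pair."""
        return tuple(context_features), result_id

    # Word-conceptualization of the open word "polo" (see section 3.1):
    vector, category = encode_operation(
        ["polo", "NN", "Dmu", "Mp"],   # word, POS tag, innermost left/right links
        "WN16-2-330153",               # ID of the WordNet sense chosen by the expert
    )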
After sufficient training vectors and categories have been obtained in the training phase, the system can enter the generating phase. The RDF generator, the main part of the generating phase, implements the following algorithm with the help of the ML engine and Link Grammar, given a sentence from the object domain:

 1   get the link structure for the sentence from the link parser
 2   generate an empty RDF graph
 3   for (i = 1 to 3) {   // perform the three kinds of operations
 4       generate all possible context vectors from the link
         structure for the i-th kind of operation
 5       for (every context vector) {
 6           if (an operation is needed for this vector) {
 7               classify the vector using the ML engine
 8               decode the classified category as an operation
 9               perform the operation on the link structure and
10               modify the RDF graph according to the operation
11               result (using concepts, schemata and relations
12               from the domain ontology)
13           }
14       }
15   }
16   do integration on the RDF graph
17   output the final RDF annotation for the sentence

Algorithm 1. The algorithm of the RDF generator

Our approach is independent of the specific ML algorithm used. In the ALPHA system we adopt IBL (instance-based learning) for the ML engine, because IBL makes it easy to determine whether an operation is needed for an arbitrary vector in the above algorithm (line 6). IBL can return a distance value along with the classification result; if the distance is too large, it can be concluded that no operation is needed for the vector, because the vector is far from similar to the existing training vectors and may be deemed noise. For other learning methods this determination may not be so easily achieved.
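Restated in executable form, Algorithm 1 might look like the following sketch. A 1-nearest-neighbour classifier with a distance cutoff stands in for the IBL engine, the graph type reuses the RDFGraph sketch from section 1.2, and all names are illustrative rather than taken from the ALPHA code.

    def distance(u, v):
        """Toy distance between two equal-length context vectors."""
        return sum(a != b for a, b in zip(u, v))

    def classify(vector, instances, cutoff=2):
        """1-nearest-neighbour: return the nearest category, or None when
        the nearest training instance is too far away (line 6: the vector
        is deemed noise and no operation is needed)."""
        best, category = min((distance(vector, v), c) for v, c in instances)
        return category if best <= cutoff else None

    def generate_rdf(sentence, link_parser, vectors_for, perform, training, integrate):
        link_structure = link_parser(sentence)        # line 1
        graph = RDFGraph()                            # line 2
        for kind in (1, 2, 3):                        # lines 3-15
            for vector in vectors_for(kind, link_structure, graph):   # line 4
                category = classify(vector, training[kind])           # lines 6-7
                if category is not None:
                    # lines 8-12: decode the category and modify the graph,
                    # drawing concepts, schemata and relations from the ontology
                    perform(kind, graph, link_structure, vector, category)
        integrate(graph)                              # line 16
        return graph                                  # line 17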
In the following section we explain Algorithm 1 and the three operations using the example sentence "The polo with an edge is refined enough for work", which is excerpted from a corpus of clothes descriptions collected from many clothes shops on the Web. In the sentence, "polo" is a brand and represents a certain kind of shirt, and "edge" actually means collar. The link structure for this sentence is shown in Fig.4.

    Fig. 4. The link structure for the example sentence (its links carry the types Ss, Js, MVp, Dmu, Mp, Ds, Pv, MVa and Jp)

3 Learning to generate RDF

In this section we introduce the three operations that map a link structure to RDF: word-conceptualization, link-folding and relationalization. These three kinds of operations must be performed in exactly this order in both the training phase and the generating phase, because a later operation may use information generated by previous operations. Section 3.4 explains the integration of the RDF graph (line 16 of Algorithm 1).

3.1 Word-Conceptualization

I. Function

Word-conceptualization is the first operation to be performed. Its function is to annotate open words as concepts in the sentence, forming the skeleton of the initially empty RDF graph, and to mark closed words for further operations. This operation can be seen as a word sense disambiguation operation.

II. Training

In the training phase, domain experts select all the open words in the link structure one by one. Once an open word is selected, the training interface provides the expert with a list of possible concepts or schemata retrieved from the domain ontology, and the expert chooses the appropriate one from the list.

This operation is then encoded by the vector generator into a context vector and its category. For example, the context vector for the open word "polo" in the example sentence may be <polo, NN, Dmu, Mp>, in which "NN" is the POS (part-of-speech) tag and "Dmu" and "Mp" are the innermost left and right link types of "polo" (see Fig.4). All the context information is obtained from the link structure. (This vector is just an example; for brevity, we are not trying to make the vector encoding perfect in this paper. What context information is encoded into the vector is a separate problem, isolated in the vector generator component; in the current implementation a configuration file for the vector generator addresses the issue. We can also augment the link parser with a POS tagger so that accurate POS tag information is added to the link structure and obtained from it later.)

The category for the context vector is encoded as the result of the operation, i.e., the ID of the selected concept or schema in the domain ontology. The encoding is something like "WN16-2-330153", which can later be used as a key to retrieve the concept (in WordNet terminology, the word sense) from the WordNet database. Since WordNet is not specific to any domain, some words in a certain domain may not exactly match any sense in the list. For those words the experts are asked to choose the most similar sense, rather than add a new sense to WordNet, so as to preserve the WordNet hierarchy for further research.

III. Generating

In the generating phase, generating all possible context vectors for this operation (line 4 of Algorithm 1) means generating one context vector for each open word in the link structure of the sentence. Each generated context vector is sent to the ML engine for classification, and the returned category is an encoding of a concept or schema ID. In line 9 of Algorithm 1, the RDF generator retrieves the concept or schema from the domain ontology according to the decoded ID and creates a concept node in the RDF graph.

Because word usage has patterns in domain specific sentences, we expect similar context vectors to appear for a given open word on a specific word sense. Based on these similar context vectors, we expect the ML engine to return the correct classification with high probability, since semantic ambiguity is also rare in domain specific sentences.

After this step, all the concept nodes of the RDF graph should have been created. The RDF graph for the example sentence is shown in Fig.5; for convenience we use simple concept names, and "S-WORK" is the "SUITABLE-FOR-WORK" schema in the domain ontology.

    [POLO]   [EDGE]   [REFINE]   [ENOUGH]   [S-WORK: *x]

    Fig. 5. RDF graph after word-conceptualization
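A sketch of the generating side of word-conceptualization, reusing the illustrative helpers above; the open-word test and the ontology lookup are simplifying assumptions.

    OPEN_TAGS = {"NN", "NNS", "JJ", "VB", "VBZ", "VBN", "RB"}   # illustrative tag set

    def conceptualize(link_structure, graph, instances, ontology):
        """One context vector per open word (line 4 for i = 1); each
        classified vector becomes a concept node in the RDF graph."""
        for word, pos, left_link, right_link in link_structure:
            if pos not in OPEN_TAGS:      # closed words wait for link-folding
                continue
            vector = (word, pos, left_link, right_link)
            sense_id = classify(vector, instances)      # e.g. "WN16-2-330153"
            if sense_id is not None:
                graph.add_concept(ConceptNode(ontology[sense_id], sense_id))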
3.2 Link-Folding

I. Function

This and the following operation focus on creating the semantic relations between the concepts. Closed words (especially prepositions), together with their links, imply relationships between the concepts of the words they connect. In the example sentence, the fragment "… polo --- with --- edge …" implies a PART relation between [POLO:#] and [EDGE]. We therefore "fold" the word 'with' and its left and right links and replace them with a PART relation. This is just what the link-folding operation does.

Closed words with their links representing semantic relations can be seen as word usage patterns. In domain specific sentences such patterns are expected to occur frequently, which enables the machine to learn the patterns from the training corpora. In addition, since semantic ambiguities are rare in domain specific sentences, the result of the learning can be expected to converge on the correct relation. A similar analysis also applies to the next operation, relationalization, described in section 3.3.

II. Training

In the training phase, the domain expert can select any closed word that connects two concepts and implies a semantic relation, and map it to the corresponding semantic relation from the relation ontology (for brevity, we omit the direction of relations here). The context vector for this operation may encode context information such as the POS tag of the closed word, the left and right link types, and the two concepts. For the "… polo --- with --- edge …" case, the context vector may be <IN, Mp, Js, POLO, EDGE>, where the POLO and EDGE in the vector are actually concept IDs in the domain ontology (we use the same convention in the following vector examples). The category is the encoding of the ID of the PART relation in the domain ontology.

III. Generating

In the generating phase, generating all possible context vectors for this operation (line 4 of Algorithm 1) means generating one context vector for every possible case in which a closed word connects two concepts; this needs to consult the concept information generated by the word-conceptualization operations. If an operation is needed for a vector, the vector is sent to the ML engine for classification, and the returned category is an encoding of a relation ID in the domain ontology. In line 9 of Algorithm 1, the RDF generator retrieves the relation from the domain ontology according to the ID and creates the relation between the two concepts.

For the example sentence there are three closed words that need the link-folding operation: 'with', 'is' and 'for', as shown in Fig.4. Among them, the word 'is' is an auxiliary verb, while 'with' and 'for' are prepositions. The relation implied by the auxiliary verb 'is' is THEME, and the 'for' between 'refined' and 'work' implies a RESULT relation. The RDF graph after this step has relations added between the concepts; for our example sentence the graph has grown to Fig.6.

    [POLO:#] -PART-> [EDGE]
    [REFINE] -THME-> [POLO:#]
    [REFINE] -RSLT-> [S-WORK: *x]
    [ENOUGH]

    Fig. 6. RDF graph after link-folding
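Once the relation has been classified, the fold itself is mechanical. A sketch in the same illustrative style as before:

    def fold(graph, pos, left_link, right_link, c1, c2, instances, ontology):
        """Fold a closed word and its two links, c1 --left-- w --right-- c2,
        into one relation arc between the concepts of c1 and c2."""
        vector = (pos, left_link, right_link, c1.sense_index, c2.sense_index)
        relation_id = classify(vector, instances)
        if relation_id is not None:
            graph.relate(c1, ontology[relation_id], c2)

    # "... polo --Mp-- with --Js-- edge ..." should classify to the PART
    # relation, yielding the arc [POLO:#] -PART-> [EDGE] of Fig. 6.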
3.3 Relationalization

I. Function

A semantic relation can also be implied by a link that directly connects two concepts in the link structure. For example, the 'MVa' link between 'refined' and 'enough' in the link structure of the example sentence implies a MANNER relation. The relationalization operation translates this kind of link into the corresponding semantic relation.

II. Training

In the training phase, the domain knowledge expert can select any link that implies a semantic relation between the concepts it connects. The expert then selects the semantic relation for the two connected concepts from the domain ontology. The context vector for this operation can include information such as the link type and the concepts; for "… refined --- MVa --- enough …", the context vector may be <MVa, REFINE, ENOUGH>. The category for the context vector is again encoded as a relation ID in the domain ontology; for the above vector it is the ID of the MANNER relation.

III. Generating

In the generating phase, generating all possible context vectors for this operation (line 4 of Algorithm 1) means generating one context vector for every link that connects two concepts. If an operation is needed for a vector, it is sent to the ML engine for classification, and the returned category is an encoding of a relation ID in the domain ontology. In line 9 of Algorithm 1, the RDF generator retrieves the relation from the domain ontology according to the ID and creates the relation between the two concepts.

After this step, more relations may have been created in the RDF graph. For the example sentence, the MANNER relation is created to connect the [REFINE] concept and the [ENOUGH] concept, and the whole graph grows to Fig.7.

    [POLO:#] -PART-> [EDGE]
    [REFINE] -THME-> [POLO:#]
    [REFINE] -RSLT-> [S-WORK: *x]
    [REFINE] -MANR-> [ENOUGH]

    Fig. 7. RDF graph after relationalization

3.4 Integration

Integration is the last step (line 16) of Algorithm 1. This step is not part of the training phase; it appears only in the generating phase, and it is the only step that uses manually constructed heuristics. What it does includes simple co-reference detection and nested graph creation.

In the discussion of the previous three operations we did not involve lambda expressions, for brevity. In fact, they may appear when the words for concepts are missing from the sentence, and they may also be introduced when a schema is selected in the word-conceptualization phase. To complete the RDF graph, we need to draw co-reference lines between the variables in these lambda expressions. Although machine-learning based approaches to co-reference detection exist [9], in this work we focus mainly on the generation of the RDF graph for a single sentence; discourse analysis and co-reference detection are left for separate research. Different heuristics may be constructed for different domains; in our current work we simply make all undetermined references point to the topic currently under discussion.

Nested graphs (contexts) may be introduced by expanding a schema definition or by removing the modal/tense modifiers of a concept. Although the RDF specification lacks a clear semantics for RDF reification, we currently use the RDF reification mechanism to represent nested graphs (contexts).

In our example, we mentioned in section 3.1 that the concept type S-WORK is actually the "SUITABLE-FOR-WORK" schema from the domain ontology, and we can expand it. Fig.8 gives the definition of the "SUITABLE-FOR-WORK" schema, in which SUTB represents the relation SUITABLE.

    type SUITABLE-FOR-WORK(x) is [CLOTHES: x] -SUTB-> [WORK-SITUATION]

    Fig. 8. The definition of SUITABLE-FOR-WORK

After the expansion, we can do a simple co-reference detection that draws a co-reference line between the undetermined variable x and the current topic [POLO:#]. After this step the final graph is generated; Fig.9 shows the result for our example sentence "The polo with an edge is refined enough for work".

    [POLO:#] -PART-> [EDGE]
    [REFINE] -THME-> [POLO:#]
    [REFINE] -MANR-> [ENOUGH]
    [REFINE] -RSLT-> ([CLOTHES: x] -SUTB-> [WORK-SITUATION]), with x co-referenced to [POLO:#]

    Fig. 9. The final RDF graph of the example sentence
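Our current heuristic is deliberately simple; a sketch, where the way the topic and the unresolved variables are recognized is an assumption made for illustration:

    def integrate(graph):
        """Point every undetermined variable (e.g. the x left by expanding
        SUITABLE-FOR-WORK(x)) at the topic currently under discussion."""
        topic = graph.concepts[0]                   # assume the topic, e.g. [POLO:#]
        for node in graph.concepts:
            if "*" in node.label:                   # an unresolved variable
                graph.relate(node, "COREF", topic)  # draw the co-reference line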
3.5 Summary

Through sections 3.1 to 3.4 we have explained how we map a link structure to an RDF graph and cast the mapping as a machine learning problem. Word-conceptualization builds the concepts of the RDF graph; link-folding and relationalization connect the concepts with semantic relations; and in the last step we use manually constructed heuristics to do simple co-reference detection and nested graph creation.

4 Results

We have developed a prototype called the ALPHA system, written in C and currently running on Solaris. It can be trained for different domains; in our current work, a clothes domain is chosen as the sample domain. Nearly 300 clothes descriptions (about 500 sentences) have been collected from clothes shops on the Web (these online shops include www.brooksbrothers.com and www.gap.com, among others) and used to train the ALPHA system. Among them, 34 descriptions (93 sentences) are reserved for testing. The test results are shown in Fig.10: using different IBL algorithms, the accuracy of concepts varies from 60% to 80%, and the accuracy of relations from 45% to 60%. Here the accuracy of concepts is the number of concepts annotated correctly divided by the total number of concepts, and the accuracy of relations is the number of links annotated correctly divided by the number of links that should be annotated. The results demonstrate the feasibility of our approach.

    Fig. 10. The accuracy of concepts and relations for the different algorithms
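Stated as code, the two accuracy measures are simply:

    def concept_accuracy(concepts_correct, concepts_total):
        return concepts_correct / concepts_total

    def relation_accuracy(links_correct, links_expected):
        """links_expected: the number of links that should be annotated."""
        return links_correct / links_expected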
Link Grammar has a strong impact on the accuracy of the ALPHA system. Although its characteristics make it relatively easy to add domain grammar, it has some trouble disambiguating the syntactic structure of over-abridged sentences in the clothes domain, such as "Back vent.", which causes serious failures in the ALPHA system. Though we are aware of the problem, we let it be for the present because we want to pay more attention to semantic disambiguation.

To improve the accuracy of the ALPHA system, we are considering developing new algorithms that can compute the distances between vectors more accurately. We are also considering changes to the architecture to support the analysis of clauses and idioms. Furthermore, other application domains will be selected to test our approach.

5 Related works

Ontology-based annotation is the most studied approach, e.g. [15,16]. [15] extends HTML with semantic extensions and builds an interactive, graphical tool to help annotate web pages manually; what it does is associate an object in HTML with a concept from an ontology. After gaining experience from manual annotation, its authors also conceive an information-extraction based approach to the semi-automatic annotation of natural language texts that maps terms to ontological concepts. Different from this, our approach is fully automatic after the training phase, and it generates the semantic markup in standard RDF format.

In natural language annotation, grammar-based approaches are often used. They can roughly be divided into slot-filling and structure-mapping categories according to their generating techniques. Slot-filling techniques such as [12] fill template semantic graphs with thematic roles identified in the parse tree. Often the graph of one node in the syntactic parse tree is constructed from the graphs of its child nodes according to construction rules on how to fill the slots. Although this approach has been successfully applied in many applications, it depends heavily on construction rules created manually over the parse tree.

Another kind of technique advanced in previous work is to map directly between syntactic structure and semantic structure, such as [13]; we call these techniques structure-mapping. In this respect they are more similar to our work. To map to the flatter structures of conceptual graphs, [13] uses syntactic predicates to represent the grammatical relations in the parse tree; in our work, Link Grammar is instead employed to obtain a flatter structure directly. Also different from the approach of [13], our work does not use manual rules. Moreover, we separate the semantic mapping into several steps, which greatly reduces the total number of possibilities. In another work [14], the parse tree is first mapped to a "syntactic conceptual graph", which is then mapped to a real conceptual graph. This approach again makes heavy use of manually constructed mapping rules.

Up to now, most annotation methods are manual or depend heavily on rules created manually. Such methods will have difficulty applying to the Web because of its tremendously large number of pages. Our approach provides an automatic way to annotate pages faster and more robustly. Research on machine learning in natural language processing using corpus data [6] has increased significantly, and there is a growing number of successful applications of symbolic machine learning techniques [10,11]. Our work presents a preliminary inquiry into the use of traditional machine learning techniques to automatically generate semantic markup for domain specific sentences. We expect that many improvements are possible and that our work may be selectively adopted or enhanced.

References

1. Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch, TiMBL: Tilburg Memory Based Learner version 3.0 Reference Guide, March 8, 2000.
2. Daniel D. Sleator and Davy Temperley, Parsing English with a Link Grammar, in the Third International Workshop on Parsing Technologies, August 1993.
3. Naomi Sager, "Sublanguage: Linguistic Phenomenon, Computational Tool," in R. Grishman and R. Kittredge (eds.), Analyzing Language in Restricted Domains: Sublanguage Description and Processing, Lawrence Erlbaum, Hillsdale, NJ, 1986.
4. R. Kittredge and J. Lehrberger, Sublanguage: Studies of Language in Restricted Semantic Domains, Walter de Gruyter, Berlin and New York, 1982.
5. Information about the link parser from Carnegie Mellon University is available at http://link.cs.cmu.edu/link/index.html
6. Raymond J. Mooney and Claire Cardie, Symbolic Machine Learning for Natural Language Processing, tutorial at ACL'99, 1999. Available at http://www.cs.cornell.edu/Info/People/cardie/tutorial/tutorial.html
7. George A. Miller, WordNet: An On-line Lexical Database, International Journal of Lexicography, Vol.3, No.4, 1990.
8. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz, Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, 19:313-330, 1993.
9. J. McCarthy and W. Lehnert, Using Decision Trees for Coreference Resolution, in C. Mellish (ed.), Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1050-1055, 1995.
10. Claire Cardie and Raymond J. Mooney, Machine learning and natural language (introduction to special issue on natural language learning), Machine Learning, 34, 5-9, 1999.
11. E. Brill and R.J. Mooney, An overview of empirical natural language processing, AI Magazine, 18(4), 13-24, 1997.
12. W.R. Cyre, J.R. Armstrong, and A.J. Honcharik, Generating Simulation Models from Natural Language Specifications, Simulation, 65:239-251, 1995.
13. Paola Velardi et al., Conceptual Graphs for the analysis and generation of sentences, IBM Journal of Research and Development, 32(2), pp.251-267, 1988.
14. Caroline Barrière, From a Children's First Dictionary to a Lexical Knowledge Base of Conceptual Graphs, Ph.D. thesis, School of Computing Science, Simon Fraser University, 1997. Available at ftp://www.cs.sfu.ca/pub/cs/nl/BarrierePhD.ps.gz
15. M. Erdmann, A. Maedche, H.-P. Schnurr, and Steffen Staab, From manual to semi-automatic semantic annotation: about ontology-based text annotation tools, in P. Buitelaar and K. Hasida (eds.), Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content, August 2000.
16. Michael Schenk, Ontology-based semantical annotation of XML, Master's thesis, Universität (TH) Karlsruhe, 1999.
17. James Allen, Natural Language Understanding, 2nd edition, pp.24-25, Benjamin/Cummings Publishing, 1995.
18. John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks Cole Publishing Co., Pacific Grove, CA, 2000.
19. Carol Liu, Towards A Link Grammar for Chinese, submitted for publication in Computer Processing of Chinese and Oriental Languages - the Journal of the Chinese Language Computer Society.