Semantic Annotation of Texts
with RDF Graph Contexts
H. Cherfi1 , O. Corby1 , C. Faron-Zucker1,2, K. Khelif1 and M.T. Nguyen1
1
INRIA Sophia Antipolis - Méditerranée
2004 route des Lucioles - BP 93
FR-06902 Sophia Antipolis cedex
{Hacene.Cherfi,Olivier.Corby,Khaled.Khelif}@sophia.inria.fr
2
I3S, Université de Nice Sophia Antipolis, CNRS
930 route des Colles - BP 145
FR-06903 Sophia Antipolis cedex
Catherine.Faron-Zucker@unice.fr
Abstract. The basic principle of the Semantic Web carried by the RDF
data model is that many RDF statements coexist all together and are uni-
versally true. However, some case studies imply contextual relevancy and
truth - this is well known in the Conceptual Graph community and han-
dled through the notion of contexts. In this paper, we present an approach
and a tool for semantic annotation of textual data using graph contexts.
We rely on both Natural Language Processing and Semantic Web tech-
nologies and propose a model of RDF contexts inspired by the nested Con-
ceptual Graphs. Sentences are primarily analysed and their grammatical
constituents (subject, verb, object) are extracted and mapped to RDF
triples. Links between these triples are then established within a seman-
tic scope (i.e., context). The context definition allows us to validate the
generated annotations by disambiguating the misleading RDF triples. We
show how far our approach is applicable to texts in Engineering Design.
1 Introduction
The semantic annotation of texts consists in extracting semantic relations be-
tween domain relevant terms in texts. Several studies address the problem of cap-
turing complex relations from texts - more complex relations than subsumption
relations between terms identified as domain concepts. They combine statistical
and linguistic analyses. The main applications are in the biomedical domain [1]
by relating genes, proteins, and diseases. Basically, these approaches consist of
the detection of new relations between domain terms; whereas in the seman-
tic annotation generation, we aim to identify existing relations, belonging to
the domain ontology, within instances in texts and to complete them with the
description of the domain concepts related by these identified relations.
The core issue of the methodology we propose stands in the mapping between
grammatical elements of each sentence in the analysed text and the correspond-
ing entities in the dedicated-domain ontology. We base upon the MeatAnnot
approach previously designed to support text mining and information retrieval
in the biological domain [2]. It consists of: (i) the detection of relations described
in a biomedical ontology, (ii) the detection of terms linked by the identified re-
lations based on term linguistic roles (subject, object, etc.) in the sentence, and
(iii) the generation of a corresponding annotation of the analysed biomedical
text. We generalize this approach (a) by handling any domain ontology associ-
ated to the text to analyse: we do not restrict to the biomedical ontology and
rather propose a domain independent approach; (b) by distinguishing between
the ontological level and the instance level when linking a term in the text to
the ontology: a term is identified to an instance of a concept rather than to
the concept itself; (c) by enriching the extracted instances of conceptual rela-
tions with contextual knowledge. We rely upon the Corese3 semantic search
engine [3] which implements the RDF [4] graph-based knowledge representation
language and the SPARQL query language [5]. Moreover, Corese was extended to
handle RDF contextual metadata, hereafter called contexts.
SPARQL is provided with query patterns on named graphs enabling to choose
the RDF dataset against which a query is executed. This is a first step to handle
contextual metadata. A named graph can be used to limit the scope of an RDF
statement to the context in which it is relevant to query it. Furthermore, by
naming contextualized RDF graphs, they can be themselves associated with RDF
metadata, enabling querying on several “levels” of (meta-)annotations. This is
close to the notion of nested graphs in the Conceptual Graphs model [6]. We
base upon a feature proposed in [7] to declare RDF sources and we use it to
handle named RDF graphs representing different contexts. Corese is provided
with two RDF/SPARQL design patterns and SPARQL extensions to represent and
query contexts. A first pattern is dedicated to the handling of a hierarchical
organization of RDF graphs which can represent inclusions of contexts [8]. The
second pattern is described in this paper and addresses the problem of querying
for the contextual relations holding between recursively nested contexts. We
take advantage of these Corese features to make explicit the rhetorical relations
contained in texts and represent them in the semantic annotations as relations
between RDF graph contexts. The methodology we present is implemented and
applied to the Engineering Design domain within the framework of the European
project SevenPro [9].
This paper is organised as follows. We give in section 2 the Natural Language
Processing (NLP) technique we use to annotate a given text with RDF triples by
relating terms occurring in the text. We introduce in section 3 the Corese de-
sign pattern we use to represent and handle nested contexts. We show how we
use it to enrich our primary text annotations. We explain how these contextual-
ized annotations provide further information retrieval capabilities when applied
to Engineering Design domain. Related work is discussed in section 4. Finally,
concluding remarks are provided in section 5.
3
http://www.inria.fr/acacia/soft/corese
2 NLP-Driven Semantic Annotation of Texts
Extraction of relations from texts We use the RASP [10] parser for English
texts in order to extract NLP relations (i.e., verb) and their arguments (i.e.,
subject, object). The RASP parser is in charge of assigning a grammatical category
to each word by constructing a syntactical tree of each sentence of the text. For
example, let us consider the following simple sentence S as our running example
throughout this paper:
S: The L1 luggage compartment contains 100cc.
Hence, we give a simplified RASP syntax tree in Table 1. The sentence S consists
of: (1) noun phrase NP, on the left branch of the syntactical tree, which represent
the subject subj: determiner and two modifiers; and (2) verbal phrase VP, on
the right hand side, constituted of the main verb and the direct object dobj.
Table 1. Simplified RASP syntax tree for the running example sentence S
p8 S hPPPPP
ppppp PPP
pp PPP
p PPP
ppp
NP (subj) VP (verb + dobj)
]<< O ^<<
mmm
6
m mm << <<
mm m << <<
mmm << <<
NP((det + mod) + mod) << <<
O hQQQ << <<
QQQ << <<
QQQ << <<
QQQ < <
The L1 luggage compartment contain + s 100cc
Mapping of grammatical constituents to RDF triples Let us show on
the running example the correspondence between a sentence and its translation
to an RDF graph triple. Provided that the domain ontology conveys the follow-
ing knowledge (as it is the case of the ontology we have built for the SevenPro
project): a Luggage compartment is part of a Car; a Luggage compartment
is related to a Capacity; property contain has for rdfs:domain Car parts
(i.e., Luggage compartment, Door, etc.); property contain has for rdfs:range
a Capacity unit. We can state that the triple L1, contain, 100cc is a valid
instance of property contain and we add it to the text annotation set. The
RDF/XML syntax of this statement is given in Table 2 (the spro namespace iden-
tifies the SevenPro ontology).
From simple- to complex-sentence semantic annotation We showed
above how we generate RDF annotations for simple sentences with grammati-
cal patterns subject, verb, object, hereafter called S − V − O (some possible
ambiguity conveyed by the textual material put aside). Here we discuss the
Table 2. From RASP output to RDF triples
The L1 luggage compartment contains 100cc.
RASP syntactic tree analysis RDF annotation
("S"
("NP" ("NP" "The" "L1") "luggage" "compartment")
("VP" "contain::s" ("NP" "100cc"))
".")
handling of more complex sentences and the annotations which we generate. In
addition to the S − V − O (sentence in active form) and O − V − S (sentence in
passive form) grammatical patterns, we correctly parse and annotate sentences
with subordinate phrases when these phrases are “independent” from the main
sentence.
However, for other complex sentences, the semantics of the connection be-
tween the subordinate and the main sentence is not so simple and cannot be cap-
tured in RDF –which is limited to the representation of conjunctive knowledge.
It is, for instance, the case of disjunctive sentences where alternative statements
co-exist in implicit different contexts. It is also the case when rhetorical relations
play a key role in the sentences to be annotated, like the following one including
a conditional premise: “If the car C3 has part door D4, then the 100cc are con-
tained in the L1 luggage compartment.”, or this other one containing a causal
premise: “The L1 luggage compartment capacity contains 100cc because the car
C3 has part door D4.”. In some applications, it constitutes a major problem and
may lead to a deadlock issue when querying the RDF graph with SPARQL. Hence,
we define the so-called RDF graph context, with recursive capability, in order to
tackle the current expressiveness capability lack.
3 Extension of SPARQL to Handle Contextual Relations
and Nested Contexts
3.1 RDF graph context definition
The SPARQL query language [5] offers capabilities for querying by graph patterns.
The retrieval of solutions (i.e., RDF triple sets) is based on graph pattern match-
ing, close to Conceptual graphs (CG) projection. A SPARQL query is executed
against an RDF dataset which represents a collection of graphs. The SPARQL key-
word GRAPH is used as primitive to match patterns against named graphs in the
query of the RDF dataset, as shown hereafter:
1. SELECT * WHERE {
2. GRAPH ?s1 {?x c:prop ?y}
3. }
In line 2 of this example, we can state that the pattern graph ?s1 {?x c :
prop ?y} is named as graph ?s1. It can provide a URI to select one graph or
use a variable which will range over the URIs of named graphs in the dataset. A
complementary feature is proposed in [7] and implemented in Corese to declare
RDF sources. For instance, We can define the source of the graph, as in line
1 below cos : graph = ”http : //www.sevenpro.org/car/ctx1”, for the following
RDF triples corresponding to the sentence with subordinate: “The L1 luggage
compartment, that contains 100cc, is separated from tailgate T2.”. This graph
source is used as the context ctx1 for these triples within SevenPro car domain.
1. cos:graph="http://www.sevenpro.org/car/ctx1"
2. {
3. spro:#T2 spro:separate spro:#L1
4. spro:#L1 spro:contain spro:#100cc
5. }
In RDF/XML syntax, the first triple in line 3 above can be written extensively as:
We use the SPARQL GRAPH primitive to handle RDF named graphs repre-
senting different contexts within which alternative metadata can be described.
Furtehrmore, we provide an extension of SPARQL to query for contextual rela-
tions holding between recursively nested contexts. Once contextual knowledge is
represented into RDF named graphs identified by URIs and queried with GRAPH
query patterns, these graphs can themselves be described into other separate
named graphs. This process of meta-annotating named graphs identifying con-
texts leads to a recursive nesting of contexts –contexts nested one into another.
This is of prime interest for use cases where context graphs are annotated with
rhetorical or temporal relations. The unstacking of contexts should make explicit
the progress in which nested graphs are involved.
We propose an extension of SPARQL with a REC GRAPH keyword whose gram-
mar rule is similar to the standard SPARQL GRAPH one. The following query en-
ables to retrieve the triples from nested graphs related to a given contextual rela-
tion c_Rel. Moreover, all sub-properties of c_Rel –following rdfs:subPropertyOf
subsumption relations having c_Rel as value in the RDFS ontology– are matched
with the SPARQL query.
SELECT * WHERE {
REC GRAPH ?s {?gr1 c_Rel ?gr2} .
}
In addition, when the property is not specified, e.g., a variable ?p replacing
c_Rel, Corese retrieves the RDF triples having any property (cf. details in [11]).
3.2 Application example to Engineering design domain
We have used Corese Graph context capabilities within Sevenpro textual corpus
in Engineering Design and the subsequent spro ontology. We show the practical
use of the contexts for giving additional metadata with a sentence of the form:
If [C1] then [C2], unless [C3]. Then, we show how to improve the SPARQL
triple set results with corresponding context-augmented SPARQL queries. We
comment the RDF graph context representation, we justify the SPARQL queries,
followed by a presentation of the possible RDF triple results. Moreover, in the
sentence depicted in Table 3, we show the use of nested contexts. In the second
column of Table 3, we describe the corresponding RDF triples for the sentence aug-
mented with RDF graph contexts g1 to g3. The third column describes how these
graphs are defined as URI resources (with rdf:Description syntax) and nested
within nesting graphs c1 and c2 through the domain relations spro:then and
spro:unless. In so doing, we are able to query, with context-augmented SPARQL
language using the keyword REC GRAPH. Then, Corese matches the triples in the
RDF graph corresponding to triples matching the contextual relations spro:then
and spro:unless. We extensively obtain the triples shown in column three of
Table 3, (lines 3 to 5 in the result part), alongside with the contextual relations
spro:then and spro:unless (first two lines in the result part). We show the
context-augmented triple results compared to the mere results which we query
with standard SPARQL without contexts.
Table 3. Result analysis example in Engineering design domain
Sentence RDF triple with context Context relation
If the vehicle V2 ctx:g1 {
satisfies the ctx:c1 {
requirement R1,
then inlet headliner
H3 should be lifted
by metal bar B4, }
unless H3 is in }
position P5. ctx:g2 {
ctx:c2 {
}
}
ctx:g3 {
}
SPARQL query Context-augmented SPARQL query
SELECT * SELECT ?g ?x ?p ?y
WHERE {?x ?p ?y} WHERE {
REC GRAPH c2 {?w ?q ?z} }
Triple results of SPARQL query Context-augmented triple results
#V2 spro:satisfy #R1 1. ctx:c1 ctx:g1 spro:then ctx:g2
#B4 spro:lift #H3 2. ctx:c2 ctx:c1 spro:unless ctx:g3
#H3 hasPosition #P5 3. ctx:g1 #V2 spro:satisfy #R1
4. ctx:g2 #B4 spro:lift #H3
5. ctx:g3 #H3 hasPosition #P5
The named graphs in the sentence of Table 3 are nested as it is shown in
Fig. 1. They are organised in the hierarchy of contexts: [c1] : [g1]then[g2];
[c2] : [c1]unless[g3]. Hence, we can relate the RDF triple “a p b” to “c q d”
by traversing the hierarchy of Fig. 1. In so doing, the semantics of the example
sentence is fully captured with annotation capability of nested graph contexts.
c2
c1 unless
then
g3
g2 cqd
g1
apb
Fig. 1. In Table 3 sentence: g1 and g2 are nested in c1, which is nested, with g3, in c2.
4 Discussion and Related Work
The mechanism introduced by RDF graph contexts is powerful enough to rep-
resent a variety of NL expressions. First, with the RDF context expressiveness,
we can represent the logical disjunction or, the negation not as RDF graph con-
texts. Moreover, we can describe the modal primitives can, may, as in: The
headliner may be projected beyond the vertical of the external surface. There are
a number of other relations which we can model: temporal (i.e., after, mean-
while, etc), spatial (i.e., below, behind, etc.), comparative (i.e., more... than,
etc.). Presently, we fail to model the correct annotations of sentences having
an ambiguous subject/object constituents. Moreover, a variant in the exam-
ple sentence raises the still-open problem of anaphora resolution in NLP. The
inlet headliner H1 should be lifted by metal bar B2 [. . . ] unless it is in position
P5; where the pronoun it represents H1.
In the Semantic Web domain, the work of [12] addresses the problem of
provenance and trust on the web and proposes an extension of RDF to handle
RDF graphs named by URIs, enabling RDF statements describing RDF graphs. The
notion of context is used in [13] to separate statements that refer to different
contextual information. They describe a practical solution to explicitly tie con-
textual information to RDF statements. They identify SPARQL as the query lan-
guage satisfying their requirements with its patterns on named graphs, however
they do not propose any extension of RDF or SPARQL representation paradigms.
5 Conclusion and Future Work
The objective of this paper is twofold: (i) to show how we generate accurate RDF
triples from texts using NLP techniques, and (ii) to augment the semantic annota-
tion generation with RDF graph context metadata in order to catch the semantics
of the analysed texts, and consequently to enhance the retrieval capabilities. Lin-
guistic analysis is used to suggest appropriate annotations to the text. The text
analysis process strongly depends on the background knowledge (i.e. ontologies,
terminology, etc.) of the analysed domain. The more precise ontologies and re-
lated terminology - list of domain terms, e.g. car manufacturer names, etc. -,
the more significant the extracted annotations are. We have started to generate
RDF annotation triples from simple (S − V − O) sentences. Then, a number of
features were designed to generate more complex annotations, e.g., sentences
containing subordinate phrases. Based upon the context graph capability, we
have shown new capabilities of high usefulness in the query of the graph by us-
ing named graphs and nested contexts. The RDF graph context paradigm can be
used recursively. Hence, the text annotation allows us to produce the accurate
corresponding semantic annotation. Finally, our approach is domain indepen-
dent. The analysis process remain the same provided that ontologies have been
adapted according to the text domain.
In the future, we aim at developing more complex sentence analysis follow-
ing the rhetorical relations studied in RST [14] based on the RDF graph context
expressiveness. In so doing, a more precise evaluation can be conducted.
References
1. Staab, S.: Mining information for functional genomics. IEEE Intelligent Systems
and their Applications 7 (March-April 2002) 66–80
2. Khelif, K., Dieng-Kuntz, R., Barbry, P.: An ontology-based approach to support
text mining and information retrieval in the biological domain. Journal of Universal
Computer Science (JUCS) 13(12) (2007) 1881–1907
3. Corby, O., Dieng-Kuntz, R., C.Faron-Zucker: Querying the semantic web with the
corese search engine. In: In Proc. of the 16th Eur. Conf. on Artificial Intelligence
ECAI’04/PAIS’04, Valencia, Spain, IOS Press (2004) 705–709
4. Manola, F., Miller, E., McBride, B.: rdf primer. Technical report, W3C Recom-
mendation (2004) w3.org/TR/2004/REC-rdf-primer-20040210/.
5. Prud’hommeaux, E., Seaborne, A.: sparql query language for rdf. Technical
report, W3C Recommendation (2008) www.w3.org/TR/rdf-sparlq-query/.
6. Chein, M., Mugnier, M.L., Simonet, G.: Nested Graphs: A Graph-based Knowledge
Representation Model with FOL Semantics. In: Proc. of the 6th Int’l Conf. on
Principles of Knowledge Representation and Reasoning (KR’98), Trento, Italy,
Morgan Kaufmann Publishers (June 1998) 524–534
7. Gandon, F., Bottollier, V., Corby, O., Durville, P.: RDF/XML Source Declaration.
In: Proc. of IADIS WWW/Internet, Vila Real, Portugal (2007) 5 pages
8. Corby, O., Faron-Zucker, C.: Implementation of SPARQL Query Language based
on Graph Homomorphism. In: Proc. of the 15th Int’l Conf. on Conceptual Struc-
tures (ICCS’07), Sheffield, UK, IEEE Computer Science Press (July 2007) 472–475
9. SevenPro: Semantic virtual engineering environment for product design European
Special Targeted Research Project: FP6-027473, www.sevenpro.org.
10. Watson, R., Carroll, J., Briscoe, T.: Efficient extraction of grammatical relations.
In: Proc.of the Ninth International Workshop on Parsing Technologies (IWPT),
Vancouver, Association for Computational Linguistics (October 2005) 160–170
11. Corby, O.: Web, Graphs & Semantics. In: Proc. of the 16th In’l Conf. on Conceptual
Structures (ICCS), Toulouse (July 2008)
12. Carroll, J., Bizer, C., Hayes, P., Stickler, P.: Named Graphs, Provenance and Trust.
In: Proc. of the 14th WWW Conf. Volume 14., Chiba, Japan (2005) 613–622
13. Stoermer, H., Palmisano, I., Redavid, D., Iannone, L., Bouquet, P., Semeraro, G.:
rdf and Contexts: Use of sparql and Named Graphs to Achieve Contextualiza-
tion. In: Proc. of the 1st Jena User Conference, Bristol, UK (2006) 613–622
14. Mann, W.C., Matthiessen, C.M., Thompson, S.A.: Rhetorical Structure Theory
and text analysis. In: Discourse Description: Diverse Linguistic Analyses of a
Fund-Raising Text. John Benjamins (1992) 39–78