=Paper=
{{Paper
|id=Vol-99/paper-4
|storemode=property
|title=Using RDF to Annotate the (Semantic) Web
|pdfUrl=https://ceur-ws.org/Vol-99/Phil_Cross-et-al.pdf
|volume=Vol-99
|dblpUrl=https://dblp.org/rec/conf/kcap/CrossMP01
}}
==Using RDF to Annotate the (Semantic) Web==
Phil Cross, Institute for Learning & Research Technology, University of Bristol, Bristol, UK, phil.cross@bristol.ac.uk

Libby Miller, Institute for Learning & Research Technology, University of Bristol, Bristol, UK, Libby.miller@bristol.ac.uk

Sean Palmer, SWAG, sean@mysterylights.com
Abstract

Collections of annotations need indexes for certain of their properties. When we have a set of created annotations, we want to be able to ask questions concerning these properties such as: agency: "Who said this?"; timing: "Is this the most recent annotation about this object?"; the annotated object itself: "What annotations have been made about this object?"; and content: "Which objects fail this test?". This document examines an approach to creating annotations on the web in two projects, EARL and MedCERTAIN. Both projects use RDF (Resource Description Framework) to model annotations in a flexible and extensible way, so that these types of questions can easily be asked.

Keywords

RDF, reification, trust, XML

INTRODUCTION

When we create annotations or aggregate them together, we will want to access the information they contain in various ways. The structure of the annotation should therefore accommodate the types of questions we will want to ask. Examples of these are queries about:

• The person or agent who made the annotation: "Who said this?" "What else do we know about the person who made this annotation?"

• The date and time when the annotation was made: "Is this the most recent annotation about this object?"

• The object of the annotation: "What annotations have been made about this object?"

• The meaning of the content of the annotation: "Which annotated objects fail this test?" "What are the descriptions of the annotated objects?"

This paper describes two projects which have used RDF (Resource Description Framework) to describe annotations of web pages or parts of web pages: MedCERTAIN [1] and EARL [2].

MedCERTAIN is a project funded by the European Union, which is looking at means for establishing an international trustmark for health information on the Web. MedCERTAIN will be a decentralised system based on the cooperation of individuals and organisations that evaluate, assess, accredit or recommend health information on the Internet. The data model discussed in this paper has been implemented within a system that enables third-party experts to evaluate the quality of health-related Web sites, using a metadata set developed for the project. Other aspects of the project are looking at how such evaluative metadata can usefully be amalgamated with other related RDF-encoded information. The system is currently undergoing evaluation through a test-bed in Finland.

EARL, the Evaluation And Report Language, is an RDF based framework being developed by the Evaluation and Repair Tools group of the Web Accessibility Initiative, a domain of the World Wide Web Consortium. The language started out life as an experiment in producing a standardized language that could be produced by accessibility evaluation tools, but soon grew to become a generic evaluation reporting framework that can be used to make evaluations by anyone, about anything, and against any set of criteria.

Both these projects had very specific requirements about the sorts of questions that could be asked of repositories of annotations. This paper examines these requirements with respect to the four types of questions suggested above, namely: agency, timing, annotated object and content.

AGENCY: "WHO SAID THIS?"

An annotation is an annotation because, in some sense, it is separate from the object being annotated. Its separateness can most often be distinguished by the difference in author; for example the difference between a paper and a criticism, a book and a book review. Its significance can also be determined by its author: a critique of an academic paper can itself be evaluated by the quality of the individual who wrote the criticism. Knowing the author of an annotation can provide plenty of information about the quality of the annotation, either from contextual information already known about the annotator (things written before by them, information known about their experience and interests) or from specific information that is discoverable about them (their academic qualifications, their reputation, their standing in the community).

The MedCERTAIN project faced a problem over how it should store annotations relating to health sites. The project grew out of the subject gateways community, which employs subject specialists to create metadata about Web sites. This model is based on that used for many years by traditional bibliographic services within the library community,
and uses the underlying assumptions that the data will have been entered by a trusted, trained individual, and that the data can be assumed to be correct.

This approach was not appropriate for MedCERTAIN since the expectation was that both publishers of information and expert annotators would provide data about sites. Initially the publisher-provided data is not verified but is still to be made available to end-users. MedCERTAIN consequently needed a system where the metadata output is seen as a set of separate statements, made by a particular person, on a particular date, and which may or may not be accurate. It is initially up to the end-user to choose how much trust to place in the information. For the second step in the trustmarking process, the project needed medical experts who could say further things about the web site relevant to its quality, but who could also verify that the publisher-supplied data was correct (or not), and have a means of indicating this to the end-user. MedCERTAIN is therefore concerned with the provenance of metadata, and with helping end-users to decide how much trust they may place in it.

EARL’s model is based on the requirement that Web accessibility annotations should be made by humans and tools in a machine readable format.

Evaluating for accessibility is a very difficult process, involving many complicated steps and procedures, often with highly ephemeral or ambiguous results. Ratings of the efficacy of things, such as alternative textual content for images, are open to conjecture and opinion. On the other hand, certain sorts of evaluations can be done automatically with tools, which for example can test for the presence of an alt attribute inside an image tag.

In EARL, the context of the evaluation contains information pertaining to the actual creator or generator of the evaluation itself, e.g. giving details of the tool or person running the test, and the platform settings of the machine on which the test was run.

Where the evaluation is automatic, the precise settings of the tools used and the hardware and software run are of crucial importance for the repeatability of the test, for the same reasons that bug-testing software has to specify these characteristics. In the human evaluated case, the complexity and subjectivity of the evaluation mean that who made the evaluation - and by implication the experience, qualifications and other contextual information behind that evaluation - should be as explicit as possible.

Both projects have the requirement that the agency making the annotation should be explicit. This transparency means that individuals who use the MedCERTAIN service or the EARL tools can apply their criteria for scepticism to evaluate the quality of the annotation.

TIMING: "IS THIS THE MOST RECENT ANNOTATION ABOUT THIS OBJECT?"

Objects on the web change frequently, and any metadata about such objects has to take this into account, particularly where the metadata creation process is decentralized as with the EARL and MedCERTAIN projects. The accuracy and therefore the trustworthiness of the annotation is very likely to change as time passes. Because these are third-party annotations, there is no way to ensure that the evaluation remains correct over time without repeated checking, which may not be feasible in terms of person effort.

In the MedCERTAIN annotation model, each time a new piece of information is created about a site, or old information is edited, a new annotation is created. When the system is queried, this new annotation replaces any information about the same aspect of the site created at an earlier date. The query that is made is essentially: "Find me the most recent annotation concerning this metadata for this site".

There are problems with this simple approach, however. What happens if someone wishes only to delete information that is no longer correct, without providing updated information to replace it? Also, what happens when multiple values can legitimately be provided for some aspect of the site - such as for creators/authors - which may be added at different times? In practice, the MedCERTAIN system simply maintains a database that has tables for ‘current’ annotations and ‘superseded’ annotations. Any annotations that are deleted or edited have their old version moved to the ‘superseded’ tables. Only data from the ‘current’ tables are shown to end users. For exporting and amalgamating data, though, MedCERTAIN is experimenting with the idea of generating ‘Retraction’ annotations, which would annotate existing annotations with a statement to the effect that the annotation is now considered to be false.

Nonetheless, by adding a time to an annotation, we implicitly refer to an annotation event that has occurred, and at that time we can say something about the relationship between the content of the annotation and the object annotated.

EARL attaches a date in an ISO standard format to the resource which it is evaluating. So instead of evaluating the resource as it is for all time, it evaluates how it was on a certain date, guaranteeing persistence. EARL also allows you to link to a stored version of the resource as it was on that date.

It is sometimes useful to include descriptions of how the context of the content changes throughout time. For example, what happens when an evaluation is made about a paragraph of text, and later that paragraph of text moves? EARL does have a property that lets the user assert some notion of "equivalence" between two pieces of content, but the exact semantics of this property are not stated.

These properties are not essential for EARL to be functional, but future work on EARL may include adding a range of properties that describe the way a piece of content changes through time to some finer degree of granularity.

With third party annotations the accuracy of the content of the annotation cannot usually be guaranteed; but dating the annotation provides clues to the user about the likely accuracy of the annotation.

THE ANNOTATED OBJECT: "WHAT ANNOTATIONS HAVE BEEN MADE ABOUT THIS OBJECT?"

In order to describe the relationships between annotations and their objects accurately, it is necessary that we point unambiguously at the object of the annotation - that we can give it a name.

RDF (Resource Description Framework) allows you to say anything about anything with a URI. This means that it is particularly suited to web-based annotation applications over HTML at the scale of the HTML document or internal reference. If the document to be annotated is XML, parts of the document can be pointed to using XPointer [3], and so RDF becomes more flexible in what it can be used to annotate.

Both EARL and MedCERTAIN use RDF to model their annotations. At its simplest level, RDF is a series of typed directional links between objects represented by URI references. So one can say:

[http://example.com/documentA.html, annotates,
http://example.com/documentB.html]

rather like the way in which the HTML anchor tag is an untyped link between two documents [4]. This approach to annotating HTML documents with other HTML documents using typed links is similar to that used by the W3C’s Annotea project [5], which makes an annotation object with a body (an (X)HTML document which annotates another (X)HTML document - see the example in Figure 1, taken from [6]).

Figure 1. Annotea annotation

The difficulty with pointing into documents, or pointing at the first paragraph of the second page, say, is that XPointer and similar methods depend on the syntactic representation of the document rather than its meaning. In practice, this means that if the document changes, the meaning of the pointers can change, which could be an accidental result of making any edits to the document. This is part of a more general problem of the changing web. Documents on the web change, and documents may change their location on the web, even though ‘Cool URIs don’t change’ [7].

RDF is very well suited to pointing at annotated objects on the web, but because it depends on the names given to objects by their creators, it cannot guarantee that these objects will not disappear or change.

CONTENT: "WHICH ANNOTATED OBJECTS FAIL THIS TEST?", "WHAT ARE THE DESCRIPTIONS OF THE OBJECTS?"

RDF can do more than associate one page to another with a typed link. RDF can be used to say something about the meaning of any given annotation in machine-readable form. One early application of this was used in the Desire project for a shared bookmarks server [8]. In this project an annotation was a description of a web page. The annotation object represented a Web page, and the structured content of the annotation represented the Web page’s title, description, date, and URL, for example, as in Figure 2:

Figure 2. DESIRE annotation

The difficulty with this approach was that some of the semantics were implicit, namely, it looked as if the annotation had the specified title, description and so on, even though these were supposed to represent properties of the annotated Web page. Also, it was unclear that it was the person who made the annotation who gave the properties these values. This approach puts some of the interpretation of the semantics at the application level.

A more transparent approach is to use the apparatus provided by RDF to talk about objects and the links between them as objects in themselves: reification. This mechanism allows us to say:

[PersonA, stated, StatementB]
StatementB: [siteC, hasTitle, D]

Because the statement is itself an object that can be talked about in RDF, we can associate information to this very basic atom of data, such as who made this particular statement and when.

With MedCERTAIN, an annotation object is used to link the RDF statement to the annotated object, and this is given the properties of annotator and date. This allows us to define various kinds of annotation, depending on what type of object is being annotated and the type of statement being made. For example, the publisher can annotate a page with basic metadata such as the title by using a SiteDescription annotation:
Figure 3. MedCERTAIN annotation in which the information stated is a single RDF statement
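The shape of such an annotation object, and the "most recent annotation" query from the timing section, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Annotation` class, its field names, the `most_recent` helper, the predicate `hasTitle` and the example URI are all hypothetical and are not taken from the MedCERTAIN schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# A plain RDF-style statement: (subject, predicate, object).
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a MedCERTAIN-style annotation object."""
    kind: str        # e.g. "SiteDescription"
    annotates: str   # URI of the annotated object
    states: Triple   # the single RDF statement being asserted
    annotator: str   # agency: who said this
    created: date    # timing: when it was said


def most_recent(annotations: List[Annotation], target: str,
                predicate: str) -> Optional[Annotation]:
    """Return the newest annotation about `predicate` for `target`, if any."""
    matches = [a for a in annotations
               if a.annotates == target and a.states[1] == predicate]
    return max(matches, key=lambda a: a.created, default=None)


site = "http://example.org/health-site"  # hypothetical site URI
store = [
    Annotation("SiteDescription", site, (site, "hasTitle", "Old title"),
               "publisher-1", date(2001, 7, 12)),
    Annotation("SiteDescription", site, (site, "hasTitle", "New title"),
               "publisher-1", date(2001, 7, 13)),
]

latest = most_recent(store, site, "hasTitle")
```

Because the annotator and date sit on the annotation object rather than on the stated triple, the same record answers the agency, timing, object and content questions without any application-level interpretation.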
The SiteDescription annotation shown in Figure 3 is used for storing general information about a Web site. This could be basic bibliographic information such as that represented by the Dublin Core metadata set, but may also be information relating to the publisher’s internal quality procedures, compliance with the Web Accessibility Initiative, etc. For this purpose, the MedCERTAIN project is defining a set of quality criteria metadata.

MedCERTAIN also uses medical experts to validate the information provided by the site publisher; to check, if possible, whether the information is correct and sufficient. This evaluator consequently needs to comment upon, or annotate, an existing SiteDescription annotation. The type of annotation designed to do this is called a Validation annotation, and takes a similar form to the SiteDescription annotation. The difference is that the annotated object will be a SiteDescription annotation and the RDF statement that is ‘stated’ consists of only one possible predicate: validation, and one of four possible values: ‘Not checked’, ‘Valid’, ‘Invalid’, or ‘Cannot say’.

The creator object for any of these annotations can also have a type. In our implementation, the creator of a Validation annotation must always be of type evaluator, whereas a SiteDescription annotation may be created by an evaluator or a creator of type publisher. The creator object may, of course, be linked to further information about the creator.
Figure 4. MedCERTAIN validation annotation
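The two constraints described for Validation annotations (a closed set of four values, and an evaluator as creator) could be enforced along the following lines. This is a minimal sketch, not the project's implementation: the class names, field names and the predicate `audience` are assumptions made for illustration.

```python
from dataclasses import dataclass

# The four values the text allows for the 'validation' predicate.
VALIDATION_VALUES = ("Not checked", "Valid", "Invalid", "Cannot say")


@dataclass(frozen=True)
class Creator:
    name: str
    type: str  # "evaluator" or "publisher"


@dataclass(frozen=True)
class SiteDescription:
    site: str
    predicate: str  # hypothetical predicate name
    value: str
    creator: Creator

    def __post_init__(self):
        # Either kind of creator may describe a site.
        if self.creator.type not in ("evaluator", "publisher"):
            raise ValueError("unknown creator type: " + self.creator.type)


@dataclass(frozen=True)
class Validation:
    annotates: SiteDescription  # a Validation annotates an annotation
    value: str
    creator: Creator

    def __post_init__(self):
        if self.value not in VALIDATION_VALUES:
            raise ValueError("not a validation value: " + self.value)
        if self.creator.type != "evaluator":
            raise ValueError("only an evaluator may create a Validation")


publisher = Creator("Example Publisher", "publisher")
expert = Creator("Example Evaluator", "evaluator")
description = SiteDescription("http://example.org/health-site",
                              "audience", "adult patients or consumers",
                              publisher)
verdict = Validation(description, "Invalid", expert)
```

Attempting `Validation(description, "Valid", publisher)` raises `ValueError` in this sketch, which mirrors the rule that publishers cannot validate descriptions.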
There is a third type of annotation used by the MedCERTAIN system called a Comment annotation. This is used to provide further details about the result of the validation if these are required. The Comment annotation can therefore annotate a Validation annotation. The information stated by such an annotation is also restricted to an RDF statement with one allowable predicate: comment, that has a free-text value. Another predicate that we may use in future with this type of annotation might be called, for example, “reference” and be used to contain specific references to supporting evidence.
Figure 5. MedCERTAIN comment annotation
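Since a Comment annotates a Validation, which in turn annotates a SiteDescription, the three annotation types form a chain that ends at the site itself. A rough sketch of walking that chain follows; the `Note` class, the `chain` helper, the predicate `audience` and the site URI are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Note:
    """Hypothetical generic annotation: it annotates a URI or another Note."""
    kind: str
    annotates: Union["Note", str]
    predicate: str
    value: str


def chain(note: Note) -> Tuple[List[str], str]:
    """Follow 'annotates' links until a plain URI is reached."""
    kinds: List[str] = []
    target: Union[Note, str] = note
    while isinstance(target, Note):
        kinds.append(target.kind)
        target = target.annotates
    return kinds, target


description = Note("SiteDescription", "http://example.org/health-site",
                   "audience", "adult patients or consumers")
validation = Note("Validation", description, "validation", "Invalid")
comment = Note("Comment", validation, "comment",
               "Much of the material is not suitable for a lay audience")

kinds, site = chain(comment)
```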
The EARL model is similar but has a slightly different focus. Although it uses the same RDF statements format as MedCERTAIN, meaning that it is as extensible and flexible as RDF, EARL is focused on evaluation with respect to an evaluative principle that can itself be identified on the web. So statements concern whether a web site or part of a Web site passes or fails (or variants of these) with respect to a URL-identifiable test.

An EARL evaluation is an RDF statement, with a context and an assertion. The context of the evaluation contains information pertaining to the actual creator or generator of the evaluation itself, e.g. giving details for the tool or person running the test, and the platform settings of the machine on which the test was run. This set of context information is attached to the node called the "Assertor". In other words, the context properties are attached to the person or tools that ran the test.

The second part of the evaluation is the main assertion. This is a simple 3-ary relationship comprising the resource being evaluated, a result property, and the evaluation criteria that the resource is being tested against. This main assertion is linked to the context using an "asserts" predicate. An example is shown in Figure 6.
Figure 6. EARL example annotation
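The two-part structure (an Assertor carrying context, linked by "asserts" to a 3-ary assertion) can be mimicked with ordinary data structures to show how the content question "Which annotated objects fail this test?" is answered. The sketch below is illustrative only: the class and field names, the `failing_resources` helper and all `example.org` URIs are assumptions, not EARL vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Assertion:
    """The 3-ary relationship: resource, result, test criterion."""
    resource: str  # URI of the resource being evaluated
    result: str    # e.g. "passes" or "fails"
    test: str      # URI identifying the test


@dataclass
class Assertor:
    """The person or tool running the test, with its context properties."""
    name: str
    context: Dict[str, str] = field(default_factory=dict)
    asserts: List[Assertion] = field(default_factory=list)


def failing_resources(assertors: Iterable[Assertor], test: str) -> List[str]:
    """Which evaluated resources fail the given test, per any assertor?"""
    return sorted({a.resource
                   for who in assertors
                   for a in who.asserts
                   if a.test == test and a.result == "fails"})


UL_TEST = "http://example.org/tests#ULTest"  # hypothetical test URI
validator = Assertor(
    "Validator", {"kind": "tool", "platform": "example platform"},
    [Assertion("http://example.org/MyPage", "fails", UL_TEST),
     Assertion("http://example.org/OtherPage", "passes", UL_TEST)])
reviewer = Assertor(
    "Reviewer", {"kind": "person"},
    [Assertion("http://example.org/ThirdPage", "fails", UL_TEST)])

failed = failing_resources([validator, reviewer], UL_TEST)
```

Keeping the context on the assertor rather than on each assertion means a reader can filter results by who or what produced them before deciding how much to trust them.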
Because the test has a URI, RDF can be used to say more about the test, for example, machine-readable expected results information, or a human-readable purpose of the test.

In both models RDF can be used to say more about the person or agent stating the RDF statement, for example, their qualifications (if a person) or their software and hardware (if a machine).

Trust models for annotations of any type are usually context based. It is easy to "trust" certain annotation servers that you know can only be accessed by a trusted entity of some sort. Likewise, if XYZ company produces a report of some content and posts it on the XYZ Website, you can be reasonably certain that this report can be trusted. The question arises when annotations have an unknown state of trust. Digital signatures and Public Key Infrastructure may help to solve some of these problems in the future.

RDF AND ANNOTATIONS

There are various disadvantages with using RDF in general. RDFS (Schema) is currently only a candidate recommendation. There are some difficulties with the semantics of RDF, which are currently being resolved by the RDFCore working group. The syntax of XML/RDF is verbose and can be difficult for humans to read and understand compared with what might be called "vanilla" XML.

However, RDF is ideally suited to modeling annotations because with its node-arc-node structure every RDF link is like an annotation on an object.

RDF was used in MedCERTAIN because of its flexibility and because it can be used to model higher order statements (or ‘statements about statements’). It can be used to model and describe annotations about anything, not just about webpages, but about anything with a URI (for example anything with an XPointer or XPath) as well as things without a universal identifier (using so called ‘anonymous’ resources). This includes annotations themselves: RDF enables the modeller to attach provenance to statements using its implementation of higher order statements (the ‘reification’ mechanism), which is essential for trust for annotations.

What RDF provides over vanilla XML is its built in node-arc-node model. You can of course describe a set of structures which would describe links and provenance in arbitrary XML, which syntactically might look simpler. However, this would essentially mean inventing a new syntax for RDF, since it would be describing the same model, but using a different syntax. Despite the faults and verbosity of the RDF syntax, there are now many tools which can parse, store and query XML/RDF data, and this advantage is lost with an invented RDF syntax.

The syntax of RDF must be clearly distinguished both from its model and from the storage implementation. MedCERTAIN makes use of the RDF model to describe annotations, but stores the annotations in an SQL database optimised for the data and queries that are made on the data. The XML/RDF syntax is used for transferring the data.

In the case of EARL, the group responsible for EARL took some months in weighing up the options for data representation, and eventually the choice was narrowed down to two models: a proprietary XML schema based model, or an RDF based model. Investigating the trade-offs between the two, it was found that the RDF based solution was more appropriate. The benefit of using the RDF model with respect to the efficiency of the deployment of EARL is clear: there are many generic RDF parser implementations available that can thus be used to handle EARL.

EARL is targeted at a wide range of disparate entities: corporations, Web accessibility organizations, and even the general public, and so it was important that an interoperable model was chosen. Because the RDF model is already very accurately documented, and discussions clarifying the structure have been made over many years, by using RDF, the group was able to forgo the usual operation of deciding upon a generic framework onto which EARL would fit. In other words, for EARL the question was: why invent a data model when RDF provides one? Why repeat the work?

The third major reason for choosing RDF for the EARL data model was that the group behind EARL (a chartered Working Group of the W3C) hopes to use tools related to the Semantic Web activity of the W3C where possible. This was actually a decidedly useful step to take; in the development of the EARL schema, we proved that it was possible to roughly map version 0.9 of the language into version 0.95 using a forwards chaining query/inference engine written in Python (Tim Berners-Lee’s CWM [12]). Although version 0.9 of the language had not been widely deployed, this proved importantly that the EARL model, thanks to it being based on RDF, is evolvable and extensible. Although there is always some trade-off when upgrading a language, Semantic Web technologies make it easier.

EARL has made sure that the model is as syntax independent as possible, and takes a wary standpoint with respect to much-debated model constructs such as reification. EARL does use higher order statements, but these can be expressed as N3 [13] contexts, RDF reification, or something else entirely. What matters is that they are higher-order statements; these will always be around when you have 3-ary relationships, because it is so easy to invent properties.

SUMMARY

EARL and MedCERTAIN are two examples of projects where the agency, the timing, the objects and the meaning of the content of annotations must be clearly defined. The trustworthiness of evaluative applications of annotations such as these depends on the unambiguous identification of the objects of the annotations. Trustworthiness also depends on the possibility of making the annotations distinguishable on the grounds of the agency creating them and the time and date on which they were created.

Both EARL and MedCERTAIN chose to use RDF to model their annotations, for three principal reasons:

• the usefulness of associating objects with URIs as the natural choice for identification of objects on the web;

• the flexibility and extensibility of RDF in creating relationships between arbitrary objects such as agents and dates;
• the ability of RDF to model statements about objects as objects in themselves, enabling the content of an annotation to be made machine-readable.

In addition there are many tools available for processing and storing RDF, and many projects using RDF, enabling interoperability between systems.

RDF is a natural tool for modelling annotations because it is all about describing the properties of objects (such as the title of a webpage, the date changed for this part of a webpage). This includes the ability to describe the properties of the assignment of a property to an object (such as the creator of an annotation). RDF allows the modeling of all the aspects of annotations, such as who made the annotation, when they made it, and what the content is, because it allows both the content of the annotation and the properties of the annotation itself to be modeled in the same system.

ACKNOWLEDGMENTS

Thanks to Dan Brickley and Martin Poulter for reading earlier versions of this paper.

The MedCERTAIN project is funded under the EU Safer Internet Action Plan, and consists of: the University of Heidelberg, Dept. of Clinical Social Medicine; the University of Bristol, Institute for Learning and Research Technology (ILRT); and The Finnish Office for Health Care Technology Assessment (FinOHTA) at the Finnish National Research and Development Centre for Welfare and Health (STAKES).

Thanks to the IMesh toolkit which Libby Miller is partially funded by.

REFERENCES

[1] MedCERTAIN http://www.medcertain.org

[2] EARL http://www.w3.org/2001/03/earl

[3] XPointer http://www.w3.org/TR/xptr

[4] Brickley, D. Nodes and Arcs 1989-1999 http://www.w3.org/1999/11/11-WWWProposal/thenandnow

[5] Koivunen, M. Annotea http://www.w3.org/2001/Annotea

[6] The W3C Collaborative Web Annotation Project ... or how to have fun while building an RDF infrastructure http://www.w3.org/2000/Talks/www9-annotations/Overview.html

[7] Berners-Lee, T. Cool URIs don’t change http://www.w3.org/Provider/Style/URI

[8] Miller, L. Rudolf Project http://ilrt.org/discovery/2000/09/rudolf/

[9] Lassila, O., Swick, R. (eds.) RDF Model and Syntax http://www.w3.org/TR/REC-rdf-syntax/

[10] Lassila, O., Swick, R. (eds.) RDF Model and Syntax Reification http://www.w3.org/TR/REC-rdf-syntax/#higherorder

[11] RDF Interest Group Daily Chump http://rdfig.xmlhack.com

[12] Berners-Lee, T. CWM http://www.w3.org/2000/10/swap/cwm.py

[13] Berners-Lee, T. Primer: Getting into RDF & Semantic Web using N3 http://www.w3.org/2000/10/swap/Primer

Appendix
Sample Rudolf Annotation
Biz/ed
A subject gateway for Economics and Business
Sample MedCERTAIN Annotations
Please note that the annotation schema and namespace have not been finalized.
12-07-01
adult patients or consumers
13-07-01
invalid
13-07-01
Much of the material is not suitable for a lay audience
Sample EARL Annotation
2001-03-17
checking HTML4 dtd content model
And, in N3:
@prefix earl: .
@prefix : .
@prefix rdf: .
:Validator earl:asserts
[ a rdf:Statement; rdf:subject :MyPage;
rdf:predicate earl:fails;
rdf:object :ULTest ];
a earl:Tool, earl:Assertor;
:uri .
:MyPage
earl:testSubject ;
earl:date "2001-03-17" .
:ULTest
earl:test ;
earl:testMode earl:Auto;
earl:purpose "checking html4 dtd content model";
earl:repairInfo
[ earl:expectedResult ] .