=Paper=
{{Paper
|id=Vol-209/paper-9
|storemode=property
|title=SALT: Semantically Annotated LATEX
|pdfUrl=https://ceur-ws.org/Vol-209/saaw06-full06-groza.pdf
|volume=Vol-209
|dblpUrl=https://dblp.org/rec/conf/semweb/GrozaKH06
}}
==SALT: Semantically Annotated LATEX==
SALT: Semantically Annotated LATEX
Tudor Groza Siegfried Handschuh Hak Lae Kim
Digital Enterprise Research Institute
IDA Business Park, Lower Dangan
Galway, Ireland
{tudor.groza, siegfried.handschuh, haklae.kim}@deri.org
ABSTRACT
Machine-understandable data constitutes the basis for the Semantic Desktop. In this paper we provide means to author and annotate Semantic Documents on the Desktop. In our approach, the PDF file format is the basis for semantic documents, which store both a document and the related metadata in a single file. To achieve this we provide a framework, SALT, that extends the LATEX writing environment and supports the creation of metadata for scientific publications. SALT lets the scientific author create metadata while putting together the content of a research paper. We discuss some of the requirements one has to meet when developing such an ontology-based writing environment and we describe a usage scenario.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Miscellaneous; I.2.7 [Artificial Intelligence]: Natural Language Processing; I.7.1 [Document and Text Processing]: Document and Text Editing; I.7.2 [Document and Text Processing]: Document Preparation
General Terms
Semantic Authoring
Keywords
LATEX, semantic annotation, semantic document, authoring
1. INTRODUCTION
The vision of the Semantic Desktop aims at integrated personal information management as well as at information distribution and collaboration. This will be enabled by the use of ontologies, semantic metadata (that is, machine-understandable data) and semantic web protocols. Hence, semantic metadata constitutes the basis for the Semantic Desktop. Authoring and annotating semantic documents on the desktop is one means of creating semantic metadata.
In this paper we provide means to author and annotate Semantic Documents on the Desktop. In our approach, the PDF file format is the basis for semantic documents, which store both a document and the related metadata in a single file. To achieve this we provide a framework, SALT, that extends the LATEX writing environment and supports the creation of metadata for scientific publications. SALT lets the scientific author create metadata while putting together the content of a research paper.
Previous work on the creation of semantic metadata and the annotation of documents has mainly concentrated on the annotation of HTML documents for the semantic web. Most of these HTML annotation tools [14, 26, 5] follow an a-posteriori annotation step: in order to provide metadata about the contents of a web page, the author must first create the content and then annotate it in an additional, a-posteriori, annotation step.
The a-posteriori approach is reasonable when the annotator is not the owner of the web document, which is a common use case on the web. However, a-posteriori annotation puts an additional load on the author when he is also the annotator. A way out of this problem is the possibility to easily combine the authoring of a document with the creation of the metadata describing its content. First steps towards this for HTML documents in the web context are described in [13].
HTML is the document format for the web, and thus research on semantic annotation is centered around it. But an important and dominant format on the desktop is the Portable Document Format. PDF can currently be seen as the de facto standard for electronic publishing, especially in the research area. However, we observed that only a small number of solutions exist for a-posteriori semantic annotation of PDF documents ([7]). Also, to our knowledge, there is no clearly defined approach yet for a priori PDF annotation.
Our approach proposes a method for creating a priori annotations for PDF documents by exploiting the rich environment provided by LATEX. We support the method with a document ontology mapping the internal structure of the document, a rhetorical structure ontology describing the argumentative structure of research papers, and an annotation ontology gluing the annotation to the document and providing additional metadata information. The annotation process takes place while writing, and the actual integration is realized at syntax level by exploiting regular LATEX commands plus the introduction of special annotation commands. The final result is a semantic PDF document encapsulating instances of the aforementioned ontologies.
In the following we describe the preliminaries of this work (Section 2) and sketch a use-case in Section 3. Then, we give an overview (Section 4) of the annotation and publication process. In Section 5, we describe the modularization of the used ontologies and introduce the annotation syntax. Before we conclude, we give an overview of related work and discuss some aspects of our solution.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.
2. PRELIMINARIES
In this Section we provide definitions of important terms we use subsequently and we explain basic design decisions.
2.1 Terminology
• Semantic Document: A semantic document includes any information regarding the document and its relationship with other documents. In our case this is a PDF document enriched with semantic annotations. A Semantic Document can explicitly refer to another document by using ontological relations. For example, document A refers to a claim in document B, by referring to the URI of the claim, and provides counter arguments.
• Semantic Annotation: The term Semantic Annotation describes a process as well as the outcome of the process. Hence it describes i) the process of adding semantic data or metadata to the document given an agreed ontology and ii) the semantic data or metadata itself as a result of this process. In our context semantic annotation is a set of instantiations attached to a PDF document. We distinguish i) instantiations of RDF classes, ii) instantiated properties from one class instance to a datatype instance, also called attribute instances, and iii) instantiated properties from one class instance to another class instance.
• Annotation Ontology: We use this term to denote a vocabulary which relates instances of a document ontology with annotations. The annotations could be instances of an arbitrary ontology. In our case these are instances of either i) the rhetorical structure ontology or ii) a domain ontology associated with the topic of the document (e.g. about biology). The annotation ontology describes what an annotation is and which relations are possible between the subject and the object of annotation. Further, the annotation contains attributes which can be used to describe the metadata of a document, such as author or title of the document (cf. Section 5.1.2).
• Document Structure and Type Ontology: In our context this is an explicit shared formal specification of a document. It contains the document structure, the type, the organization and the relationship between documents and other concepts. We will call this ontology Document Ontology (cf. Section 5.1.1) for short.
• Rhetorical Structure Ontology: We use this term to denote a vocabulary modeling the rhetorical structure of the text (RST) inside a document (cf. Section 5.1.3). RST captures the roles of every part of the text and tries to provide a plausible reason for its presence. RST describes the text on a generic and on a specific level.
– The generic level describes parts of a scientific document such as motivation, background, scenario or contribution. The generic level is thus a modification and extension of the ABCDE format [6]. As opposed to the ABCDE format, we did not have an application for the Annotation and Entities part, since this is covered by our Annotation Ontology. But we missed other parts such as Motivation and Scenario.
– The specific level denotes rhetorical relations, for example, Concession, Circumstance or Means (cf. [22]), and thus allows a fine-grained description of the argumentation in a scientific document.
2.2 HTML, PDF and XMP
While HTML documents offer the possibility of accessing their composing objects, like the text or images, because of their implicitly structured text-based format, the same cannot be said about PDF documents. They have a totally different internal organization, representing a combination of several types of complex objects and streams [17] together with their associated properties. Thus, post-creation analysis of the content depends on a handful of parameters, such as access rights or the accuracy of image analysis and text retrieval algorithms.
A similar situation arises when analyzing the annotation support. HTML documents enable metadata (annotation) storage directly inside them, without the need for complex operations (including instances of ontologies). In the case of PDF documents, this support is split between capturing metadata using a limited set of DublinCore [1] elements in the XMP [16] field and creating annotations in the form of, for example, notes or markups. There is no natural way of embedding instances of ontologies in PDF documents without either changing the document's internal structure, which can be done using the Adobe SDK [15], or re-modeling the XMP field. Our approach follows the second possibility, by encapsulating in the XMP field instances of the document ontology, the annotation ontology and the rhetorical structure ontology, as well as arbitrary annotations of the user.
3. USE-CASE
In the following, we describe a use-case (see footnote 1) that is supported by SALT and that has guided our development of the framework. The use-case requires the generation of metadata given by a PDF document.
The use-case shows how a semantic document enables easy, low-effort information distribution, collaboration and integration for the purpose of innovative online workshop proceedings. The goal is not only to ease the creation of the online proceedings but also to provide added value to the reader of the proceedings, so that the scientific contributions in the papers are easier to read and browse in the online proceedings.
Figure 1: Information workflow in the workshop proceedings publication scenario.
1 The use-case is inspired by discussions with Anita de Waard, see also http://wiki.ontoworld.org/index.php/ABCDEF
The process for the online publication of the accepted workshop papers is usually done manually. The editor typically creates a list
composed of the authors and the titles and links the corresponding PDF document to it (see footnote 2).
However, additional information can easily be retrieved given that each scientific author utilizes the SALT framework for writing his scientific document. SALT enables a combination of automatically retrieved annotation based on i) an analysis of the used LATEX commands, ii) annotation from the user about the rhetorical structure of the document, and iii) arbitrary annotation of the document. Hence, among other things, the semantic metadata will describe the underlying ideas in the paper, which can easily be exploited when presenting the proceedings.
Figure 1 depicts the information workflow in the current scenario. We assume that the accepted papers are enriched with our rhetorical structure ontology; thus we take advantage of it in the first processing phase and generate an individual HTML page for each paper, containing the usual metadata plus the annotations captured by the rhetorical structure. The second phase of the process iterates over all created pages and generates an entry point in the form of an index page.
The index page gives a short overview of all papers, but more information, generated from the metadata, is available. Readers can quickly glance through the contributions and skip to the section they are interested in. For example, the context of each paper is shown, the background and the contribution, but also the individual claims are available.
4. ANNOTATION AND PUBLISHING
Figure 2: Component view.
We implemented SALT and the workshop proceedings publication scenario as two independent modules. The first module creates and embeds the metadata into the document, while the second one uses them to achieve the needed functionality. In Figure 2 we present the organization of the two modules together with the third-party components used. In the following, we detail them separately.
4.1 The SALT Process
The SALT module is responsible for embedding the instances of the mentioned ontologies into the resulting PDF document. In order to reach the final result, a series of processing steps needs to be taken, described as follows.
Syntax analysis and annotation extraction. This first step takes as input the LATEX document, parses it (Parser component) and extracts the annotations present in it (Syntax Analyzer component), based on the three types of syntactic modifications detailed in Section 5.2. The result of this analysis process is a second LATEX document (in an intermediary stage) and two sets of metadata: one which serves the population of the semantic layer, i.e. the ontologies with instances, and the other creating the foundation for the PDF visual notes. Based on the output provided by this step, the following two steps could theoretically be performed in parallel.
PDF notes embedding. The Syntax Transformer component takes the second metadata set (as described above) and, based on its analysis, creates the appropriate PDF visual notes by making use of the special commands provided by LATEX. All the annotations are then introduced into the LATEX intermediary document in their original positions (extracted together with the original annotations).
Annotation analysis and ontology instantiation. In parallel to the previous step, the first metadata set is also analyzed. In this case, the focus is on the N3-like statements introduced in the usual LATEX commands and on the commands pointing to the rhetorical structure of the document. Using the Syntax Analyzer in combination with Jena's [4] N3 to RDF transformer, the result of this step is the creation of the appropriate instances of the ontologies, in RDF format.
Final PDF document compilation. This final step has as input the LATEX intermediary document and the instances created in the previous step. Its goal is to combine a PdfLatex compiler (in our case MiKTeX, see footnote 3) with the XMP LATEX package [19] and transform the input set from LATEX to PDF. The resulting PDF document will have incorporated the instances of the two ontologies and the visual notes.
The whole module is packed as a stand-alone component and it can be used from a command line interpreter or integrated as a library in a writing environment.
4.2 The Publishing Process
The publishing module takes as input a PDF document, or a list of PDF documents, and provides as output one or several HTML documents. The transformation process contains the following steps:
• extraction of the instances of the ontologies embedded in the PDF document(s)
• interpretation of the extracted metadata
• creation of the HTML documents based on some preferences expressed by the user
The first step is realized using the BFO PDF library (see footnote 4), which provides the means for metadata extraction from PDF files. The resulting stream is passed to the Metadata Extractor component, which separates the instances of the document, annotation and rhetorical ontologies and prepares them for interpretation.
For publication, the user can specify a series of parameters dealing with visual aspects of the publication, like font sizes, positioning or color, and with content aspects, for example, which annotated parts (or metadata) should be published. All these preferences are taken into account when interpreting the extracted instances and applied during the creation of the HTML documents. The whole process is iterative, starting from the first specified file to the last one. The finishing touch done by the HTML Builder is the creation of the index file pointing to all previously created HTML documents.
2 For examples see the workshop online proceedings at CEUR-WS.org
3 http://www.miktex.org/
4 http://big.faceless.org/products/pdf/
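The publishing steps of Section 4.2 can be sketched roughly as follows. This is a minimal illustration and not the SALT implementation: the extraction step is replaced by metadata that is assumed to be already available as Python dictionaries (the paper uses the BFO PDF library for extraction), and every name in the sketch (`Preferences`, `build_page`, `build_index`, `publish`, the dictionary keys and the output file names) is hypothetical.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Preferences:
    """Hypothetical user preferences: one visual and one content aspect."""
    font_size: str = "12pt"
    published_parts: list = field(default_factory=lambda: ["Contribution"])


def build_page(meta: dict, prefs: Preferences) -> str:
    """Interpret one paper's (already extracted) metadata and render a page."""
    parts = [
        f"<h2>{escape(name)}</h2><p>{escape(text)}</p>"
        for name, text in meta["rhetorical"].items()
        if name in prefs.published_parts  # content preference
    ]
    return (
        f'<html><body style="font-size:{prefs.font_size}">'
        f"<h1>{escape(meta['title'])}</h1>"
        f"<p>{escape(', '.join(meta['authors']))}</p>"
        + "".join(parts)
        + "</body></html>"
    )


def build_index(titles: dict) -> str:
    """Create the index page pointing to all previously created pages."""
    items = "".join(
        f'<li><a href="{escape(fname)}">{escape(title)}</a></li>'
        for fname, title in titles.items()
    )
    return f"<html><body><ul>{items}</ul></body></html>"


def publish(extracted: list, prefs: Preferences) -> dict:
    """Process papers from the first to the last, then add the index page."""
    output, titles = {}, {}
    for i, meta in enumerate(extracted, start=1):
        fname = f"paper{i}.html"
        output[fname] = build_page(meta, prefs)
        titles[fname] = meta["title"]
    output["index.html"] = build_index(titles)
    return output
```

Calling `publish` with a list of metadata dictionaries returns a mapping from file name to HTML text. Keeping the preferences in a separate object mirrors the description above, where visual and content choices are applied while the extracted instances are interpreted.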
5. THE SYNTACTIC AND SEMANTIC LAYERS
As briefly discussed in Section 2.2, one can embed annotations in PDF documents by filling the XMP field with DublinCore metadata elements or by making use of notes, bookmarks or markups. We propose a method for the creation of Semantic Documents by exploiting and extending the two aforementioned approaches. The actual transformation combines two interlinked layers: a semantic layer and a syntactic layer.
The semantic layer consists of the three ontologies: the document ontology, the annotation ontology and the rhetorical structure ontology (cf. Section 2.1). The metadata based on these ontologies is placed in the XMP field, thus extending the regular DublinCore elements of a PDF document.
The syntactic layer proposes the enrichment of the LATEX syntax with i) an analysis of the used commands, ii) the provision of additional commands and iii) arbitrary annotation of the document based on N3 statements. This level has the goal of creating a semantic bridge between the actual document and its metadata.
The motivation for introducing these two layers lies in the need for a much richer platform for embedding semantic annotations, which should also profit from the visual impact offered by the usual PDF annotation means. In the following we detail both the semantic and syntactic layers.
5.1 The semantic layer
The goal of the semantic layer is to define a proper semantic
framework supporting the entire annotation process. We used three
levels, each level represented by an ontology:
Document structure level, capturing the ordinary structure of the document.
Annotation level, creating the bridge between the rhetorical structure and the ordinary structure. It also captures additional metadata about the document.
Rhetorical level, which models the document in terms of rhetorical elements and builds its rhetorical structure.
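To make the three levels concrete, the sketch below models them as plain Python records. It is only an illustration under assumed names: `Sentence`, `RhetoricalElement`, `Annotation` and their fields are hypothetical and are not the classes of the SALT ontologies. It shows a document-level part, a rhetorical element, and an annotation that bridges the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Document structure level: an addressable part of the text."""
    sentence_id: str
    text: str


@dataclass(frozen=True)
class RhetoricalElement:
    """Rhetorical level: for example a claim or an explanation."""
    element_id: str
    kind: str


@dataclass(frozen=True)
class Annotation:
    """Annotation level: links a document part to a rhetorical element."""
    target: Sentence
    body: RhetoricalElement


def elements_for(sentence: Sentence, annotations: list) -> list:
    """Return the rhetorical elements attached to one sentence."""
    return [a.body for a in annotations if a.target == sentence]


# Hypothetical sample data, not taken from the paper.
s1 = Sentence("s1", "An example sentence.")
claim = RhetoricalElement("c1", "Claim")
annotations = [Annotation(target=s1, body=claim)]
print(elements_for(s1, annotations))
```

The annotation record holds references to both other levels, which is one simple way to express the bridging role described for the annotation level.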
An overall image of the organization of the semantic layer is presented in Figure 3. In the following we detail each of the three ontologies.
Figure 3: The internal organization of the semantic layer.
5.1.1 The Document Ontology
The document ontology, depicted in Figure 4, captures the structural layout of the document and maintains instances of the annotated parts of the document. This represents an intermediary solution, until we are able to use the XPointer framework [11] (cf. Section 7).
The motivation behind this level of decomposition is given by the need to instantiate the annotated parts of the text. Also, the sentence represents at the moment the finest granularity for creating annotations and the reference base for the construction of the rhetorical structure. As an example, a populated instance of the document ontology will contain instances for all the words annotated during the writing process.
Figure 4: The document ontology.
5.1.2 The Annotation Ontology
As mentioned before, the main role of the annotation ontology (Figure 5) is to relate the document ontology and the rhetorical structure ontology. Conceptually, the rhetorical structure represents an annotation of the ordinary structure. Thus, one is able to enrich the document with rhetoric elements by attaching semantic annotations to it. In ontological terms, this translates to creating instances of the Annotation concept and attaching them to the appropriate parts of the text.
A second role of the ontology is to provide metadata about the publication as a whole. This part can be seen as an alignment to the DublinCore initiative, showing also our support for it. Each of the concepts that are part of the metadata has a direct correspondence in a DublinCore element. For the future, we intend to maintain this alignment by extending the ontology in parallel with the evolution of the DublinCore schema.
Figure 5: The annotation ontology.
5.1.3 The Rhetorical Structure Ontology
The rhetorical structure ontology represents a union between the knowledge captured by the rhetorical relations created between some parts of the text, the rhetorical structure modeling the positioning of the contained information chunks, and the argumentative support providing the means for building a stable foundation for the rhetoric elements. In the following, we analyze the three mentioned sides of the ontology.
The first side of the ontology deals with modeling the information chunks present in the document as rhetoric elements. This approach has its roots in the Rhetoric Structure of the Text (RST) theory [22], which describes the text in terms of the rhetoric relations existing between a Nucleus (modeled by us as the Claim) and a Satellite (in our case, the Explanation). Although the theory contains around 30 such relations, we considered only the ones which have a bigger impact (and relevance) when annotating a scientific document (e.g. Antithesis, Concession or Means). The main role of these rhetoric relations (modeled by us as concepts) is to provide a reason for the existence of the claims and the explanations in the text. Furthermore, we considered their placement in the frame created by the rhetorical structure (captured by the second side of the ontology) as a natural integration, and thus we introduced a relation between the rhetorical relation concept and the rhetorical structure concept.
Figure 6: The Rhetorical Structure Ontology.
The second side of the Rhetorical Structure Ontology takes care of capturing the rhetorical structure of the document. It represents an extension of the ABCDE format proposed in [6], which stands for: Annotation, Background, Contribution, Discussion, Entities.
As a starting point, this organization reflects a good image of a typical scientific document. But we argue that it is not enough. Therefore we propose its modification and extension with a small number of concepts, giving birth to a comprehensive rhetorical structure which could be adopted for all scientific documents.
The modification is the replacement of the Annotation concept with the Abstract concept, since the whole rhetorical structure represents in the end an annotation of the document. In terms of extension, we propose the introduction of the following concepts: Motivation, Scenario and Conclusion. These have rhetorical relations as their foundation, but we considered that by using them as concepts of the rhetorical structure we are able to model a complete best-practice structure for scientific documents.
The two sides of the ontology described above are part of the a priori annotation process. The third side deals with the discussions in terms of Arguments and CounterArguments that can be initiated based on the existing claims. The motivation relies on
building a stable foundation for the claims by augmenting them with positive and negative argumentations. Therefore, we have foreseen the need for a posteriori annotations modeling these discussions and provided as part of the Rhetorical Structure Ontology the Argument and CounterArgument concepts, together with their subconcepts and relations.
In order to have a full understanding of how the result of the annotation process looks from the Rhetorical Structure Ontology point of view, we provide in Figure 7 an example of instantiation. The example shows how a part of the text can be modeled in terms of rhetorical elements, and how the rhetorical relations can be created.
Consider the given phrase: ... the visual system resolves confusion by applying some tricks that reflect a built-in knowledge of properties of the physical world. The writer splits it into the Claim and the Explanation, and therefore instantiates two rhetorical elements, which can be further identified by their unique associated ID. Now, based on the definition of the Means rhetorical relation (see footnote 5), the writer can make the reader aware of it, and thus emphasize his idea, by creating an instance of the concept modeling this relation. This instance is then linked with the appropriate concept from the rhetorical structure, exemplified in this case by Contribution. In terms of argumentative discussions, the example shows how the claim can afterwards be linked to instances of positive or negative arguments and how the counter argument instances are modeled in relation to the initial arguments.
5 The Means rhetorical relation states that the Explanation presents a method or instrument which tends to make realization of the Claim more likely.
5.2 The syntactic layer
The second layer introduced for embedding annotations into the PDF documents is the syntactic layer. Since we are targeting a priori annotations, created manually during the writing process, our approach proposes the enrichment of the LATEX syntax in three ways:
• through command syntax extension
• by embedding N3-like statements in usual commands
• by introducing new commands
Our goal for this modified syntax structure is to have a lightweight form, as close as possible to the usual one, in order to avoid overwhelming ordinary users. Therefore, the first two types of modification, i.e. command syntax extension and N3-like statements integration, maintain the syntactical core of the command, while the third one introduces simple new commands having a syntax similar to the usual ones. The resulting mixture of commands has the most natural LATEX form possible.
Command syntax extension. The syntax extension process was developed for the commands which have as main goal the structuring of the document. Therefore, commands like abstract, section or subsection were extended with a new field meant for assigning comments (free text annotation) to the corresponding part of the document. The field is delimited by a pair of curly brackets.
Example: section{Introduction}{[...]}
N3-like statements integration. This second type of modification is the usage of N3-like statements in conjunction with LATEX commands. These statements model information about the document as subject, or about an arbitrary subject. This enrichment was inspired by the N3 notation [2] and we believe that it represents a lightweight and easy enough notation to be adopted for creating semantic annotations in a scientific document.
New commands. In order to be able to manually annotate the document with the rhetorical structure, we introduced a series of new commands, similar to the usual LATEX ones. For example, \Background or \Motivation. All the newly introduced commands also support the extension described above.
Figure 8 depicts the result of the overall annotation process using SALT. The first operation is the parsing of the existing LATEX document and the metadata extraction from the usual commands and the N3-like statements. In the figure these are represented by the Author and Title commands and the N3 statements about the topic of the paper, having as foundation the SWRC ontology.
The second operation is the instantiation of the Document Ontology (also presented in the figure), based on the document's structural information. Next, SALT analyzes the command extensions (like the one for the Use-case section in the figure) and the newly introduced commands and environments (like the claim, explanation or scenario environment). It builds the rhetorical structure based on them and represents it as an RDF graph. Note that each rhetoric element has a label attached (here, c1, e1 and p1, respectively) for the purpose of future referencing.
The final step is embedding the necessary information in the PDF document for the creation of the visual notes. The example shows the visual note attached to the Use-case section and the visual notes representing the beginning and the end of the Scenario rhetorical branch. The latter also contains the information about the rhetorical relations found as part of this branch.
In general, the three main phases of the overall process are:
• The creation of the semantic annotations and thus the document enrichment during the authoring process.
• The ontology instantiation from the created annotations, together with the creation of the semantic links between the three levels of the semantic layer.
• The visual representation of some of the annotations in the resulting PDF document.
In conclusion, we make a short analysis of the modifications. In the first case, switching from the usual LATEX commands to the extended ones by adding the annotation field should be straightforward, and should be considered an enrichment rather than a source of confusion for ordinary users. The second modification, i.e. the introduction of the N3-like statements, enables the author to insert arbitrary annotations. The last category of modifications, represented by the addition of new commands, was necessary in order to represent the rhetorical structure of the document.
6. RELATED WORK
To ease the reasoning about or retrieval of documents published on the Desktop or Web, the documents should be classified in a way that users find helpful and meaningful. There exist several activities focused on semantic annotation as a way to enrich a document, making it machine-readable and also accessible to humans. The Writing in the Context of Knowledge (WiCK) project aims to produce a novel writing tool to help authors improve the coherence and consistency of the documents they are creating by helping to assimilate key knowledge in each new document [3]. CREAM is a comprehensive framework which is specialized for populating HTML
Figure 7: An example of instantiation of the Rhetorical Structure Ontology.
pages with ontological concepts. It allows authors to build documents by dragging and dropping concepts and properties from the ontology browser into a text editor [13].
Most of these efforts propose their own semantic structure based on ontologies. Ontological structures provide not only the fundamental values for semantic annotation, but also additional possibilities such as inferencing or semantic retrieval [20]. The Semantic Web Research Community (SWRC) ontology originates from OntoWeb and can be used to provide detailed information about research work. It models the Semantic Web research community, including researchers, publications, tools, and topics.
Generally speaking, semantic documents include any information regarding the document and its relationship with other documents [12]. Therefore, a semantic annotation of documents formally identifies concepts and relations between concepts in documents, and is intended primarily for use by machines [24]. There are several efforts related to semantic documents, such as SemanticWord [23], OntoOffice [10], and SemTalk [8]. Eriksson [7] proposes the PDF backend approach, which uses PDF as the basis for a Protege storage backend. It allows users to store ontologies and knowledge bases inside PDF files. In some previous work, however, ontological information or metadata would exist in a different place than the document itself. XMP is a format for embedding knowledge in documents [3]. Adobe's XMP [16] is a labeling technology that allows RDF constructs to be embedded in HTML, PDF documents and all Adobe formats.
In terms of the rhetorical structure of the text, [21] provides a deep analysis of the application domains in which it is used, e.g. computational linguistics, cross-linguistic studies or dialogue and multimedia. From our perspective, the work done by Geurts et al. [9] and Uren et al. [25] seems interesting, because they are among the only references, to our knowledge, which try to model the rhetorical structure as an ontology.
[25] describes a framework for sensemaking tools in the context of the Scholarly Ontologies Project. Their starting point is represented by the requirements for a discourse ontology, having as foundation the structure of the claim. The resulting ontology finds its roots in the CCR (Cognitive Coherence Relations) Theory and models the rhetorical links in terms of similarity, causality or challenges. Their goal is to create and visualize claim networks from scholarly documents (represented as HTML files) using a central knowledge server. One of our future goals is also to create such knowledge networks, but using active references embedded in the semantic document, as opposed to their central approach. Also, our solution places the annotations in their natural environment, i.e. as part of the document to which they are attached, thus transforming it into a semantic document.
The second interesting reference mentioned was [9]. It models the process of transforming semantic graphs into multimedia presentations, using domain knowledge and discourse analysis. Their work focuses more on using parts of the text for presentation purposes, whereas ours provides a method for enriching normal documents with semantic annotations, also based on discourse analysis.
In this paper, we propose the document ontology to express much richer semantics in documents, including an extension of the ABCDE format [6] for the semantic structure of the document. From a representational and technical perspective, our approach differs from other approaches in that ontologies support more sophisticated modeling for specifying relations of scientific documents. Moreover, an embedding technology using XMP provides efficient sharing support, which makes it possible to share information about the document itself.

7. DISCUSSION
In this section we raise some of the most interesting issues that appeared while researching the concepts presented in the current paper. Although the list could be much longer, we restrict ourselves to two of them: i) document instance maintenance and ii) object identification and reference, the latter also being a source of problems for the former.
Our current approach solves the document instance maintenance issue by creating an instance for every annotated information chunk, the finest granularity being the word. The main reason is the (general) lack of a proper reference mechanism inside the PDF document, especially when created from LaTeX. Analyzing this solution, we could argue that it presents a possible advantage for future development but at the same time also a quite clear disadvantage. The advantage consists in the possibility of representing whole documents as instances of the document ontology and then using the instances for versioning purposes and semantic diff operations. Obviously, the semantics of the diff operation first has to be defined as part of a proper context, maybe in a similar way as realized in [27]. The disadvantage of this approach is the explosion in the size of the document, considering the number of triples that need to be created for each word.
The second issue deals with object identification and reference. PDF documents have an internal organization represented by tree-based structures of complex objects and streams [17], together with their associated properties. Post-creation analysis of the document, and thus the reconstruction of this internal organization, represents a hard task, due to the dependency on a handful of parameters, such as access rights, image analysis or the accuracy of text retrieval algorithms. As a consequence, object referencing inside the document also becomes hard to accomplish.
Figure 8: The result of the annotation process using SALT.
On the other hand, we are dealing with a priori annotations, which makes the situation even more complex. The annotation process takes place during the writing process, in the LaTeX environment, and thus the targeted PDF document does not even exist yet. Still, to be able to reference the annotated parts of the document, we adopted the following solution: the document structure was captured in the document ontology, giving us the means of referencing information chunks at sentence granularity. For referencing inside the sentence (word granularity) we introduced a base and an offset, pointing to the needed part of the sentence.
As a future improvement of this process, i.e. reference inside the document, we intend to build a DOM-like model (or a B-Tree model) of the LaTeX document and map its structure to the tree-based internal structure of the PDF document. This approach would give us the following advantages:

• In terms of identification, we would be able to provide a unique identification for each information chunk, in the context of the document.

• In terms of reference, we would have the opportunity of using the XPointer framework [11] in conjunction with the document's model.

The combination of the two aforementioned issues could start a new direction for creating semantic knowledge networks using information chunks from documents, by means of active references, rather than the existing static links. One would be able to directly embed a certain information object, or discuss a certain claim, in her scientific document, by providing only its active reference. The resulting semantic network tends to come close to Ted Nelson's Xanadu vision [18].

8. CONCLUSION
In this paper we have described the authoring and annotation of semantic documents to provide semantic annotation for the desktop. SALT leaves semantic data where it can be handled best, together with the document. Thus SALT provides LaTeX authors with a comparatively simple and intuitive means to create Semantic Documents.
To attain this objective we have defined a SALT process, the appropriate ontologies and the architecture. We have incorporated the means for rhetorical markup of a document that allows, for example, the scientific author to explicitly mark up his contribution and
the claims he made and the support for these claims. This explicit annotation provides, as shown in our scenario, an innovative and improved presentation and navigation of online proceedings. Furthermore, it will enable other authors to explicitly and directly reference these claims and other related information. In the end this will lead to interconnected Semantic Documents.
For the future, there is a long list of open issues concerning the authoring of semantic PDF documents, from the more mundane, though important ones (top) to far-reaching ones (bottom):

1. PDF referencing, as we described it in Section 7.

2. Creation of semantic knowledge networks using PDF documents, by active references, also introduced in Section 7.

3. Automatic derivation of markup.

4. Other information structures (or formats), for example, incorporating not only the annotations created on the text, but also the ones created for the pictures that are part of the Semantic Document.

We believe that these options make SALT a rather intriguing approach on which a considerable amount of scientific semantic documents might be built.

Acknowledgments
This work is funded by the European Commission 6th Framework Programme in context of the EU IST NEPOMUK IP - The Social Semantic Desktop Project, FP6-027705. Special thanks to Big Faceless Organization (big.faceless.org) for providing the PDF library used in the metadata analysis process. Further we thank Anita de Waard for fruitful discussions at ISWC 2005 and ESWC 2006.

9. REFERENCES
[1] DublinCore Metadata Initiative. http://dublincore.org/.
[2] Tim Berners-Lee. An readable language for data on the web - notation 3, 1998. http://www.w3.org/DesignIssues/Notation3.
[3] L. Carr, T. Miles-Board, G. Wills, A. Woukeu, and W. Hall. Towards a Knowledge-Aware Office Environment. In D. Karagiannis and U. Reimer, editors, Proceedings of 5th International Conference on Practical Aspects of Knowledge Management (PAKM 2004), volume LNAI 3336, pages 129–140, 2004.
[4] J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: Implementing the Semantic Web Recommendations. Technical Report HPL-2003-146, Hewlett-Packard, Dec 2003. http://www.hpl.hp.com/techreports/2003/HPL-2003-146.html.
[5] Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli, and Yorick Wilks. User-System Cooperation in Document Annotation Based on Information Extraction. volume 2473, pages 122+, January 2002.
[6] Anita de Waard and Gerard Tel. The ABCDE format - Enabling Semantic Conference Proceeding. In Proceedings of 1st Workshop: "SemWiki2006 - From Wiki to Semantics", Budva, Montenegro, 2006.
[7] Henrik Eriksson. A PDF Storage Backend for Protege. In Proceedings of the 9th Protege International Conference, Stanford, California, USA, 2006.
[8] C. Fillies, G. Wood-Albrecht, and F. Weichardt. A Pragmatic Application of the Semantic Web using SemTalk. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, USA, pages 686–692, 2002.
[9] Joost Geurts, Stefano Bocconi, Jacco van Ossenbruggen, and Lynda Hardman. Towards Ontology-driven Discourse: From Semantic Graphs to Multimedia Presentations. Technical report, Centrum voor Wiskunde en Informatica (INS-R0305), May 31, 2003.
[10] Ontoprise GmbH. OntoOffice Tutorial, 2003. http://www.ontoprise.de/documents/tutorial ontooffice.pdf.
[11] P. Grosso, E. Maler, J. Marsh, and N. Walsh. XPointer element() Scheme, 2003. http://www.w3.org/TR/xptr-element/.
[12] W. Guoren, W. Bin, H. Donghong, and Q. Baiyou. Design and Implementation of a Semantic Document Management System. Information Technology Journal 4, 1:21–31, 2005.
[13] S. Handschuh and S. Staab. Authoring and Annotation of Web Pages in CREAM. In Proceedings of the 11th International World Wide Web Conference, WWW 2002, Honolulu, Hawaii, May 7-11, 2002, pages 462–473. ACM Press, 2002.
[14] S. Handschuh, S. Staab, and A. Maedche. CREAM - Creating Relational Metadata with a Component-Based, Ontology-Driven Annotation Framework. In Proceedings of the First International Conference on Knowledge Capture (K-Cap 2001), pages 76–83, Victoria, B.C., Canada, October 2001. ACM Press.
[15] Adobe Systems Incorporated. Adobe Acrobat SDK. http://partners.adobe.com/public/developer/acrobat/sdk/index.html.
[16] Adobe Systems Incorporated. Extensible Metadata Platform. http://www.adobe.com/products/xmp/.
[17] Adobe Systems Incorporated. PDF Reference - Adobe Portable Document Format, April 2004. http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf.
[18] Ted Nelson. Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow's intellectual... including knowledge, education and freedom. Mindful Press, Sausalito, California, 1981 edition: ISBN 089347052X, 1981.
[19] Maarten Sneep. The XMP inclusion package, 2005.
[20] S. Staab, A. Maedche, and S. Handschuh. An annotation framework for the semantic web. In Proceedings of the First Workshop on Multimedia Annotation, Tokyo, Japan, January 30-31 2001.
[21] Maite Taboada and William C. Mann. Applications of Rhetorical Structure Theory. Discourse Studies, 8, No. 4:567–588, 2006.
[22] Maite Taboada and William C. Mann. Rhetorical Structure Theory: looking back and moving ahead. Discourse Studies, 8, No. 3:423–459, 2006.
[23] Marcello Tallis. Semantic Word Processing for Content Authors. In Proceedings of the Knowledge Markup & Semantic Annotation Workshop, Florida, USA, Part of the Second International Conference on Knowledge Capture, K-CAP 2003, 2003.
[24] Victoria Uren, Philipp Cimiano, Jos Iria, Siegfried Handschuh, Maria Vargas-Vera, Enrico Motta, and Fabio Ciravegna. Semantic Annotation for Knowledge Management: Requirements and a Survey of the State of the Art. Journal of Web Semantics 4, 1:14–28, 2006.
[25] Victoria Uren, Simon Buckingham Shum, Gangmin Li, and Michelle Bachler. Sensemaking Tools for Understanding Research Literatures: Design, Implementation and User Evaluation. Int. Jnl. Human Computer Studies, 64, No.5:420–445, 2006.
[26] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. In EKAW02, 13th International Conference on Knowledge Engineering and Knowledge Management, LNCS/LNAI 2473, pages 379–391, Sigüenza, Spain, October 2002. Springer.
[27] Max Voelkel and Tudor Groza. SemVersion: RDF-based Ontology Versioning System. In Proceedings of the IADIS International Conference WWW/Internet (ICWI 2006), Murcia, Spain, 2006.
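
The word-level referencing scheme from Section 7 (a sentence-level reference plus a base and an offset) can be illustrated with a minimal Python sketch. This is not the SALT implementation: the names ChunkRef and resolve, the sentence identifier format, and the reading of base as the index of the first referenced word and offset as the number of words covered are all assumptions made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical reference to a part of an annotated sentence."""
    sentence_id: str  # identifier of the sentence instance (assumed naming)
    base: int         # index of the first referenced word (assumption)
    offset: int       # number of words covered, starting at base (assumption)


def resolve(ref: ChunkRef, sentences: dict[str, str]) -> str:
    """Return the words of the sentence that the reference points to."""
    words = sentences[ref.sentence_id].split()
    if ref.base < 0 or ref.offset < 1 or ref.base + ref.offset > len(words):
        raise ValueError("reference lies outside the sentence")
    return " ".join(words[ref.base:ref.base + ref.offset])


# Sentence-granularity chunks, keyed by a made-up identifier.
sentences = {"sec1.par2.s3": "SALT lets the author create metadata while writing"}

# Word-granularity reference: three words starting at word index 3.
ref = ChunkRef("sec1.par2.s3", base=3, offset=3)
print(resolve(ref, sentences))  # prints: author create metadata
```

Under these assumptions a reference stays valid only as long as the sentence text does not change, which is one reason a stable, model-based identification of information chunks is listed as future work.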