=Paper= {{Paper |id=Vol-209/paper-9 |storemode=property |title=SALT: Semantically Annotated LATEX |pdfUrl=https://ceur-ws.org/Vol-209/saaw06-full06-groza.pdf |volume=Vol-209 |dblpUrl=https://dblp.org/rec/conf/semweb/GrozaKH06 }} ==SALT: Semantically Annotated LATEX== https://ceur-ws.org/Vol-209/saaw06-full06-groza.pdf
                                  SALT: Semantically Annotated LaTeX




                        Tudor Groza                                Siegfried Handschuh                           Hak Lae Kim
                                                     Digital Enterprise Research Institute
                                                      IDA Business Park, Lower Dangan
                                                                 Galway, Ireland
                                           {tudor.groza, siegfried.handschuh, haklae.kim}@deri.org



ABSTRACT
Machine-understandable data constitutes the basis for the Semantic Desktop. In this paper we provide means to author and annotate Semantic Documents on the Desktop. In our approach, the PDF file format is the basis for semantic documents, which store both a document and the related metadata in a single file. To achieve this we provide a framework, SALT, that extends the LaTeX writing environment and supports the creation of metadata for scientific publications. SALT lets the scientific author create metadata while putting together the content of a research paper. We discuss some of the requirements one has to meet when developing such an ontology-based writing environment and we describe a usage scenario.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Miscellaneous; I.2.7 [Artificial Intelligence]: Natural Language Processing; I.7.1 [Document and Text Processing]: Document and Text Editing; I.7.2 [Document and Text Processing]: Document Preparation

General Terms
Semantic Authoring

Keywords
LaTeX, semantic annotation, semantic document, authoring

1.     INTRODUCTION
   The vision of the Semantic Desktop aims at integrated personal information management as well as at information distribution and collaboration. This will be enabled by the use of ontologies, semantic metadata (that is, machine-understandable data) and semantic web protocols. Hence, semantic metadata constitutes the basis for the Semantic Desktop. Authoring and annotating semantic documents on the desktop is one means of creating semantic metadata.
   In this paper we provide means to author and annotate Semantic Documents on the Desktop. In our approach, the PDF file format is the basis for semantic documents, which store both a document and the related metadata in a single file. To achieve this we provide a framework, SALT, that extends the LaTeX writing environment and supports the creation of metadata for scientific publications. SALT lets the scientific author create metadata while putting together the content of a research paper.
   Previous work on the creation of semantic metadata and the annotation of documents has mainly concentrated on the annotation of HTML documents for the semantic web. Most of these HTML annotation tools [14, 26, 5] follow an a-posteriori annotation step: in order to provide metadata about the contents of a web page, the author must first create the content and then annotate it in an additional, a-posteriori, annotation step.
   The a-posteriori approach is reasonable when the annotator is not the owner of the web document, as is commonly the case on the web. However, a-posteriori annotation puts an additional load on the author when the author is also the annotator. A way out of this problem is to make it easy to combine the authoring of a document with the creation of the metadata describing its content. First steps towards this for HTML documents in the web context are described in [13].
   HTML is the document format for the web, and thus research on semantic annotation is centered around it. However, an important and dominant format on the desktop is the Portable Document Format. PDF can be seen at the moment as the de facto standard in terms of electronic publishing, especially in the research area. Nevertheless, we observed that there exists only a small number of solutions for a-posteriori semantic annotation of PDF documents ([7]). Also, to our knowledge, there is no clearly defined approach yet for a priori PDF annotation.
   Our approach proposes a method for creating a priori annotations for PDF documents by exploiting the rich environment provided by LaTeX. We support the method with a document ontology mapping the internal structure of the document, a rhetorical structure ontology describing the argumentative structure of research papers, and an annotation ontology gluing the annotation to the document and providing additional metadata information. The annotation process takes place while writing, and the actual integration is realized at the syntax level by exploiting regular LaTeX commands plus the introduction of special annotation commands. The final result is a semantic PDF document encapsulating instances of the aforementioned ontologies.
   In the following we describe the preliminaries of this work (Section 2) and sketch a use-case (Section 3). Then, we give an overview of the annotation and publication process (Section 4). In Section 5, we describe the modularization of the used ontologies and introduce the annotation syntax. Before we conclude, we give an overview of related work and discuss some aspects of our solution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

2.     PRELIMINARIES
   In this section we provide definitions of important terms we use subsequently and we explain basic design decisions.

2.1     Terminology

     • Semantic Document: A semantic document includes any information regarding the document and its relationship with other documents. In our case this is a PDF document enriched with semantic annotations. A Semantic Document can explicitly refer to another document by using ontological relations. For example, document A refers to a claim in document B, by referring to the URI of the claim, and provides counter-arguments.
     • Semantic Annotation: The term Semantic Annotation describes a process as well as the outcome of the process. Hence it describes i) the process of adding semantic data or metadata to the document given an agreed ontology and ii) the semantic data or metadata itself as a result of this process. In our context a semantic annotation is a set of instantiations attached to a PDF document. We distinguish i) instantiations of RDF classes, ii) instantiated properties from one class instance to a datatype instance, also called attribute instances, and iii) instantiated properties from one class instance to another class instance.
     • Annotation Ontology: We use this term to denote a vocabulary which relates instances of a document ontology with annotations. The annotations could be instances of an arbitrary ontology. In our case these are either instances of i) the rhetorical structure ontology or ii) a domain ontology associated with the topic of the document (e.g. about biology). The annotation ontology describes what an annotation is and which relations are possible between the subject and the object of annotation. Further, the annotation contains attributes which are usable to describe the metadata of a document, such as the author and title of the document (cf. Section 5.1.2).
     • Document Structure and Type Ontology: In our context this is an explicit shared formal specification of a document. It contains the document structure, the type, the organization and the relationship between documents and other concepts. We will call this ontology the Document Ontology (cf. Section 5.1.1) for short.
     • Rhetorical Structure Ontology: We use this term to denote a vocabulary modeling the rhetorical structure of the text (RST) inside a document (cf. Section 5.1.3). RST captures the roles of every part of the text and tries to provide a plausible reason for its presence. RST describes the text on a generic and on a specific level.
          – The generic level describes parts of a scientific document such as motivation, background, scenario or contribution. The generic level is thus a modification and extension of the ABCDE format [6]. As opposed to the ABCDE format, we did not have an application for the Annotation and Entities parts, since these are covered by our Annotation Ontology. But we missed other parts such as Motivation and Scenario.
          – The specific level denotes rhetorical relations, for example Concession, Circumstance or Means (cf. [22]), and thus allows a fine-grained description of the argumentation in a scientific document.

2.2    HTML, PDF and XMP
   While HTML documents offer the possibility of accessing their composing objects, like the text or images, because of their implicitly structured text-based format, the same cannot be said about PDF documents. They have a totally different internal organization, representing a combination of several types of complex objects and streams [17] together with their associated properties. Thus, post-creation analysis of the content depends on a handful of parameters, such as access rights or the accuracy of image analysis and text retrieval algorithms.
   A similar situation is found when analyzing the annotation support. HTML documents enable metadata (annotations, including instances of ontologies) to be stored directly inside them, without the need for complex operations. In the case of PDF documents, this support is split between capturing metadata using a limited set of DublinCore [1] elements in the XMP [16] field, and creating annotations in the form of, for example, notes or markups. There is no natural way of embedding instances of ontologies in PDF documents without either changing the document's internal structure, which can be done using the Adobe SDK [15], or re-modeling the XMP field. Our approach follows the second possibility, by encapsulating in the XMP field instances of the document ontology, the annotation ontology and the rhetorical structure ontology, as well as arbitrary annotations of the user.

3.    USE-CASE
   In the following, we describe a use-case¹ that is supported by SALT and that has guided our development of the framework. The use-case requires the generation of metadata given by a PDF document.
   The use-case shows how a semantic document enables easy, low-effort information distribution, collaboration and integration for the purpose of innovative online workshop proceedings. The goal is not only to ease the creation of the online proceedings but also to provide added value to the reader of the proceedings, in such a way that the scientific contributions in the papers are easier to read and browse in the online proceedings.

¹ The use-case is inspired by discussions with Anita de Waard, see also http://wiki.ontoworld.org/index.php/ABCDEF

Figure 1: Information workflow in the workshop proceedings publication scenario.

   The process for the online publication of the accepted workshop papers is usually done manually. The editor typically creates a list
composed of the authors and the titles and links the corresponding PDF document to it².
   However, additional information can easily be retrieved, provided that each scientific author uses the SALT framework for writing his scientific document. SALT combines i) automatically retrieved annotation based on an analysis of the used LaTeX commands, ii) annotation from the user about the rhetorical structure of the document, and iii) arbitrary annotation of the document. Hence, among other things, the semantic metadata will describe the underlying ideas in the paper, which can easily be exploited when presenting the proceedings.
   Figure 1 depicts the information workflow in the current scenario. We assume that the accepted papers are enriched with our rhetorical structure ontology; thus we take advantage of it in the first processing phase and generate an individual HTML page for each paper, containing the usual metadata plus the annotations captured by the rhetorical structure. The second phase of the process iterates over all created pages and generates an entry point in the form of an index page.
   The index page gives a short overview of all papers, but more information, generated from the metadata, is available. Readers can quickly glance through the contributions and skip to the section they are interested in. For example, the context of each paper, the background and the contribution are shown, and the individual claims are also available.

4.    ANNOTATION AND PUBLISHING

Figure 2: Component view.

   We implemented SALT and the workshop proceedings publication scenario as two independent modules. The first module creates and embeds the metadata into the document, while the second one uses them to achieve the needed functionality. In Figure 2 we present the organization of the two modules together with the third-party components used. In the following, we detail them separately.

4.1    The SALT Process
   The SALT module is responsible for embedding the instances of the mentioned ontologies into the resulting PDF document. Reaching the final result requires a series of processing steps, described in the following.

Syntax analysis and annotation extraction. This first step takes as input the LaTeX document, parses it (Parser component) and extracts the annotations present in it (Syntax Analyzer component), based on the three types of syntactic modifications detailed in Section 5.2. The result of this analysis process is a second LaTeX document (in an intermediary stage) and two sets of metadata: one which serves the population of the semantic layer, i.e. the ontologies with instances, and the other creating the foundation for the PDF visual notes. Based on the output provided by this step, the following two steps could theoretically be performed in parallel.

PDF notes embedding. The Syntax Transformer component takes the second metadata set (as described above) and, based on its analysis, creates the appropriate PDF visual notes by making use of the special commands provided by LaTeX. All the annotations are then introduced into the intermediary LaTeX document in their original positions (extracted together with the original annotations).

Annotation analysis and ontology instantiation. In parallel to the previous step, the first metadata set is also analyzed. In this case, the focus is on the N3-like statements introduced in the usual LaTeX commands and on the commands pointing to the rhetorical structure of the document. Using the Syntax Analyzer in combination with Jena's [4] N3 to RDF transformer, this step creates the appropriate instances of the ontologies in RDF format.

Final PDF document compilation. This final step has as input the intermediary LaTeX document and the instances created in the previous step. Its goal is to combine a PdfLatex compiler (in our case MiKTeX³) with the XMP LaTeX package [19] and transform the input set from LaTeX to PDF. The resulting PDF document will have incorporated the instances of the two ontologies and the visual notes.

   The whole module is packaged as a stand-alone component, and it can be used from a command line interpreter or integrated as a library in a writing environment.

4.2       The Publishing Process
   The publishing module takes as input a PDF document, or a list of PDF documents, and provides as output one or several HTML documents. The transformation process contains the following steps:
      • extraction of the instances of the ontologies embedded in the PDF document(s)
      • interpretation of the extracted metadata
      • creation of the HTML documents based on some preferences expressed by the user

   The first step is realized using the BFO PDF library⁴, which provides the means for metadata extraction from PDF files. The resulting stream is passed to the Metadata Extractor component, which separates the instances of the document, annotation and rhetorical ontologies and prepares them for interpretation.
   For publication, the user can specify a series of parameters dealing with visual aspects of the publication, like font sizes, positioning or color, and with content aspects, for example which annotated parts (or metadata) should be published. All these preferences are taken into account when interpreting the extracted instances and applied during the creation of the HTML documents. The whole process is iterative, starting from the first specified file and proceeding to the last one. The finishing touch done by the HTML Builder is the creation of the index file pointing to all previously created HTML documents.
² For examples see the workshop online proceedings at CEUR-WS.org
³ http://www.miktex.org/
⁴ http://big.faceless.org/products/pdf/
5.    THE SYNTACTIC AND SEMANTIC LAYERS
   As briefly discussed in Section 2.2, one can embed annotations in PDF documents by filling the XMP field with DublinCore metadata elements or by making use of notes, bookmarks or markups. We propose a method for the creation of Semantic Documents by exploiting and extending the two aforementioned approaches. The actual transformation combines two interlinked layers: a semantic layer and a syntactic layer.
   The semantic layer consists of the three ontologies: the document ontology, the annotation ontology and the rhetorical structure ontology (cf. Section 2.1). The metadata based on these ontologies is placed in the XMP field, thus extending the regular DublinCore elements of a PDF document.
   The syntactic layer proposes the enrichment of the LaTeX syntax with i) an analysis of the used commands, ii) the provision of additional commands and iii) arbitrary annotation of the document based on N3 statements. This layer has the goal of creating a semantic bridge between the actual document and its metadata.
   The motivation for introducing these two layers lies in the necessity of a much richer platform for embedding semantic annotations, which should also profit from the visual impact offered by the usual PDF annotation means. In the following, we detail both the semantic and the syntactic layers.
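
To make the syntactic layer more concrete, the following Python sketch shows one way additional annotation commands could be pulled out of a LaTeX source. It is not the SALT parser: the `\Command{text}` form with a single, non-nested brace argument is an assumption, the command set is limited to two illustrative names, and `extract_annotations` is a hypothetical helper.

```python
import re

# Assumed set of rhetorical commands; the real SALT command set may differ.
RHETORICAL_COMMANDS = ("Background", "Motivation")

# Matches \Name{...} with a single, non-nested brace argument (a simplification).
_PATTERN = re.compile(r"\\(%s)\{([^{}]*)\}" % "|".join(RHETORICAL_COMMANDS))


def extract_annotations(latex_source):
    """Return (annotations, plain_text).

    annotations: list of (command, text) pairs found in the source.
    plain_text:  the source with the annotation commands unwrapped, i.e. a
                 stand-in for an intermediary document without the new commands.
    """
    annotations = [(m.group(1), m.group(2)) for m in _PATTERN.finditer(latex_source)]
    plain_text = _PATTERN.sub(lambda m: m.group(2), latex_source)
    return annotations, plain_text


# Made-up input, for illustration only.
source = (
    r"\Motivation{Authors rarely annotate after writing.} "
    r"\Background{HTML tools exist.} Done."
)
annotations, plain = extract_annotations(source)
```

A regular expression is enough for this flat case; nested braces or commands inside the argument would need a real parser, which is presumably why the original system has a dedicated Parser component.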
5.1     The semantic layer
   The goal of the semantic layer is to define a proper semantic framework supporting the entire annotation process. We used three levels, each level represented by an ontology:
Document structure level, capturing the ordinary structure of the document.
Annotation level, creating the bridge between the rhetorical structure and the ordinary structure. It also captures additional metadata about the document.
Rhetorical level, which models the document in terms of rhetorical elements and builds its rhetorical structure.
An overall image of the organization of the semantic layer is presented in Figure 3. In the following we detail each of the three ontologies.

Figure 3: The internal organization of the semantic layer.
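
As an illustration of how the three levels relate, the sketch below models them with plain Python data classes. The class and field names are hypothetical and only mirror the description above (an annotation links a part of the document to a rhetorical element); the actual ontologies are the ones shown in the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    """Document structure level: an addressable part of the text."""
    section: str
    index: int
    text: str


@dataclass(frozen=True)
class RhetoricalElement:
    """Rhetorical level: the role a piece of text plays, e.g. 'Claim'."""
    kind: str


@dataclass(frozen=True)
class Annotation:
    """Annotation level: the bridge between the other two levels."""
    target: Sentence
    element: RhetoricalElement


def elements_of(annotations, kind):
    """Return the texts of all sentences annotated with the given rhetorical kind."""
    return [a.target.text for a in annotations if a.element.kind == kind]


# Made-up example data, for illustration only.
s1 = Sentence("Introduction", 0, "PDF lacks a priori annotation support.")
s2 = Sentence("Introduction", 1, "Existing tools annotate HTML after writing.")
annotations = [
    Annotation(s1, RhetoricalElement("Claim")),
    Annotation(s2, RhetoricalElement("Explanation")),
]
```

Keeping the link in a separate `Annotation` object, rather than storing the rhetorical role on the sentence itself, reflects the idea that the rhetorical structure is an annotation of the ordinary structure.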

5.1.1     The Document Ontology
   The document ontology, depicted in Figure 4, captures the structural layout of the document and maintains instances of the annotated parts of the document. This represents an intermediary solution, until we are able to use the XPointer framework [11] (cf. Section 7).
   The motivation behind this level of decomposition is the need to instantiate the annotated parts of the text. Also, the sentence represents at the moment the finest granularity for creating annotations and the reference base for the construction of the rhetorical structure. As an example, a populated instance of the document ontology will contain instances for all the words annotated during the writing process.

Figure 4: The document ontology.

5.1.2     The Annotation Ontology
   As mentioned before, the main role of the annotation ontology (Figure 5) is to relate the document ontology and the rhetorical structure ontology. Conceptually, the rhetorical structure represents an annotation of the ordinary structure. Thus, one is able to enrich the document with rhetoric elements by attaching semantic annotations to it. In ontological terms, this translates to creating instances of the Annotation concept and attaching them to the appropriate parts of the text.
   A second role of the ontology is to provide metadata about the publication as a whole. This part can be seen as an alignment to the DublinCore initiative, showing also our support for it. Each of the concepts that are part of the metadata has a direct correspondence in a DublinCore element. For the future, we intend to maintain this alignment by extending the ontology in parallel with the evolution of the DublinCore schema.

Figure 5: The annotation ontology.

5.1.3     The Rhetorical Structure Ontology
   The rhetorical structure ontology unites three sides: the knowledge captured by the rhetorical relations created between parts of the text, the rhetorical structure modeling the positioning of the contained information chunks, and the argumentative support providing the means for building a stable foundation for the rhetoric elements. In the following, we analyze the three mentioned sides of the ontology.
   The first side of the ontology deals with modeling the information chunks present in the document as rhetoric elements. This approach has its roots in the Rhetorical Structure of the Text (RST) theory [22], which describes the text in terms of the rhetoric relations existing between a Nucleus (modeled by us as the Claim) and a Satellite (in our case, the Explanation). Although the theory contains around 30 such relations, we considered only the ones which have a bigger impact (and relevance) when annotating a scientific document (e.g. Antithesis, Concession or Means). The main role of these rhetoric relations (modeled by us as concepts) is to provide a reason for the existence of the claims and the explanations in the text. Furthermore, we considered their placement in the frame created by the rhetorical structure (captured by the second side of the ontology) as a natural integration, and thus we introduced a relation between the rhetorical relation concept and the rhetorical structure concept.
   The second side of the Rhetorical Structure Ontology takes care of capturing the rhetorical structure of the document. It represents an extension of the ABCDE format proposed in [6], which stands for: Annotation, Background, Contribution, Discussion, Entities.
   As a starting point, this organization reflects a good image of a typical scientific document. But we argue that it is not enough. Therefore we propose its modification and extension with a small number of concepts, giving rise to a comprehensive rhetorical structure which could be adopted for all scientific documents.
   The modification is the replacement of the Annotation concept with the Abstract concept, since the whole rhetorical structure represents in the end an annotation of the document. In terms of extension, we propose the introduction of the following concepts: Motivation, Scenario and Conclusion. These have rhetorical relations as their foundation, but we considered that by using them as concepts of the rhetorical structure, we are able to model a complete best-practice structure for scientific documents.

Figure 6: The Rhetorical Structure Ontology.

   The two sides of the ontology described above are part of the a priori annotation process. The third side deals with the discussions, in terms of Arguments and CounterArguments, that can be initiated based on the existing claims. The motivation relies on
building a stable foundation for the claims by augmenting them with positive and negative
argumentations. Therefore, we have foreseen the need for a posteriori annotations modeling
these discussions and provided, as part of the Rhetorical Structure Ontology, the Argument
and CounterArgument concepts, together with their subconcepts and relations.
   In order to give a full understanding of what the result of the annotation process looks
like from the Rhetorical Structure Ontology point of view, we provide in Figure 7 an example
of instantiation. The example shows how a part of the text can be modeled in terms of
rhetorical elements, and how the rhetorical relations can be created.
   Consider the given phrase: ... the visual system resolves confusion by applying some
tricks that reflect a built-in knowledge of properties of the physical world. The writer
splits it into the Claim and the Explanation, and therefore instantiates two rhetorical
elements, which can be further identified by their unique associated ID. Now, based on the
definition of the Means rhetorical relation⁵, the writer can make the reader aware of it,
and thus emphasize his idea, by creating an instance of the concept modeling this relation.
This instance is then linked with the appropriate concept from the rhetorical structure,
exemplified in this case by Contribution. In terms of argumentative discussions, the example
shows how the claim can afterwards be linked to instances of positive or negative arguments,
and how the counter argument instances are modeled in relation to the initial arguments.

⁵ The Means rhetorical relation states that the Explanation presents a method or instrument
which tends to make realization of the Claim more likely.

5.2    The syntactic layer
   The second layer introduced for embedding annotations into the PDF documents is the
syntactic layer. Since we are targeting a priori annotations, created manually during the
writing process, our approach proposes the enrichment of the LaTeX syntax in three ways:

    • through command syntax extension
    • by embedding N3-like statements in usual commands
    • by introducing new commands

   Our goal for this modified syntax structure is to have a lightweight form, as close as
possible to the usual one, in order to avoid overwhelming ordinary users. Therefore, the
first two types of modification, i.e. command syntax extension and N3-like statements
integration, maintain the syntactical core of the command, while the third one introduces
simple new commands having a syntax similar to the usual ones. The resulting mixture of
commands has the most natural LaTeX form possible.

Command syntax extension. The syntax extension process was developed for the commands
    whose main goal is the structuring of the document. Therefore, commands like abstract,
    section or subsection were extended with a new field meant for assigning comments
    (free text annotation) to the corresponding part of the document. The field is
    delimited by a pair of curly brackets.
    Example: section{Introduction}{[...]}

N3-like statements integration. This second type of modification is the usage of N3-like
    statements in conjunction with LaTeX commands. These statements model information
    about the document as subject, or about an arbitrary subject. This enrichment was
    inspired by the N3 notation [2] and we believe that it represents a lightweight and
    easy enough notation to be adopted for creating semantic annotations in a scientific
    document.

New commands. In order to be able to manually annotate the document with the rhetorical
    structure, we introduced a series of new commands, similar to the usual LaTeX ones.
    For example, \Background or \Motivation. All the newly introduced commands also
    support the extension described above.

   Figure 8 depicts the result of the overall annotation process using SALT. The first
operation is the parsing of the existing LaTeX document and the metadata extraction from
the usual commands and the N3-like statements. In the figure these are represented by the
Author and Title commands and the N3 statements about the topic of the paper, having as
foundation the SWRC ontology.
   The second operation is the instantiation of the Document Ontology (also presented in
the figure), based on the document's structural information. Next, SALT analyzes the
command extensions (like the one for the Use-case section in the figure) and the newly
introduced commands and environments (like claim, explanation or the scenario environment).
It builds the rhetorical structure based on them and represents it as an RDF graph. Note
that each rhetorical element has a label attached (here c1, e1 and p1, respectively) for
the purpose of future referencing.
   The final step is embedding the necessary information in the PDF document for the
creation of the visual notes. The example shows the visual note attached to the Use-case
section and the visual notes representing the beginning and the end of the Scenario
rhetorical branch. The latter also contains the information about the rhetorical relations
found as part of this branch.
   In general, the three main phases of the overall process are:

The creation of the semantic annotations and thus the document
     enrichment during the authoring process.
The ontology instantiation from the created annotations, together
     with the creation of the semantic links between the three levels
     of the semantic layer.
The visual representation of some of the annotations in the
     resulting PDF document.

   In conclusion, we make a short analysis of the modifications. In the first case,
switching from the usual LaTeX commands to the extended ones by adding the annotation field
should be straightforward, and should be considered an enrichment rather than a source of
confusion for ordinary users. The second modification, i.e. the introduction of the N3-like
statements, enables the author to insert arbitrary annotations. The last category of
modifications, represented by the addition of new commands, was necessary in order to
represent the rhetorical structure of the document.

6.   RELATED WORK
   To ease the reasoning or retrieval of documents published on the Desktop or Web, the
documents should be classified in a way that users find helpful and meaningful. There exist
several activities focused on semantic annotation as a way to enrich a document, making it
machine-readable and also accessible to humans. The Writing in the Context of Knowledge
(WiCK) project aims to produce a novel writing tool to help authors improve the coherence
and consistency of the documents they are creating by helping to assimilate key knowledge
in each new document [3]. CREAM is a comprehensive framework which is specialized for
populating HTML
                              Figure 7: An example of instantiation of the Rhetorical Structure Ontology.
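
As a rough illustration of the kind of instantiation shown in Figure 7, the following Python sketch records the example as subject-predicate-object triples. It is only a sketch under stated assumptions: the `salt:` prefix, the property names (`hasNucleus`, `hasSatellite`, `partOf`, `supports`, `attacks`) and all instance IDs are hypothetical and are not taken from the actual Rhetorical Structure Ontology, and a plain set of tuples stands in for a real RDF store.

```python
# Hypothetical prefix; the ontology's real namespace is not given in the text.
SALT = "salt:"


def build_example_graph():
    """Return the example as a set of (subject, predicate, object) triples."""
    triples = set()

    # The writer splits the phrase into two rhetorical elements with unique IDs.
    triples.add(("c1", "rdf:type", SALT + "Claim"))
    triples.add(("e1", "rdf:type", SALT + "Explanation"))

    # One instance of the Means relation: the Claim acts as nucleus and the
    # Explanation as satellite (the property names are assumptions).
    triples.add(("m1", "rdf:type", SALT + "Means"))
    triples.add(("m1", SALT + "hasNucleus", "c1"))
    triples.add(("m1", SALT + "hasSatellite", "e1"))

    # The relation instance is linked to a concept of the rhetorical structure.
    triples.add(("contribution1", "rdf:type", SALT + "Contribution"))
    triples.add(("m1", SALT + "partOf", "contribution1"))

    # A posteriori discussion: an argument about the claim and a counter
    # argument directed at that argument.
    triples.add(("a1", "rdf:type", SALT + "Argument"))
    triples.add(("a1", SALT + "supports", "c1"))
    triples.add(("ca1", "rdf:type", SALT + "CounterArgument"))
    triples.add(("ca1", SALT + "attacks", "a1"))
    return triples


def objects(triples, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for (s, p, o) in triples if s == subject and p == predicate)
```

Calling `objects(build_example_graph(), "m1", "salt:hasNucleus")` retrieves the claim that the Means instance is attached to; in a real setting an RDF library and the ontology's own vocabulary would replace the tuple set and the invented property names.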


pages with ontological concepts. It allows authors to build the documents by dragging and
dropping concepts and properties from the ontology browser to a text editor [13].
   Most activities have proposed their own semantic structure based on ontologies.
Ontological structures offer not only fundamental value for semantic annotation, but also
additional possibilities such as inferencing or semantic retrieval [20]. The Semantic Web
Research Community (SWRC) ontology, which originates from OntoWeb, can be used to provide
detailed information about research work. It models the Semantic Web research community,
including researchers, publications, tools, and topics.
   Generally speaking, semantic documents include any information regarding the document
and its relationship with other documents [12]. Therefore, a semantic annotation of
documents formally identifies concepts and relations between concepts in documents, and is
intended primarily for use by machines [24]. There are several efforts related to semantic
documents, such as SemanticWord [23], OntoOffice [10], and SemTalk [8]. Eriksson [7]
proposes a PDF backend approach, which uses PDF as the basis for a Protege storage backend.
It allows users to store ontologies and knowledge bases inside PDF files. In some previous
work, however, ontological information or metadata exists in a different place than the
document itself. XMP is a format for embedding knowledge in documents [3]. Adobe's XMP [16]
is a labeling technology that allows RDF constructs to be embedded in HTML, PDF documents
and all Adobe formats.
   In terms of the rhetorical structure of the text, [21] provides a deep analysis of the
application domains in which it is used, e.g. computational linguistics, cross-linguistic
studies or dialogue and multimedia. From our perspective, the work done by Geurts et al. [9]
and Uren et al. [25] seems interesting, because they are, to our knowledge, among the only
references which try to model the rhetorical structure as an ontology.
   [25] describes a framework for sensemaking tools in the context of the Scholarly
Ontologies Project. Their starting point is represented by the requirements for a discourse
ontology, having as foundation the structure of the claim. The resulting ontology finds its
roots in the CCR (Cognitive Coherence Relations) Theory and models the rhetorical links in
terms of similarity, causality or challenges. Their goal is to create and visualize claim
networks from scholarly documents (represented as HTML files) using a central knowledge
server. One of our future goals is also to create such knowledge networks, but using active
references embedded in the semantic document, as opposed to their central approach. Also,
our solution places the annotations in their natural environment, i.e. as part of the
document to which they are attached, thus transforming it into a semantic document.
   The second interesting reference mentioned is [9]. It models the process of transforming
semantic graphs into multimedia presentations, using domain knowledge and discourse
analysis. Their work focuses more on using parts of the text for presentation purposes,
whereas ours provides a method for enriching normal documents with semantic annotations,
also based on discourse analysis.
   In this paper, we propose the document ontology to express much richer semantics in
documents, including the extension of the ABCDE format [6] for semantic structures of the
document. From a representational and technical perspective, our approach differs from
other approaches in that ontologies support more sophisticated modeling for specifying
relations of scientific documents. Moreover, an embedding technology using XMP provides
efficient sharing support, which makes it possible to share information about the document
itself.

7.    DISCUSSION
   In this section we raise some of the most interesting issues that appeared while
researching the concepts presented in the current paper. Although the list could be much
longer, we restrict ourselves to two of them: i) document instance maintenance and
ii) object identification and reference, the latter also being the source of the problem
for the first one.
   Our current approach solves the document instance maintenance issue by creating an
instance for every annotated information chunk, the finest granularity being the word. The
main reason is the (general) lack of a proper reference mechanism inside the PDF document,
especially when created from LaTeX. Analyzing the provided solution, we could argue that it
presents a possible advantage for future development but at the same time also a quite
clear disadvantage. The advantage consists in the possibility of representing whole
documents as instances of the document ontology and then using the instances for versioning
purposes and semantic diff operations. Obviously, the semantics of the diff operation has
to be defined first as part of a proper context, maybe in a similar way as realized in [27].
The disadvantage of this approach is the explosion in space of the document, considering
the number of triples that need to be created for each word.
   The second issue deals with object identification and reference. PDF documents have an
internal organization represented by tree-
                                        Figure 8: The result of the annotation process using SALT.
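
The first operation behind Figure 8, parsing the LaTeX source and extracting the annotations carried by extended commands, can be sketched in a few lines of Python. This is a minimal illustration and not SALT's actual parser: it assumes the extended form `\section{Title}{annotation}` (and likewise for `\subsection`), it assumes that neither field contains nested braces, and the function names `extract_annotations` and `strip_annotations` are hypothetical.

```python
import re

# Matches \section{Title}{annotation} or \subsection{Title}{annotation}.
# Assumption: neither field contains nested braces.
EXTENDED_CMD = re.compile(r"\\(section|subsection)\{([^{}]*)\}\{([^{}]*)\}")


def extract_annotations(latex_source):
    """Return (command, title, annotation) tuples for every extended command."""
    return [match.groups() for match in EXTENDED_CMD.finditer(latex_source)]


def strip_annotations(latex_source):
    """Remove the annotation field so that ordinary LaTeX commands remain."""
    return EXTENDED_CMD.sub(r"\\\1{\2}", latex_source)
```

Keeping extraction and stripping separate reflects the idea that the extended syntax preserves the syntactical core of the command: once the extra field is removed, the remaining source is plain LaTeX again, while the extracted tuples can feed the ontology instantiation step.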


based structures of complex objects and streams [17], together with their associated
properties. Post-creation analysis of the document, and thus the reconstruction of this
internal organization, represents a hard task, due to the dependency on a handful of
parameters, such as access rights, image analysis or the accuracy of text retrieval
algorithms. As a consequence, object referencing inside the document also becomes hard to
accomplish.
   On the other hand, we are dealing with a priori annotations, which makes the situation
even more complex. The annotation process takes place during the writing process, in the
LaTeX environment, and thus the targeted PDF document does not even exist yet. Still, to be
able to reference the annotated parts of the document, we adopted the following solution:
the document structure was captured in the document ontology, giving us the means of
referencing the information chunks at sentence granularity. For referencing inside the
sentence (word granularity) we introduced a base and an offset, pointing to the needed part
of the sentence.
   As a future improvement of this process, i.e. reference inside the document, we intend
to build a DOM-like model (or a B-Tree model) of the LaTeX document and map its structure
to the tree-based internal structure of the PDF document. This approach would give us the
following advantages:

   • In terms of identification, we would be able to provide a unique identification for
     each information chunk, in the context of the document.

   • In terms of reference, we would have the opportunity of using the XPointer framework
     [11] in conjunction with the document's model.

   The combination of the two aforementioned issues could start a new direction for
creating semantic knowledge networks using information chunks from documents, by means of
active references, rather than the existing static links. One would be able to directly
embed a certain information object, or discuss a certain claim, in her scientific document,
by providing only its active reference. The resulting semantic network tends to come close
to Ted Nelson's Xanadu vision [18].

8.     CONCLUSION
   In this paper we have described the authoring and annotation of semantic documents to
provide semantic annotation for the desktop. SALT leaves semantic data where it can be
handled best, together with the document. Thus SALT provides LaTeX authors with a
comparatively simple and intuitive means to create Semantic Documents.
   To attain this objective we have defined a SALT process, the appropriate ontologies and
the architecture. We have incorporated the means for rhetorical markup of a document, which
allows, for example, the scientific author to explicitly mark up his contribution and
the claims he made and the support for these claims. This explicit annotation provides, as
shown in our scenario, an innovative and improved presentation and navigation of online
proceedings. Furthermore, it will enable other authors to explicitly and directly reference
these claims and other related information. In the end this will lead to interconnected
Semantic Documents.
   For the future, there is a long list of open issues concerning the authoring of semantic
PDF documents, from the more mundane, though important, ones (top) to far-reaching ones
(bottom):

     1. PDF referencing, as we described it in Section 7.

     2. Creation of semantic knowledge networks using PDF documents, by active references,
        also introduced in Section 7.

     3. Automatic derivation of markup.

     4. Other information structures (or formats), for example, incorporating not only the
        annotations created on the text, but also the ones created for the pictures that
        are part of the Semantic Document.

   We believe that these options make SALT a rather intriguing approach on which a
considerable amount of scientific semantic documents might be built.

Acknowledgments

This work is funded by the European Commission 6th Framework Programme in context of the
EU IST NEPOMUK IP - The Social Semantic Desktop Project, FP6-027705. Special thanks to Big
Faceless Organization (big.faceless.org) for providing the PDF library used in the metadata
analysis process. Further we thank Anita de Waard for fruitful discussions at ISWC 2005 and
ESWC 2006.

9.     REFERENCES
 [1] DublinCore Metadata Initiative. http://dublincore.org/.
 [2] Tim Berners-Lee. An readable language for data on the web -
     notation 3, 1998.
     http://www.w3.org/DesignIssues/Notation3.
 [3] L. Carr, T. Miles-Board, G. Wills, A. Woukeu, and W. Hall.
     Towards a Knowledge-Aware Office Environment. In
     D. Karagiannis and U. Reimer, editors, Proceedings of 5th
     International Conference on Practical Aspects of Knowledge
     Management (PAKM 2004), volume LNAI 3336, pages
     129–140, 2004.
 [4] J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne,
     and K. Wilkinson. Jena: Implementing the Semantic Web
     Recommendations. Technical Report HPL-2003-146,
     Hewlett-Packard, Dec 2003.
     http://www.hpl.hp.com/techreports/2003/HPL-2003-146.html.
 [5] Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli, and Yorick
     Wilks. User-System Cooperation in Document Annotation
     Based on Information Extraction. volume 2473, pages 122+,
     January 2002.
 [6] Anita de Waard and Gerard Tel. The ABCDE format -
     Enabling Semantic Conference Proceeding. In Proceedings
     of 1st Workshop: "SemWiki2006 - From Wiki to Semantics",
     Budva, Montenegro, 2006.
 [7] Henrik Eriksson. A PDF Storage Backend for Protege. In
     Proceedings of the 9th Protege International Conference,
     Stanford, California, USA, 2006.
 [8] C. Fillies, G. Wood-Albrecht, and F. Weichardt. A Pragmatic
     Application of the Semantic Web using SemTalk. In
     Proceedings of the Eleventh International World Wide Web
     Conference, Honolulu, Hawaii, USA, pages 686–692, 2002.
 [9] Joost Geurts, Stefano Bocconi, Jacco van Ossenbruggen,
     and Lynda Hardman. Towards Ontology-driven Discourse:
     From Semantic Graphs to Multimedia Presentations.
     Technical report, Centrum voor Wiskunde en Informatica
     (INS-R0305), May 31, 2003.
[10] Ontoprise GmbH. OntoOffice Tutorial, 2003.
     http://www.ontoprise.de/documents/tutorial ontooffice.pdf.
[11] P. Grosso, E. Maler, J. Marsh, and N. Walsh. XPointer
     element() Scheme, 2003.
     http://www.w3.org/TR/xptr-element/.
[12] W. Guoren, W. Bin, H. Donghong, and Q. Baiyou. Design
     and Implementation of a Semantic Document Management
     System. Information Technology Journal 4, 1:21–31, 2005.
[13] S. Handschuh and S. Staab. Authoring and Annotation of
     Web Pages in CREAM. In Proceedings of the 11th
     International World Wide Web Conference, WWW 2002,
     Honolulu, Hawaii, May 7-11, 2002, pages 462–473. ACM
     Press, 2002.
[14] S. Handschuh, S. Staab, and A. Maedche. CREAM -
     Creating Relational Metadata with a Component-Based,
     Ontology-Driven Annotation Framework. In Proceedings of
     the First International Conference on Knowledge Capture
     (K-Cap 2001), pages 76–83, Victoria, B.C., Canada, October
     2001. ACM Press.
[15] Adobe Systems Incorporated. Adobe Acrobat SDK.
     http://partners.adobe.com/public/developer/acrobat/sdk/index.html.
[16] Adobe Systems Incorporated. Extensible Metadata Platform.
     http://www.adobe.com/products/xmp/.
[17] Adobe Systems Incorporated. PDF Reference - Adobe
     Portable Document Format, April 2004.
     http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf.
[18] Ted Nelson. Literary Machines: The report on, and of,
     Project Xanadu concerning word processing, electronic
     publishing, hypertext, thinkertoys, tomorrow's intellectual...
     including knowledge, education and freedom. Mindful Press,
     Sausalito, California, 1981 edition: ISBN 089347052X,
     1981.
[19] Maarten Sneep. The XMP inclusion package, 2005.
[20] S. Staab, A. Maedche, and S. Handschuh. An annotation
     framework for the semantic web. In Proceedings of the First
     Workshop on Multimedia Annotation, Tokyo, Japan, January
     30-31 2001.
[21] Maite Taboada and William C. Mann. Applications of
     Rhetorical Structure Theory. Discourse Studies, 8, No.
     4:567–588, 2006.
[22] Maite Taboada and William C. Mann. Rhetorical Structure
     Theory: looking back and moving ahead. Discourse Studies,
     8, No. 3:423–459, 2006.
[23] Marcello Tallis. Semantic Word Processing for Content
     Authors. In Proceedings of the Knowledge Markup &
     Semantic Annotation Workshop, Florida, USA, Part of the
     Second International Conference on Knowledge Capture,
     K-CAP 2003, 2003.
[24] Victoria Uren, Philipp Cimiano, José Iria, Siegfried
     Handschuh, Maria Vargas-Vera, Enrico Motta, and Fabio
     Ciravegna. Semantic Annotation for Knowledge
     Management: Requirements and a Survey of the State of the
     Art. Journal of Web Semantics 4, 1:14–28, 2006.
[25] Victoria Uren, Simon Buckingham Shum, Gangmin Li, and
     Michelle Bachler. Sensemaking Tools for Understanding
     Research Literatures: Design, Implementation and User
     Evaluation. Int. Jnl. Human Computer Studies, 64,
     No.5:420–445, 2006.
[26] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni,
     A. Stutt, and F. Ciravegna. MnM: Ontology Driven
     Semi-Automatic and Automatic Support for Semantic
     Markup. In EKAW02, 13th International Conference on
     Knowledge Engineering and Knowledge Management,
     LNCS/LNAI 2473, pages 379–391, Sigüenza, Spain,
     October 2002. Springer.
[27] Max Voelkel and Tudor Groza. SemVersion: RDF-based
     Ontology Versioning System. In Proceedings of the IADIS
     International Conference WWW/Internet (ICWI 2006),
     Murcia, Spain, 2006.