<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Encoder and Annotator: an all-in-one editor for transcribing and annotating manuscripts with RDF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Valsecchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Abrate</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clara Bacciu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Piccini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Marchetti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Computational Linguistics</institution>
          ,
          <addr-line>ILC</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Informatics and Telematics</institution>
          ,
          <addr-line>IIT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the context of the digitization of manuscripts, transcription and annotation are often distinct, sequential steps. This can lead to difficulties in improving the transcribed text when annotations have already been defined. In order to avoid this, we devised an approach which merges the two steps into the same process. Text Encoder and Annotator (TEA) is a prototype application embracing this concept. TEA is based on a lightweight language syntax which annotates text using Semantic Web technologies. Our approach is currently being developed within the Clavius on the Web project, devoted to studying the manuscripts of Christophorus Clavius, an influential 16th-century mathematician and astronomer.</p>
      </abstract>
      <kwd-group>
        <kwd>Manuscript Transcription</kwd>
        <kwd>Annotation</kwd>
        <kwd>RDF</kwd>
        <kwd>Semantic Web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Within the field of Digital Humanities, several projects are devoted to preserving,
analyzing and studying the large numbers of manuscripts, books, newspapers,
maps, photos and paintings stored in archives, museums and libraries around
the world.</p>
      <p>
        The Clavius on the Web<sup>3</sup> project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims to restore and enrich the manuscripts
written by Christophorus Clavius (1538-1612), one of the most respected and
influential mathematicians and astronomers of his time. Among the works
preserved by the Historical Archives of the Pontifical Gregorian University (APUG)
is the autograph manuscript, used for the printed edition of 1574, entitled
Euclidis Elementorum Libri XV. Accessit XVI de solidorum regularium
comparatione. Omnes perspicuis demonstrationibus, accuratisque scholiis illustrati.
It is an annotated translation from Greek into Latin of Euclid’s Elements, the
famous text on arithmetic and geometry from the 3rd century BC. The text was
considered one of the most comprehensive and authoritative of the 16th century
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 http://claviusontheweb.it/</title>
      <p>so that personalities such as René Descartes, Marin Mersenne and Johannes
Kepler built their knowledge on it.</p>
      <p>In this context, Semantic Web technologies, such as the Resource Description
Framework (RDF), make it possible to approach text annotation in an innovative
way. RDF-based annotations provide a method for enriching texts using
structured data already described in detail and maintained by the Semantic Web
community (i.e., Linked Data sets). In addition, due to the interlinked structure
of Linked Data, RDF-based annotations produce valuable annotated documents,
characterized by a strong connection with external resources.</p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>Text annotation consists of attaching additional information, such as comments,
tags or links, to specific portions of a text. Annotations are mainly
performed using two methods: inline and standoff. The former directly includes
annotations within the text, while the latter defines them in a different location.
Inline markup keeps the annotations and the annotated text close together, but
it has the drawback of weighing the document down. Moreover, depending on
the complexity of the markup language, the text could become hard to read.
This aspect is crucial and must be taken into account when developing manual
annotation tools, as users need to be able to read the annotated text with ease.
Last but not least, a complex and heavyweight markup language could make the
manual annotation process even more difficult, since users first have to know all
the syntax rules and then write a considerable amount of additional markup.</p>
      <p>In contrast, the standoff approach does not suffer from markup overloading,
owing to its total independence from the resource text. Annotations are in
fact defined separately, in a different location where the relative text offsets are
stored and kept up to date. In addition, standoff markup has the advantage of
allowing overlapping annotations. Nevertheless, this approach has some
drawbacks related to the sequential process of transcribing and annotating. Typically,
the available tools of this type separate the transcription and the annotation
phases. However, if a transcription error is found during the subsequent
annotation phase, the offsets must be recomputed in order to reflect the changes
in the transcribed text, a process that can be complex and costly. The logical
conclusion is therefore to make transcription and annotation a joint process.</p>
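      <p>The cost of keeping standoff offsets up to date can be illustrated with a minimal Python sketch. The Annotation class, the apply_correction function and the sample string are hypothetical and introduced only for illustration; they are not taken from any of the tools discussed here. The sketch assumes that a correction never straddles an annotation boundary, a case that a real tool would have to handle with more care.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """A standoff annotation: character offsets into the transcription."""
    start: int  # inclusive
    end: int    # exclusive
    label: str


def apply_correction(text, annotations, pos, old, new):
    """Replace `old` with `new` at index `pos` and shift the offsets.

    Simplifying assumption: the corrected span lies entirely before,
    entirely after, or entirely inside each annotation.
    """
    if text[pos:pos + len(old)] != old:
        raise ValueError("text at pos does not match the string to replace")
    delta = len(new) - len(old)
    edit_end = pos + len(old)
    corrected = text[:pos] + new + text[edit_end:]
    shifted = []
    for a in annotations:
        if a.end <= pos:
            # Entirely before the edit: offsets are unchanged.
            shifted.append(a)
        elif a.start >= edit_end:
            # Entirely after the edit: both offsets move by delta.
            shifted.append(Annotation(a.start + delta, a.end + delta, a.label))
        elif a.start <= pos and edit_end <= a.end:
            # Edit inside the annotation: only the end offset moves.
            shifted.append(Annotation(a.start, a.end + delta, a.label))
        else:
            raise ValueError("edit overlaps an annotation boundary")
    return corrected, shifted


# Invented sample transcription containing a spelling error.
text = "Omne triangulum oxigonum est"
annotations = [Annotation(5, 15, "term"), Annotation(16, 24, "term")]
fixed, moved = apply_correction(text, annotations, 16, "oxigonum", "oxygonium")
```
]]></preformat>
      <p>Even in this simplified form, every annotation that contains or follows the corrected word has to be rewritten, which is the kind of bookkeeping a joint transcription and annotation process is meant to avoid.</p>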
    </sec>
    <sec id="sec-4">
      <title>Our Approach</title>
      <p>The core idea of this work is to combine transcription and annotation of text,
thus streamlining the workflow. Hence, we devised a lightweight language
that enables this continuous and mixed process. Another key point is that we
propose to treat every textual phenomenon as an annotation, regardless of
its specific type (e.g., semantic, syntactic, lexical). Every portion of text is treated
in the same way, and RDF-based annotations are used to describe its content.
RDF supports our purpose by allowing any possible annotation to be specified
using the large number of ontologies, vocabularies and Linked Data sets
available on the Web.</p>
      <p>
        To the best of our knowledge, none of the existing tools adopts this approach.
For instance, Pundit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Refer.cx [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provide RDF-based annotation features,
but these are limited to web pages and offer no possibility of transcribing text.
Brat [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] allows for text annotation supported by Natural Language Processing
technology; however, it does not provide a transcription feature. RDFaCE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is
a text editor for annotating text using a graphical UI and displaying results
with different views such as WYSIWYG and WYSIWYM (i.e., What You See
Is What You Get/Mean).
      </p>
      <sec id="sec-4-1">
        <title>Text Encoder and Annotator</title>
        <p>In light of the above, this article presents the Text Encoder and Annotator (TEA), a
Web application that provides an editor to transcribe texts and a lightweight
language for annotating them with RDF, all within the same environment.
It is intended for linguists, historians and, more generally, scholars and
students. We devised a layout composed of three main views, placed horizontally
along the interface of the application (Figure 1):</p>
        <list list-type="order">
          <list-item>
            <p>Image box: it displays the digitized image of the specific manuscript to be
transcribed and annotated.</p>
          </list-item>
          <list-item>
            <p>Editor box: it is the main component of the tool. It is used to write text and
enrich it with RDF-based annotations, which follow the specific language
syntax described below. Text highlighting identifies the markup and
improves its legibility. Moreover, a top bar contains shortcut buttons for
inserting some basic annotations, characterized by common RDF predicates, such
as rdfs:seeAlso, which can be used to link to external resources, rdfs:comment,
to specify text comments, and foaf:page, to include hyperlinks related to the
topic the annotation is about.</p>
          </list-item>
          <list-item>
            <p>Diagram box: it is an optional view that can be activated on demand if the
user needs a summary of their annotations. The node-link diagram contains a
white node representing the whole text of the document, blue nodes
identifying the portions of text referred to in an annotation, orange nodes displaying
the identifiers of the annotations and gray nodes describing the objects of the
RDF triples specified. The edges between orange and gray nodes represent
the predicates of the triples defined in the annotations.</p>
          </list-item>
        </list>
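        <p>The node types of the Diagram box can be modeled as a plain list of nodes and edges. The following Python sketch is only an assumption about how such a diagram could be assembled; the build_diagram function, the dictionary keys and the unlabeled structural edges (document to text portion, text portion to identifier) are hypothetical and are not taken from the TEA implementation. Only the labeled edges between orange and gray nodes follow the description above.</p>
        <preformat><![CDATA[
```python
def build_diagram(spans, triples):
    """Build a node-link structure from annotations.

    spans:   {identifier: annotated portion of text}
    triples: iterable of (subject identifier, predicate, object)
    """
    nodes = [{"id": "document", "color": "white", "label": "document"}]
    edges = []
    for ident, portion in spans.items():
        nodes.append({"id": "text:" + ident, "color": "blue", "label": portion})
        nodes.append({"id": "ann:" + ident, "color": "orange", "label": ident})
        # Assumed structural links; the description does not specify them.
        edges.append({"source": "document", "target": "text:" + ident, "label": None})
        edges.append({"source": "text:" + ident, "target": "ann:" + ident, "label": None})
    seen_objects = set()
    for subj, pred, obj in triples:
        if obj not in seen_objects:
            # One gray node per distinct triple object.
            seen_objects.add(obj)
            nodes.append({"id": "obj:" + obj, "color": "gray", "label": obj})
        # Orange-to-gray edge labeled with the predicate of the triple.
        edges.append({"source": "ann:" + subj, "target": "obj:" + obj, "label": pred})
    return nodes, edges


# Invented sample data for illustration.
spans = {"t1": "triangulum"}
triples = [
    ("t1", "rdfs:seeAlso", "dbr:Triangle"),
    ("t1", "foaf:page", "http://en.wikipedia.org/wiki/Triangle"),
]
nodes, edges = build_diagram(spans, triples)
```
]]></preformat>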
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Lightweight language</title>
      <p>Considering the pros and cons of inline and standoff annotations discussed above, we think that a hybrid lightweight syntax is a suitable solution to fulfill our requirement of having a simultaneous workflow of transcription and annotation.</p>
      <p>
        We propose a Lightweight Markup Language<sup>4</sup> (LML) employed only within
the interface of TEA, in order to provide an easy and quick way of transcribing
and annotating text using RDF. The main reason behind the adoption of an
LML is that common markup languages based on XML (e.g., TEI) are not easy
to write and read in their raw form, due to their complex syntax. Moreover, the
use of LMLs has already proven to be beneficial in other systems such as the
Leiden-plus<sup>5</sup> language employed in the papyri.info editor. It is worth clarifying
that we do not propose our approach as a format for the representation and
interchange of texts; consequently, it cannot be compared to standards such as
the Text Encoding Initiative (TEI) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Furthermore, our markup language should
be considered as distinct from semantic markup languages like Microformats [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
RDFa [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Microdata [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], since it has not been devised for annotating
HTML and XML documents.
      </p>
      <p>Our language is used for marking portions of text and assigning them
identifiers that will be used in a different, reserved, "standoff-like" section of the
text where the details of the annotations are specified (Figure 2). More precisely,
a portion of text can be annotated by enclosing it in a span using angle brackets,
while round brackets specify a string identifying the annotation (i.e., &lt;annotated
portion of text&gt;(identifier)). This inline syntax uses a very limited number of
characters and does not weigh the text down too much, keeping it easy to read.
Identifiers are then used in a distinct part of the text, called the directive section,
where the annotation body is specified. Three plus signs (i.e., +++) are used
both to open and to close this section, which can be repeated within the text more
than once. Inside this block, annotations can be specified as RDF triples with the
identifier of a certain span of text as subject. The choice of predicates and objects
is totally free, although some default predicates are suggested in order to
perform the most common and basic annotations (e.g., rdfs:seeAlso, rdfs:comment,
foaf:page). Currently, the syntax used in the directive section defines triples as
three text values separated by spaces.</p>
      <p>4 http://en.wikipedia.org/wiki/Lightweight_markup_language
5 http://papyri.info/editor/documentation?docotype=text</p>
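      <p>The syntax described above is simple enough to be parsed with two regular expressions. The following Python sketch is a hypothetical reading of that description, not the TEA parser: it assumes ASCII angle brackets, one triple per line in the directive section, and an object that extends to the end of the line so that comments may contain spaces. The parse_lml function and the sample text are invented for illustration.</p>
      <preformat><![CDATA[
```python
import re

# Inline span: <annotated portion of text>(identifier)
SPAN = re.compile(r"<([^<>]*)>\(([^()\s]+)\)")
# Directive section: everything between a pair of +++ delimiters.
DIRECTIVES = re.compile(r"\+\+\+(.*?)\+\+\+", re.DOTALL)


def parse_lml(source):
    """Return (plain_text, spans, triples) for a TEA-like document.

    spans maps each identifier to (start, end) offsets in plain_text.
    triples is a list of (subject, predicate, object) tuples.
    Nested or overlapping spans are not handled.
    """
    triples = []
    for block in DIRECTIVES.findall(source):
        for line in block.splitlines():
            parts = line.split(None, 2)  # the object keeps inner spaces
            if len(parts) == 3:
                triples.append(tuple(parts))
    # Remove the directive sections, leaving only the transcribed text.
    body = DIRECTIVES.sub("", source)

    spans, pieces, length, last = {}, [], 0, 0
    for match in SPAN.finditer(body):
        before = body[last:match.start()]
        pieces.append(before)
        length += len(before)
        content, ident = match.group(1), match.group(2)
        # Offsets refer to the text with the inline markup stripped.
        spans[ident] = (length, length + len(content))
        pieces.append(content)
        length += len(content)
        last = match.end()
    pieces.append(body[last:])
    return "".join(pieces), spans, triples


# Invented sample, not the text shown in Figure 2.
sample = """Omne <triangulum>(t1) est figura.
+++
t1 rdfs:seeAlso dbr:Triangle
t1 rdfs:comment a basic geometric term
+++
"""
text, spans, triples = parse_lml(sample)
```
]]></preformat>
      <p>Because the offsets are derived from the markup each time the text is parsed, a correction to the transcription does not require any stored offset to be updated by hand.</p>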
    </sec>
    <sec id="sec-6">
      <title>Example</title>
      <p>
        Here we provide an example of annotation performed on a portion of text
extracted from the Euclidis Elementorum Libri XV. Accessit XVI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (paragraph
30) by Clavius. A free English translation of the Latin text fragment, shown
in Figure 2, follows: "It is called Oxigonium as it has three acute angles.
Every Oxigonium triangle, or Acutangle triangle, could be either Equilateral, or
Isosceles or Scalene as you can see from the classification provided above and
not reported here". Figure 2 shows the code resulting from the text encoding
and annotation process. Portions of text are marked using spans, while the body
of the annotations is specified within the directive section. The annotation
identifiers (e.g., s1, t1, t2) are used as the subjects of the triples, while predicates and
objects are freely chosen by annotators. The Latin language (i.e.,
lexvo:iso6393/lat) and the translation have been specified in the first annotation, s1,
using the Lexvo ontology<sup>6</sup>. Lexical entries of the mathematical lexicon of
Clavius<sup>7</sup> (e.g., cll:math/triangulum_oxygonium) and the DBpedia Triangle resource
(i.e., dbr:Triangle) have been linked through the seeAlso predicate of RDF
Schema<sup>8</sup>. The Wikipedia entry for the triangle has been specified as a relevant
web page (i.e., foaf:page) for the annotation e1 using the FOAF vocabulary.
      </p>
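      <p>To show how directive-section triples of this kind could be turned into standard RDF, the following Python sketch expands prefixed names and serializes the result as N-Triples. It is a sketch under stated assumptions rather than the export used by TEA: the cll namespace is inferred from the lexicon URL given in the footnotes, the base IRI for annotation identifiers is a placeholder, and the assignment of all three triples to the identifier t1 is an assumption, since Figure 2 is not reproduced here.</p>
      <preformat><![CDATA[
```python
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbr": "http://dbpedia.org/resource/",
    # Assumption: inferred from the lexicon URL in the footnotes.
    "cll": "http://claviusontheweb.it/lexicon/",
}
# Hypothetical base IRI used to mint subjects from annotation identifiers.
BASE = "http://example.org/annotation/"


def term(value):
    """Render a value as an N-Triples IRI or as a plain string literal."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in PREFIXES:
        return "<" + PREFIXES[prefix] + local + ">"
    if value.startswith(("http://", "https://")):
        return "<" + value + ">"
    # Anything else is treated as a literal; escape backslashes and quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Serialize (identifier, predicate, object) tuples, one triple per line."""
    lines = []
    for subj, pred, obj in triples:
        lines.append("<" + BASE + subj + "> " + term(pred) + " " + term(obj) + " .")
    return "\n".join(lines)


triples = [
    ("t1", "rdfs:seeAlso", "cll:math/triangulum_oxygonium"),
    ("t1", "rdfs:seeAlso", "dbr:Triangle"),
    ("t1", "foaf:page", "http://en.wikipedia.org/wiki/Triangle"),
]
print(to_ntriples(triples))
```
]]></preformat>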
    </sec>
    <sec id="sec-7">
      <title>6 http://lexvo.org/ontology</title>
      <p>7 http://claviusontheweb.it/lexicon/math/
8 http://www.w3.org/TR/rdf-schema/</p>
      <sec id="sec-7-1">
        <title>Conclusion &amp; Future Work</title>
        <p>This article describes an approach for merging the distinct steps of transcription
and annotation into a single process. We implemented a tool based on a lightweight
language syntax that allows RDF annotations to be performed. We conducted
some preliminary tests involving 50 students, who were asked to use the
prototype and provide feedback. Future work will consist of developing an
improved language with a syntax capable of handling nested as well as overlapping
(i.e., not hierarchically nested) annotations. New syntax elements will be
introduced: milestone elements, for annotating a single location in the text (e.g., a
gap), and partition elements, for the identification of phenomena such as line, page
or sentence breaks. Additional syntax will be introduced to provide shortcuts
("syntactic sugar") for the most common annotations. The Turtle syntax<sup>9</sup> will
also be taken into account for the specification of RDF triples. Finally, different
formats (e.g., Turtle, JSON, CSV, XML) will be chosen for exporting annotations
according to various data models (e.g., Open Annotation, NLP Interchange Format
(NIF)).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>9 https://www.w3.org/TR/turtle/</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Euclidis</given-names>
            <surname>Elementorum Libri</surname>
          </string-name>
          <string-name>
            <given-names>XV</given-names>
            :
            <surname>Accessit XVI De Solidorum Regularium Comparatione</surname>
          </string-name>
          .
          <article-title>Omnes perspicuis demonstrationibus, accuratisque scholiis illustrati</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Abrate</surname>
          </string-name>
          ,
          <source>Angelo Mario Del Grosso</source>
          ,
          <string-name>
            <given-names>Emiliano</given-names>
            <surname>Giovannetti</surname>
          </string-name>
          , Angelica Lo Duca, Damiana Luzzi, Lorenzo Mancini, Andrea Marchetti, Irene Pedretti, and
          <string-name>
            <given-names>Silvia</given-names>
            <surname>Piccini</surname>
          </string-name>
          .
          <article-title>Sharing cultural heritage: the Clavius on the Web project</article-title>
          .
          <source>In Language Resources and Evaluation Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ben</given-names>
            <surname>Adida</surname>
          </string-name>
          , Mark Birbeck,
          <string-name>
            <surname>Shane McCarron</surname>
            ,
            <given-names>and Steven</given-names>
          </string-name>
          <string-name>
            <surname>Pemberton</surname>
          </string-name>
          .
          <article-title>RDFa in XHTML: Syntax and processing</article-title>
          .
          <source>Recommendation, W3C</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Marco</given-names>
            <surname>Grassi</surname>
          </string-name>
          , Christian Morbidoni, Michele Nucci, Simone Fonda, and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Piazza</surname>
          </string-name>
          .
          <article-title>Pundit: augmenting web contents with semantics</article-title>
          .
          <source>Literary and linguistic computing</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Nancy</given-names>
            <surname>Ide</surname>
          </string-name>
          and
          <article-title>Jean Véronis</article-title>
          .
          <article-title>Text encoding initiative: Background and contexts</article-title>
          . Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ali</given-names>
            <surname>Khalili</surname>
          </string-name>
          , Sören Auer, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Hladky</surname>
          </string-name>
          .
          <article-title>The RDFa content editor: from WYSIWYG to WYSIWYM</article-title>
          .
          <source>In Computer Software and Applications Conference</source>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Rohit</given-names>
            <surname>Khare</surname>
          </string-name>
          and
          <article-title>Tantek Çelik. Microformats: a pragmatic path to the semantic web</article-title>
          .
          <source>In Proceedings of the 15th international conference on World Wide Web</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Steven</given-names>
            <surname>Ruggles</surname>
          </string-name>
          , Matthew Sobek, Catherine A Fitch, Patricia Kelly Hall, and
          <string-name>
            <given-names>Chad</given-names>
            <surname>Ronnander</surname>
          </string-name>
          .
          <source>Integrated public use microdata series: Version 2</source>
          .0 .
          <string-name>
            <given-names>Historical</given-names>
            <surname>Census</surname>
          </string-name>
          <string-name>
            <surname>Projects</surname>
          </string-name>
          , Department of History, University of Minnesota,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Pontus</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          , Sampo Pyysalo, Goran Topić,
          <string-name>
            <surname>Tomoko</surname>
            <given-names>Ohta</given-names>
          </string-name>
          , Sophia Ananiadou, and
          <article-title>Jun'ichi Tsujii. Brat: a web-based tool for NLP-assisted text annotation. In Conference of the European Chapter of the Association for Computational Linguistics</article-title>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tabea</surname>
            <given-names>Tietz</given-names>
          </string-name>
          , Jörg Waitelonis, Joscha Jäger, and
          <string-name>
            <given-names>Harald</given-names>
            <surname>Sack</surname>
          </string-name>
          .
          <article-title>Smart media navigator: Visualizing recommendations based on linked data</article-title>
          .
          <source>In 13th International Semantic Web Conference, Industry Track</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>