<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Doc2RDFa: Semantic Annotation for Web Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Beno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erwin Filtz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabrina Kirrane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel Polleres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Economics and Business</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ever since its conception, the amount of data published on the World Wide Web has been growing rapidly, to the point where it has become an important source of both general and domain-specific information. However, the majority of documents published online are not machine-readable by default. Many researchers believe that the answer to this problem is to semantically annotate these documents, and thereby contribute to the linked “Web of Data”. Yet, the process of annotating web documents remains an open challenge. While some efforts towards simplifying this process have been made in recent years, there is still a lack of semantic content creation tools that integrate well with information workers' toolsets. Towards this end, we introduce Doc2RDFa, an HTML rich text processor with the ability to automatically and manually annotate domain-specific content.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval</kwd>
        <kwd>Semantic web</kwd>
        <kwd>RDFa</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Advancements in the field of web technologies over the past two and a half decades have
led to an exponential rise in the amount of information available online, for example on Open
Government Data portals or in publicly available legal databases, often published only as
PDF or HTML [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], neither of which is designed to be machine-readable by default.
One possible means to unlock the knowledge stored in textual documents is to add
semantic annotations to the documents [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], thus making them machine-readable. Although
the benefits of semantically annotated documents have been widely recognized [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
there are still vast numbers of web documents published every day without any semantic
annotations, likely because adding semantic annotations to documents is a
laborious, error-prone and challenging task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The legal domain, for instance, holds a large
amount of unannotated text documents, possibly because automated annotation systems
have yet to be adopted in that domain. We therefore propose a domain-specific system
for the legal domain that increases the searchability and interlinking of
legal information by means of semantic annotations. As such, it would be highly beneficial
to have a tool which could be used to create, modify and publish documents in a format
that allows the document to be enhanced with additional semantic information. In order
to address this gap, we introduce our Doc2RDFa tool, which can be used to
automatically annotate web documents in a manner that enables them to be added
to the linked “Web of Data”. The contributions described in this paper can be
summarized as follows: (i) we extend an existing open source web-based rich text processor in
a manner that caters for the embedding of metadata into web documents using RDFa;
(ii) we provide a user-friendly interface for the creation and automatic annotation of
web documents; and (iii) we further enhance the tool with Natural Language
Processing (NLP) such that metadata can be automatically extracted from domain-specific web
based documents. Previous work in this area shows that there is a need for semantic
content authoring tools [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], but annotating documents is a time-consuming process which
often requires the domain-specific knowledge of experts [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ]. Existing general purpose
tools, like RDFaCE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or Loomp [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], lack accuracy when applied in a domain-specific
context. In the remainder of this paper we provide a short description of the use case in
Section 2. In Section 3 we describe the front-end and back-end as well as the workflow
of our system. The paper concludes with a summary and future work in Section 4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Use Case</title>
      <p>
        Legal documents residing in legal databases can often be viewed and downloaded in
HTML format. However, this serialization does not contain the semantic information
necessary to make web documents machine-readable. For instance, semantic
annotations in the form of keywords describing the case, information pertaining to the
deciding court, relevant dates or referenced laws, if available, could be used to improve
search over legal data. As such, there is a need for an all-encompassing system that can
be used to create, download, display, annotate, and store web documents. Web
documents can be represented using the W3C Resource Description Framework in attributes
(RDFa) serialization, which allows for the embedding of semantic information inside
HTML documents. When it comes to automatic text annotation, the focus is typically placed
on text analytics rather than the user interface. For manual annotation, however,
a rich text processor is highly desirable, as it is the tool of choice of many domain
specialists. With this in mind, and by drawing upon existing literature, we have defined a
set of requirements for our annotation system: (i) User friendliness: An essential aspect
of the editor is the user-friendliness of the interface. The tool should be usable by users
who are not familiar with Semantic Web concepts, a view also held by Heese et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
who claim that the complexities of the annotator need to be hidden away from the users.
(ii) Familiarity: It should be possible to both write and annotate documents using the
same front-end interface. Khalili and Auer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] argue that integrating the annotation
system into the environment in which the documents are created minimizes user
actions, thereby increasing the efficiency, user satisfaction, learnability and utility of the
system. (iii) Automation: The automatic annotation system should be able to accurately
annotate the document, thus requiring little to no manual effort from the user.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Doc2RDFa</title>
      <p>
        In this section we provide an overview of the core components of Doc2RDFa: (i) the
web-based rich text word processor; and (ii) the NLP and information extraction tool.
Following on from this we describe how users typically interact with the system.
With the advent of Web 2.0 and HTML5, document creation has been increasingly
moving towards browser-based solutions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A major benefit of web applications is
that they are operating-system agnostic. The only requirement is a modern browser
supporting HTML5 and JavaScript features. Furthermore, many web-based rich text editors
support the creation of plain HTML text files, the presentation of which can then further
be adapted with Cascading Style Sheets (CSS). Both HTML and CSS are open
standards, maintained by the World Wide Web Consortium (W3C). We opted to build
upon an existing rich text editor (TinyMCE, https://www.tinymce.com/), which is under active
development, can be easily customized, has an active community contributing plugins, and is licensed under the
LGPL license. We extend the editor shown in Fig. 1 by implementing a number of
features and style formats to facilitate the creation of annotated web documents. Documents
written in TinyMCE are saved as plain-text HTML files. In recent years, various
approaches have been used to annotate HTML documents, such as Microdata,
JSON-LD, and RDFa. We annotate the documents using RDFa, a W3C
recommendation. The Resource Description Framework (RDF), which underpins the Linked Data
Web, can be used to represent and link information, in a manner which can be
interpreted by both humans and machines. The main benefit of using RDFa over Microdata
is that the metadata are self-contained in the HTML source code, eliminating the need
to separate the semantic annotations from the web document markup.
      </p>
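      <p>
        To make the comparison concrete, the following snippet sketches how RDFa keeps metadata self-contained in the HTML source. This is an illustrative example only: the schema.org vocabulary and the property names are assumptions chosen for demonstration, not necessarily the ontology Doc2RDFa applies.
      </p>

```python
# Illustrative sketch (not Doc2RDFa output): RDFa attributes embed
# machine-readable metadata directly in the HTML markup, so the
# annotations never live apart from the document.
from xml.etree import ElementTree as ET

plain = "The court decided the case on 2019-03-01."

# The same sentence with RDFa annotations; vocabulary and property
# names (schema.org) are example choices.
annotated = (
    '<p vocab="http://schema.org/" typeof="Legislation">'
    'The <span property="legislationPassedBy">court</span> decided '
    'the case on <span property="datePublished">2019-03-01</span>.'
    '</p>'
)

root = ET.fromstring(annotated)
# The human-readable text is unchanged by the annotations ...
text = "".join(root.itertext())
# ... while the metadata can be lifted straight from the attributes.
props = {el.get("property"): el.text for el in root.iter("span")}
```

      <p>
        An RDFa-unaware consumer simply renders the paragraph as before, while an RDFa-aware consumer can extract the property/value pairs as RDF triples.
      </p>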
      <sec id="sec-3-1">
        <title>3.2 Natural Language Processing Back-End</title>
        <p>
          The back-end is not only responsible for loading and saving annotations, but is also
enriched with NLP capabilities such that it is possible to automatically propose document
annotations (e.g., named entities, temporal information). Here we use the General
Architecture for Text Engineering (GATE) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], an NLP tool, which is capable of reading
documents in various formats and subsequently annotating them in one or more
corpora. In particular, we leverage the Java Annotations Pattern Engine (JAPE) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in order
to extract information contained in the legal documents. JAPE rules operate on the
tokens produced by a tokenizer, which first splits the text into individual tokens of
different types, for instance words, numbers or punctuation. A JAPE rule consists of a left-hand
side (LHS) specifying the extraction pattern and a right-hand side (RHS) that adds the
annotation to the document and specifies its features: the property used for this
information according to the applied ontology, as well as a background color used to
highlight the annotation in the front-end. Given that the editor itself is
domain-independent, the extraction rules need to be configured for each specific domain;
extending the system to other domains therefore requires adding new extraction rules for common patterns.
Furthermore, we are not restricted to patterns, for which JAPE rules are the first choice.
For a limited set of words or concepts that should be detected, for instance the courts
in a jurisdiction, gazetteers are recommended: lists of words that are looked up directly
in the text. Gazetteers are only advisable for small sets of lookup words or concepts
that do not change regularly and remain short enough for the lists to be managed and
updated manually. The third extension
possibility our system provides is to use machine learning techniques for annotation.
        </p>
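        <p>
          The extraction mechanisms described above can be sketched compactly in Python (rather than actual JAPE or GATE gazetteer syntax). The tokenizer, the date pattern, the property names, the highlight colours, and the court list are all hypothetical examples used purely for illustration.
        </p>

```python
# Sketch of the three described ingredients: a tokenizer, a pattern
# "rule" (LHS pattern, RHS feature structure), and a gazetteer lookup.
# All names, colours, and list entries are illustrative assumptions.
import re

# Gazetteer: a small, manually maintained list of lookup terms.
GAZETTEER_COURTS = {"Supreme Court", "Constitutional Court"}

def tokenize(text):
    # Split the text into number, word and punctuation tokens.
    return re.findall(r"\d+|\w+|[^\w\s]", text)

def annotate(text):
    """Return (matched span, feature dict) pairs for dates and courts."""
    annotations = []
    # Pattern rule: match a dd.mm.yyyy date (LHS) and attach a property
    # plus a highlight colour (RHS features).
    for m in re.finditer(r"\b\d{2}\.\d{2}\.\d{4}\b", text):
        annotations.append((m.group(), {"property": "decisionDate",
                                        "background": "#ffe08a"}))
    # Gazetteer rule: plain lookup of each listed term in the text.
    for court in sorted(GAZETTEER_COURTS):
        if court in text:
            annotations.append((court, {"property": "decidingCourt",
                                        "background": "#b5e2a0"}))
    return annotations

result = annotate("The Supreme Court ruled on 12.03.2019.")
```

        <p>
          A real deployment would express the pattern rule in JAPE and the lookup list as a GATE gazetteer, but the division of labour is the same: patterns for productive constructions such as dates, gazetteers for small, stable vocabularies such as court names.
        </p>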
      </sec>
      <sec id="sec-3-2">
        <title>3.3 Workflow</title>
        <p>Fig. 2. Doc2RDFa workflow</p>
        <p>The workflow of Doc2RDFa is shown in Figure 2. The process starts with the user
interacting with the front-end text editor. HTML documents can be loaded from either
local storage or retrieved directly from a remote web server. Once a document has been
either written or loaded, the user starts the automatic annotation process by pressing the
"Auto Annotate" button. The source code of the document is then sent to the RESTFul
API and loaded by the GATE NLP, which annotates the document based on a set of
pre-defined grammar rules. The RESTful API then sends the annotated HTML code
as a response back to the client. Once the client retrieves the annotated document, it
loads its contents and replaces the original unannotated HTML code. The annotated
text is highlighted in the editor, enabling the user to check whether everything has been
correctly annotated. Finally, missing, or incorrect annotations can be manually fixed
using the "Insert Annotation" button.</p>
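        <p>
          The round trip just described can be sketched as follows. This is a simplified model: the RESTful API and GATE back-end are stood in for by a plain function, the annotation logic is a one-line placeholder, and the property name is a hypothetical example.
        </p>

```python
# Simplified model of the "Auto Annotate" round trip: the back-end is a
# plain function standing in for the RESTful API + GATE, and annotation
# is simulated by wrapping one known term in an RDFa span.
def annotation_service(html):
    # Server side: GATE would apply its grammar rules here; we simulate
    # a single gazetteer hit. "decidingCourt" is an example property.
    return html.replace(
        "Supreme Court",
        '<span property="decidingCourt">Supreme Court</span>')

class Editor:
    # Client side: a stand-in for the front-end editor holding the
    # current HTML source of the document.
    def __init__(self, html):
        self.html = html

    def auto_annotate(self):
        # Send the source to the annotation service and replace the
        # editor content with the annotated response.
        self.html = annotation_service(self.html)

editor = Editor("<p>The Supreme Court ruled today.</p>")
editor.auto_annotate()
```

        <p>
          The real system performs the same exchange over HTTP: the client posts the document source, the back-end returns annotated HTML, and the editor reloads it so the user can review and manually correct the highlighted annotations.
        </p>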
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Summary and Future Work</title>
      <p>In the present paper, we introduced Doc2RDFa, an HTML rich-text processor with
the ability to automatically and manually annotate domain-specific content. We
subsequently discussed how Doc2RDFa can be used to automatically annotate legal
documents that can be integrated in a pipeline of legal corpus creation process, or used to
modify existing web documents. For future work we aim to extend the editor with
additional features, such as support for triple store databases, and enhancing the search
facility to include faceted browsing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beno</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Figl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Perception of key barriers in using and publishing open data</article-title>
          .
          <source>JeDEM-eJournal of eDemocracy and Open Government</source>
          <volume>9</volume>
          (
          <issue>2</issue>
          ),
          <fpage>134</fpage>
          -
          <lpage>165</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>JAPE: a Java Annotation Patterns Engine</article-title>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Getting more out of biomedical documents with GATE's full lifecycle open source text analytics</article-title>
          .
          <source>PLOS Computational Biology</source>
          <volume>9</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          (02
          <year>2013</year>
          ), https://doi.org/10.1371/journal.pcbi.1002854
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comparison of knowledge extraction tools for the semantic web</article-title>
          .
          <source>In: Extended Semantic Web Conference</source>
          . pp.
          <fpage>351</fpage>
          -
          <lpage>366</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Godwin-Jones</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Emerging technologies: Web-writing 2.0: Enabling, documenting, and assessing writing online</article-title>
          .
          <source>Language Learning &amp; Technology</source>
          <volume>12</volume>
          (
          <issue>2</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Heese</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luczak-Rösch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paschke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oldakowski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Streibel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>One click annotation</article-title>
          .
          <source>In: SFSW</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Khalili</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>User interfaces for semantic authoring of textual content: A systematic literature review</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web 22</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Khalili</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hladky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The RDFa content editor - from WYSIWYG to WYSIWYM</article-title>
          .
          <source>In: Computer Software and Applications Conference (COMPSAC)</source>
          ,
          <source>2012 IEEE 36th Annual</source>
          . pp.
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
          . IEEE (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kiyavitskaya</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeni</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mich</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cordy</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mylopoulos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Text mining through semi automatic semantic annotation</article-title>
          .
          <source>In: International Conference on Practical Aspects of Knowledge Management</source>
          . pp.
          <fpage>143</fpage>
          -
          <lpage>154</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>A survey on annotation tools for the biomedical literature</article-title>
          .
          <source>Briefings in bioinformatics 15(2)</source>
          ,
          <fpage>327</fpage>
          -
          <lpage>340</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shadbolt</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The semantic web revisited</article-title>
          .
          <source>IEEE intelligent systems 21(3)</source>
          ,
          <fpage>96</fpage>
          -
          <lpage>101</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>