<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Tagging with Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>John Cuzzola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoran Jeremic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ebrahim Bagheri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Jovanovic</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Belgrade</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Making sense of text is a challenge for computers, particularly given the ambiguity inherent in language. Annotators that supply context to text continue to be developed using a variety of techniques. In this paper, we describe Denote, our annotator that uses a structured ontology, machine learning, and statistical analysis to perform tagging and topic discovery. A short screencast is available at http://youtu.be/espItTRQVzY, and links to demonstrations are provided in the conclusion.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic web</kwd>
        <kwd>disambiguation</kwd>
        <kwd>entity recognition</kwd>
        <kwd>annotators</kwd>
        <kwd>tagging</kwd>
        <kwd>wikifying</kwd>
        <kwd>linked-data</kwd>
        <kwd>LOD</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The availability of structured linked open data, through
initiatives such as the “Linked Open Data (LOD)” project<sup>1</sup>,
has given rise to a new class of annotators for unstructured
text. Annotators such as TagME [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], DBPedia Spotlight [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
Alchemy<sup>2</sup> all offer such capability. In this systems paper, we
describe Denote, our semantic tagging platform based on
Linked Open Data. In Section II, we outline Denote’s
algorithm and describe its vocabulary and key features. In Section III,
we demonstrate these features and compare Denote’s output
with that of other annotators.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DENOTE’S DESIGN</title>
      <p>Denote searches its ontology for concepts similar to the
input text by performing keyword extraction and then calculating
a weighted Jaccard coefficient on resource descriptions. This
provides a measure of text similarity. For each resource, its
known categories (defined in the ontology) are subjected to a
Bayesian filter that excludes those resources and categories that
do not appear relevant. This provides a measure of semantic
similarity. The surviving resources are then used for the
annotations. Denote’s output takes the form of a synopsis
whose lexicon is given in Table I. The output is a single
sentence per annotation, with a set of relevant URIs sorted in
order of likelihood, together with confidence and any available support
statistics.</p>
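      <p>The text-similarity step described above can be sketched as follows. This is a minimal illustration rather than Denote's implementation: the paper does not state how terms are weighted, so raw term frequency is assumed here, and the stop-word list and the function names (extract_keywords, weighted_jaccard, rank_resources) are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

# Hypothetical stop-word list; a real keyword extractor would be richer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "with"}


def extract_keywords(text):
    """Return lower-cased term frequencies with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of per-term minima over sum of per-term maxima.

    Assumption: weights are term frequencies held in Counter objects,
    so a missing term counts as weight 0.
    """
    terms = set(a) | set(b)
    numerator = sum(min(a[t], b[t]) for t in terms)
    denominator = sum(max(a[t], b[t]) for t in terms)
    return numerator / denominator if denominator else 0.0


def rank_resources(text, descriptions):
    """Rank resource URIs by similarity of their description to the text.

    descriptions maps a resource URI to its textual description.
    """
    query = extract_keywords(text)
    scored = [(uri, weighted_jaccard(query, extract_keywords(d)))
              for uri, d in descriptions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```
]]></preformat>
      <p>Calling rank_resources with the input text and a dictionary of resource descriptions returns (URI, score) pairs in descending order of similarity; resources scoring near zero would be natural candidates to drop before the Bayesian filter is applied.</p>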
      <p>“Text” [Is_A {}] [[[With_Value •] Of_Units •] | Acting_As {}] [Cat_Of {}]</p>
      <p>Fig. 1. The output of an annotated text.</p>
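      <p>To make the synopsis grammar of Fig. 1 concrete, the sketch below renders one annotation as a single sentence. It reflects one reading of the grammar, in which the numeric part (With_Value, Of_Units) and Acting_As are alternatives and With_Value cannot appear without Of_Units. The function name render_synopsis and its parameters are hypothetical and are not part of Denote.</p>
      <preformat><![CDATA[
```python
def render_synopsis(text, is_a=(), acting_as=(), value=None, units=None,
                    cat_of=()):
    """Render one annotation following the Fig. 1 grammar (assumed reading)."""

    def braces(uris):
        # URI sets are written as {uri1, uri2, ...}
        return "{" + ", ".join(uris) + "}"

    if acting_as and (value is not None or units is not None):
        raise ValueError("Acting_As and With_Value/Of_Units are alternatives")
    if value is not None and units is None:
        raise ValueError("With_Value requires Of_Units")

    parts = ['"' + text + '"']
    if is_a:
        parts.append("Is_A " + braces(is_a))
    if value is not None:
        parts.append("With_Value " + str(value))
    if units is not None:
        parts.append("Of_Units " + units)
    if acting_as:
        parts.append("Acting_As " + braces(acting_as))
    if cat_of:
        parts.append("Cat_Of " + braces(cat_of))
    return " ".join(parts)
```
]]></preformat>
      <p>For example, render_synopsis("lettuce", is_a=["/Lettuce"], acting_as=["/mainIngredient", "/ingredient"]) yields a sentence of the same shape as the Denote annotations reported in the demonstration section.</p>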
      <p>
        Denote uses a database of linked open data, represented
in the form of n-triples (&lt;subject&gt;&lt;predicate&gt;&lt;object&gt;), to
perform annotation, similarity identification,
disambiguation and topic categorization. Denote's database is
DBPedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an ontology derived from Wikipedia. In this
respect, it resembles DBPedia Spotlight (DBPedia) and
TagME (Wikipedia). However, Denote distinguishes itself in
three key ways. First, it attempts to assign context to the
annotations through its [Acting_As] lexicon. Second, it attempts
to annotate numbers [With_Value] through statistical
analysis of similar concepts whose &lt;predicate&gt;:&lt;object&gt; pairs are
of the same data type [Of_Units]. Third, Denote has an
extensive list of topic categories, made available through
DBPedia’s &lt;dcterms:subject&gt; predicate, which it assigns to
its annotations [Cat_Of]. These key differences were the
motivation for Denote’s creation. Other annotators
first spot word phrases and then link them to the
disambiguated top-surface form. Denote, in contrast, attempts
to find related concepts that are then used to
determine the properties of the spotted word phrases. This
allows for role-based annotations [Acting_As]. We call this
process deep tagging, as opposed to the shallow tagging of
Denote’s peers.
      </p>
      <p>1 http://linkeddata.org/</p>
      <p>2 http://www.alchemyapi.com/</p>
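      <p>As a rough illustration of how an n-triples database can feed the [Cat_Of] and [Of_Units] parts of the synopsis, the following sketch parses a simplified subset of N-Triples and groups dcterms:subject objects by resource. The regular expression is an assumption that ignores blank nodes and escaped quotes, the helper names are hypothetical, and nothing here is taken from Denote's code.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict

# Simplified N-Triples pattern: URI subject, URI predicate, and either a
# URI object or a plain/typed/language-tagged literal (no escaped quotes).
TRIPLE = re.compile(
    r'^<(?P<s>[^>]+)>\s+<(?P<p>[^>]+)>\s+'
    r'(?:<(?P<uri>[^>]+)>'
    r'|"(?P<lit>[^"]*)"(?:\^\^<(?P<dtype>[^>]+)>|@[A-Za-z-]+)?)'
    r'\s*\.\s*$'
)

DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def parse_triple(line):
    """Return (subject, predicate, object, datatype) or None if unparsable."""
    m = TRIPLE.match(line.strip())
    if m is None:
        return None
    obj = m.group("uri") if m.group("uri") is not None else m.group("lit")
    return m.group("s"), m.group("p"), obj, m.group("dtype")


def categories_by_resource(lines):
    """Collect the dcterms:subject categories attached to each resource."""
    cats = defaultdict(set)
    for line in lines:
        parsed = parse_triple(line)
        if parsed and parsed[1] == DCTERMS_SUBJECT:
            cats[parsed[0]].add(parsed[2])
    return dict(cats)
```
]]></preformat>
      <p>The datatype returned for typed literals is the kind of information that an [Of_Units] annotation could draw on, while the grouped categories correspond to candidate [Cat_Of] values before filtering.</p>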
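      <table-wrap id="tab-1">
        <label>TABLE I</label>
        <caption>
          <title>DENOTE’S ANNOTATION LEXICON EXPLAINED</title>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Lexicon</th>
              <th>Explanation</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Is_A {}</td>
              <td>"is a", "is an", "is used by". Asks: What is it?</td>
            </tr>
            <tr>
              <td>Acting_As {}</td>
              <td>Context/role. Asks: How is it used?</td>
            </tr>
            <tr>
              <td>With_Value •</td>
              <td>If number, asks: What is the number value?</td>
            </tr>
            <tr>
              <td>Of_Units •</td>
              <td>If number, asks: What are the units of measure?</td>
            </tr>
            <tr>
              <td>Cat_Of {}</td>
              <td>Asks: What are the relevant topic categories?</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>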
    </sec>
    <sec id="sec-3">
      <title>DEMONSTRATION</title>
      <p>In this section, we describe three core functions in Denote’s toolkit: text annotation, number annotation, and category disambiguation.</p>
      <sec id="sec-3-1">
        <title>A. The Text Annotator</title>
        <p>Table II demonstrates Denote’s capabilities
compared with TagME and DBPedia Spotlight using the same
input text: “BLT. The sub that proves great things come in
threes. In this case, those three things happen to be crisp
bacon, lettuce and juicy tomato. While there's no scientific
way of proving it, this BLT might be the most perfect BLT
sandwich in existence.” The default configurations of Denote,
TagME and Spotlight were left unchanged. Spotlight does not
perform category analysis. TagME gives a topic listing, but
this list is simply the annotated text rather than a separate
categorization. Consequently, the [Cat_Of] portion of
Denote’s synopsis is omitted here and left for part C.</p>
        <p>DBPedia Spotlight was the least effective, with the
fewest annotations and an incorrect disambiguation of BLT
as “Bizarre Love Triangle”. TagME performed well, producing
numerous annotations with few mistakes (it incorrectly tagged
the words “crisp” and “juicy”). Denote and TagME produced
similar annotations, but only Denote’s [Acting_As]
vocabulary provided context information. For example,
both correctly annotated “lettuce” to its surface form, but
only Denote identified that lettuce was acting as a main
ingredient. Similarly, Denote identified the phrase “bacon,
lettuce, and juicy tomato” as acting as an alias or alternate name.</p>
        <table-wrap id="tab-2">
          <label>TABLE II</label>
          <caption>
            <title>ANNOTATION OF “BLT. THE […] IN EXISTENCE.” WITH DENOTE, TAGME AND DBPEDIA SPOTLIGHT.</title>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Source</th>
                <th>Entries</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Spotted text</td>
                <td>BLT sandwich; sandwich; in existence; sub; crisp; bacon; lettuce; juicy; tomato; bacon, lettuce and juicy tomato; scientific way</td>
              </tr>
              <tr>
                <td>Denote (DBPedia)</td>
                <td>Is_A {/BLT} Acting_As {/name}; Is_A {/BLT} Acting_As {/name}; Is_A {/Bacon_sandwich, Bacon, Side_bacon} Acting_As {/mainIngredient, /ingredient}; Is_A {/Lettuce} Acting_As {/mainIngredient, /ingredient}; Is_A {/Tomato} Acting_As {/mainIngredient, /ingredient}; Acting_As {/alias, /alternateName}</td>
              </tr>
              <tr>
                <td>TagME (Wikipedia)</td>
                <td>/BLT; /Submarine_sandwich; /Potato_chip; /Bacon; /Lettuce; /Juice; /Tomato; /Scientific_method</td>
              </tr>
              <tr>
                <td>DBPedia Spotlight (DBPedia/Wikipedia)</td>
                <td>/Bizarre_Love_Triangle; /Sandwich; /Existence</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>B. The Number Annotator</title>
        <p>The number annotator is unique among
annotators in that Denote attempts to identify text that is
normally associated with a numerical value. Using statistical
analysis of the Jaccard/Bayes-discovered list of similar
concepts, Denote attempts to match number values with
annotated text. Figure 2 shows the result for the input text “The
radio shack color computer has only 16 kb of memory”.</p>
        <p>“memory” With_Value 16 Of_Units #int Cat_Of
{/Home_Computers, TRS-80_Color_Computer}</p>
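        <p>The paper does not specify which statistics the number annotator uses. As one plausible sketch, under the assumption that a number is attached to the numeric property whose values among the similar concepts make it least surprising, the code below scores each (predicate, datatype) pair by the absolute z-score of the observed number. The function name and the input layout are hypothetical.</p>
        <preformat><![CDATA[
```python
from statistics import mean, pstdev


def best_property_for_number(number, similar_concepts):
    """Pick the (predicate, datatype) whose observed values best fit number.

    similar_concepts is a list of dicts mapping a predicate name to a
    (numeric value, datatype) tuple. Assumption: "best fit" means the
    smallest absolute z-score of number within that predicate's values.
    """
    samples = {}
    for concept in similar_concepts:
        for predicate, (value, datatype) in concept.items():
            samples.setdefault((predicate, datatype), []).append(value)

    best, best_z = None, None
    for key, values in samples.items():
        if len(values) < 2:
            continue  # too few observations to estimate a spread
        spread = pstdev(values) or 1.0  # guard against zero deviation
        z = abs(number - mean(values)) / spread
        if best_z is None or z < best_z:
            best, best_z = key, z
    return best
```
]]></preformat>
        <p>Given similar concepts that each carry a memory size and a release year, a value of 16 would be matched to the memory property, and the property's data type would supply the [Of_Units] part of the synopsis.</p>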
      </sec>
      <sec id="sec-3-3">
        <title>C. The Categorizer</title>
        <p>Denote has access to over 656,000 categories defined under
DBPedia’s &lt;dcterms:subject&gt; predicate. A Bayesian filter is
applied to each similar concept to determine whether the
subject(s) to which the concept belongs are contextually
related to the text being annotated. The DBPedia Spotlight demo
does not perform topic category determination. TagME’s
demo performs topic categorization by simply listing its
annotated text in a tag-cloud structure rather than as a defined
set of category topics. Consequently, we compare Denote’s
output with Alchemy. The Alchemy annotator can perform
named entity extraction from a list of 200+ defined
(sub)entities. In this comparison, the “storyline” of The Godfather
movie was retrieved from the Internet Movie Database
(IMDb) and annotated. Table III gives the results.</p>
        <p>Alchemy’s results were limited to the primitive named entity
types of city and person, with the exception of an incorrect
categorization of “television show”. In contrast, Denote
tagged text into rich categories that include
“Italian-American novels”, “organized crime novels”, and
“Godfather characters”.</p>
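        <p>The Bayesian filter mentioned at the start of this subsection is not specified in detail. The sketch below shows one common form it could take, a naive Bayes comparison with add-one smoothing between a category's word counts and a background model. The word counts, the prior, and the function names are all assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def log_likelihood(keywords, counts, vocabulary_size):
    """Log P(keywords | model) for a unigram model with add-one smoothing."""
    counts = Counter(counts)  # missing words count as zero
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocabulary_size))
               for w in keywords)


def filter_categories(keywords, category_counts, background_counts,
                      prior=0.5):
    """Keep categories whose posterior log-odds against the background
    model are positive. prior is the assumed P(category is relevant)."""
    vocabulary = set(background_counts) | set(keywords)
    for counts in category_counts.values():
        vocabulary |= set(counts)
    size = len(vocabulary)

    background = log_likelihood(keywords, background_counts, size)
    kept = {}
    for category, counts in category_counts.items():
        log_odds = (math.log(prior) - math.log(1 - prior)
                    + log_likelihood(keywords, counts, size) - background)
        if log_odds > 0:
            kept[category] = log_odds
    return kept
```
]]></preformat>
        <p>In this sketch, category_counts would be built from the descriptions of resources already assigned to each category, so that a category survives only when the input keywords are more probable under it than under the collection as a whole.</p>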
      </sec>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSION</title>
      <p>In this paper, we demonstrated Denote, a semantic
annotator based on the DBPedia ontology, and compared its
features with those of text taggers in the same class. Denote’s
middleware engine demo is available at
http://ls3.rnet.ryerson.ca/annotator, while a
developer-friendly demo is at http://inextweb.com/denote_demo.
Denote’s annotation capabilities are exposed through a
RESTful interface, allowing third-party developers to create
their own semantic-aware applications. The result, we hope,
is an improvement in information search and retrieval for the
end user. Our future work involves parallelisation to scale
the service to a large number of concurrent clients. We are
also developing proof-of-concept demonstrations, including a
semantic movie recommender whose database will be
contributed as a dataset to the LOD project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          , “
          <article-title>TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)</article-title>
          ”,
          <source>In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10)</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , “
          <article-title>DBpedia spotlight: shedding light on the web of documents</article-title>
          ”,
          <source>In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics '11)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , “
          <article-title>DBpedia - A crystallization point for the Web of Data</article-title>
          ”,
          <source>Web Semant.</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          ,
          <month>September</month>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>