<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>First steps beyond the bag-of-words representation of short texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Ferragina</string-name>
          <email>ferragina@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ugo Scaiella</string-name>
          <email>scaiella@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Dipartimento di Informatica, University of Pisa</institution>,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We address the problem of enhancing the classical bag-of-words representation of texts by designing and engineering Tagme, the first system that performs an accurate and on-the-fly semantic annotation of short texts using Wikipedia as its knowledge base. Several experiments show that Tagme outperforms state-of-the-art algorithms when they are adapted to work on short texts, and that it is fast and competitive on long ones. This leads us to argue favorably about Tagme's application to clustering, classification and retrieval systems in challenging scenarios such as web snippets, tweets, news, ads, etc.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Motivation and background</title>
      <p>
The typical IR approach to indexing, clustering, classification and retrieval, just
to name a few tasks, is based on the bag-of-words paradigm. In recent years a
good deal of work has attempted to go beyond this paradigm with the goal of
improving the search experience on (unstructured) textual data. In our work we
are concerned with the task of adding structure to unstructured data, namely
the identification of sequences of terms (aka spots) in the input text and their
annotation with a Wikipedia page. This annotation process provides a powerful
contextualization of the input text, so that each subsequent IR task can be
improved by leveraging the huge semantic network provided by Wikipedia.
Recently, several works (see e.g. [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ] and references therein) have addressed the problem of
annotating texts with hyperlinks to Wikipedia pages.
      </p>
      <p>We add to this flow of work the specialty that the input texts to be annotated
are short, namely composed of a few tens of terms. The contexts of use
we have in mind are the annotation of the snippets of search-engine results,
the tweets of a Twitter channel, the items of a news feed, the posts of a
blog, advertisement messages, etc. It is easy to argue that these poorly
composed texts pose new challenges for the efficiency and effectiveness of
the annotation process, which (1) should be very fast, because in those contexts
data may be retrieved at query time and thus cannot be pre-processed, and
(2) should be designed properly, because the input texts are so short that it is
difficult to mine the significant statistics that are available when texts are
long. To address these issues, we have designed and implemented Tagme, the
first software system that, on the fly and with high precision/recall, annotates
short texts with pertinent hyperlinks to Wikipedia pages.</p>
      <p>As an example, let us consider the following news item: "Diego Maradona won
against Mexico". Our goal is to detect "Diego Maradona" and "Mexico" as
spots, and then hyperlink them to the Wikipedia pages that deal with the
former coach of Argentina and the football team of Mexico. Tagme uses as spots (to
be annotated) the sequences of terms composing the anchor texts that occur
in Wikipedia pages, and it uses as the possible senses for each spot the (possibly
many) pages pointed to in Wikipedia by that spot/anchor ("Diego Maradona" is the
name of two persons, and "Mexico" points to 154 different pages in Wikipedia).
Tagme selects, among the potentially many available spot-to-page mappings, the
most pertinent ones by finding a collective agreement among them via new scoring
functions that are fast to compute and accurate in the finally produced annotation.</p>
      <p>
        A preliminary description of Tagme was published as a poster in the Proceedings
of ACM CIKM 2010. Tagme is available for testing at http://tagme.di.unipi.it.
What follows is a sketch of the main ideas and experimental results concerning
Tagme; the interested reader is referred to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for details.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 The anatomy of TAGME</title>
      <p>The annotation process of Tagme is composed of two main phases: anchor
disambiguation and anchor pruning.</p>
      <p>
        Anchor disambiguation. This task judiciously cross-references
each anchor a ∈ A_T found in the input text T with one pertinent page p_a of
Wikipedia. Tagme selects the best association a → p_a by computing a score for
each possible page p_a linked to a in Wikipedia (we call this set Pg(a)); the score
is based on a new notion of "collective agreement" between the page p_a and the
pages that can be associated with all the other anchors detected in T, i.e. the anchors
in A_T. This agreement is evaluated by means of a voting scheme that computes,
for each other anchor b ∈ A_T \ {a}, its vote for the annotation a → p_a. Given
that b may be linked to many pages in Wikipedia (i.e. |Pg(b)| &gt; 1), we compute
this vote as the average relatedness between each page p_b potentially linked to
b and the sense p_a we wish to associate with a. However, we argue that not all
possible pages of b have the same (statistical) significance, so we weight each
relatedness by the commonness of the page p_b with respect to b, denoted
Pr(p_b | b) and computed as the prior probability that b points to p_b over all links of
b in Wikipedia. Hence the vote given by anchor b to the annotation a → p_a is

vote_b(p_a) = (1 / |Pg(b)|) · Σ_{p_b ∈ Pg(b)} rel(p_b, p_a) · Pr(p_b | b),

where the relatedness rel(p_b, p_a)
between the two Wikipedia pages p_a and p_b is computed as suggested in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
by exploiting the intersection of the incoming links to p_a and p_b. The overall
score for the annotation a → p_a is computed as the sum of the votes given by all
the anchors b in T. This score alone is not enough to obtain an accurate disambiguation,
so we first filter out candidate pages in Pg(a) via a properly set threshold, and
then select the best page by deploying the commonness scores.
      </p>
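      <p>The voting scheme above can be sketched in a few lines of code. This is a minimal, hedged illustration: the anchor-to-page statistics and the page count W below are toy values (not real Wikipedia data), and rel is the incoming-link relatedness measure of [5]; Tagme's actual data structures and tuning differ.</p>

```python
import math

# Toy Wikipedia statistics (illustrative values only).
# links[a][p] = number of times anchor text a links to page p.
links = {
    "diego maradona": {"Diego_Maradona": 90, "Diego_Maradona_Jr": 10},
    "mexico": {"Mexico": 60, "Mexico_national_football_team": 40},
}
# in_links[p] = set of pages linking to p; W = total number of pages.
in_links = {
    "Diego_Maradona": {"p1", "p2", "p3"},
    "Diego_Maradona_Jr": {"p3"},
    "Mexico": {"p1", "p4"},
    "Mexico_national_football_team": {"p2", "p3"},
}
W = 1000

def commonness(b, p_b):
    """Pr(p_b | b): prior probability that anchor b points to page p_b."""
    return links[b][p_b] / sum(links[b].values())

def rel(p1, p2):
    """Relatedness of two pages from the overlap of their incoming links [5]."""
    a, b = in_links[p1], in_links[p2]
    common = len(a.intersection(b))
    if common == 0:
        return 0.0
    return (math.log(max(len(a), len(b))) - math.log(common)) / \
           (math.log(W) - math.log(min(len(a), len(b))))

def vote(b, p_a):
    """vote_b(p_a): average commonness-weighted relatedness of b's senses to p_a."""
    pg_b = links[b]
    return sum(rel(p_b, p_a) * commonness(b, p_b) for p_b in pg_b) / len(pg_b)

def score(a, p_a, anchors):
    """Total score of the annotation a -> p_a: sum of the other anchors' votes."""
    return sum(vote(b, p_a) for b in anchors if b != a)
```

      <p>On the toy data, each sense of "mexico" is scored by the commonness-weighted votes of "diego maradona", so the sense whose page is better connected to the coach's page prevails.</p>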
      <p>Anchor pruning. This step detects and possibly prunes some of the candidate
annotations produced by the disambiguation phase, when they are considered
not meaningful. These "bad annotations" are detected via a simple yet
effective scoring function that takes into account only two features: (1) the link
probability lp(a) of the anchor a, computed as the number of times a is
used as an anchor divided by the total number of occurrences of a in Wikipedia
(as an anchor or not), and (2) the coherence between its candidate annotation
a → p_a (assigned by the disambiguation phase) and the candidate annotations
of the other anchors in T, computed with the relatedness function presented above.
This pruning score, say ρ(a → p_a), is then compared against a properly set
threshold ρ_NA, so that if ρ(a → p_a) &lt; ρ_NA then the annotation for a is discarded.
The parameter ρ_NA allows us to balance the recall vs. precision of the annotation
process, and we deeply investigated its impact in our experiments.</p>
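      <p>A minimal sketch of the pruning test follows. The occurrence counts are illustrative toy values, and the combination of the two features is assumed here to be a plain average, which is a simplification of Tagme's tuned scoring function.</p>

```python
# Toy corpus statistics: how often a string occurs as a link anchor
# vs. how often it occurs at all in Wikipedia (illustrative values).
as_anchor = {"diego maradona": 800, "won": 5}
occurrences = {"diego maradona": 1000, "won": 90000}

def link_probability(a):
    """lp(a): fraction of the occurrences of a that are used as an anchor."""
    return as_anchor[a] / occurrences[a]

def coherence(p_a, other_pages, rel):
    """Average relatedness between p_a and the pages chosen for the other anchors."""
    if not other_pages:
        return 0.0
    return sum(rel(p_a, q) for q in other_pages) / len(other_pages)

def keep_annotation(a, p_a, other_pages, rel, threshold):
    """Keep a -> p_a only if its pruning score reaches the threshold rho_NA.
    The combination of the two features is assumed to be a plain average."""
    rho = 0.5 * (link_probability(a) + coherence(p_a, other_pages, rel))
    return rho >= threshold
```

      <p>Raising the threshold trades recall for precision: rarely-linked anchors like "won" get a low lp and are pruned unless their coherence is very high.</p>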
    </sec>
    <sec id="sec-3">
      <title>3 Experimental evaluation</title>
      <p>
        We evaluated Tagme over a set of short texts randomly drawn from Wikipedia,
composed of about 20 terms (like web snippets) and containing an average of
about 10 spots. We compared the annotations produced by Tagme against the
links attached by Wikipedia editors. We also evaluated Tagme on the dataset
proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that is composed of manually annotated long texts drawn from
the web. To annotate long texts containing more than 10 spots, Tagme shifts a
window of about 10 spots over the long input text and applies our algorithms
on each window incrementally; this way we did not have to change Tagme's
architecture, and Tagme scales linearly with the number of anchors in the
input text. For lack of space we cannot report detailed figures about the real
performance of Tagme, nor can we detail the comparisons against Milne and
Witten's and Chakrabarti's systems (a deeper evaluation is presented in our
technical report [3]); however, we briefly state that:
      </p>
      <p>- on short texts, our annotator outperforms M&amp;W's by yielding an overall
F-measure of about 78%, an absolute improvement of more than 8%;
- on long texts, Tagme proved competitive with (if not superior to) M&amp;W's
and Chakrabarti's systems, even though our "shift-based" approach clearly
gives an advantage to our competitors, which deploy the full input text;
- a deep evaluation of the scalability and time efficiency of Tagme showed
that it annotates a short text of about 10 spots in 18 ms, while M&amp;W's and
Chakrabarti's systems take about 95 ms and more than 2 s, respectively, on
comparable hardware.</p>
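      <p>The "shift-based" processing of long texts described above can be sketched as follows; annotate_window is a hypothetical stand-in for Tagme's short-text pipeline (disambiguation plus pruning), and the window size and merging policy are illustrative assumptions:</p>

```python
WINDOW = 10  # roughly the number of spots Tagme handles as a "short text"

def annotate_long_text(anchors, annotate_window, window=WINDOW):
    """Shift a window of about `window` spots over the anchor sequence and
    run the short-text annotator on each window; overall work grows linearly
    with the number of anchors."""
    annotations = {}
    for start in range(0, len(anchors), window):
        chunk = anchors[start:start + window]
        # Each window is annotated independently; later windows extend
        # the accumulated result incrementally.
        annotations.update(annotate_window(chunk))
    return annotations
```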
    </sec>
    <sec id="sec-4">
      <title>4 Applications and future work</title>
      <p>We believe that Tagme has implications that go far beyond the enrichment
of a text with explanatory links. The most interesting benefit of this annotation
process is the structured knowledge attached to textual fragments, which lets us
leverage the semantic network provided by Wikipedia to improve IR tasks
that nowadays are mainly addressed with the bag-of-words paradigm, with all
its well-known limitations.</p>
      <p>
        We are currently investigating the impact of Tagme's annotation on several
application domains:
- The on-the-fly labeled clustering of search-engine results (see also Clusty.com,
Carrot2, and the survey [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Most of those systems are based on
syntactic and statistical features, so they could benefit from Tagme's annotation
to improve the effectiveness of both the labeling and the clustering phases. For
this application we are considering the deployment of spectral clustering
techniques [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
- IR systems for vertical domains. Since in this context other structured
information (coming from other sources such as databases, ontologies, etc.) could
be available, we are investigating how to deploy such information in the
Tagme annotation process in order to further improve its performance. As
a case study, we are considering the application of Tagme to the annotation
of Italian medical prescriptions.
- Web advertising. The explanatory links and the structured knowledge
attached to plain texts by Tagme could allow the efficient and effective
resolution of the ambiguity and polysemy issues that often occur when advertisers'
keywords are matched against the content of Web pages offering display ads.
      </p>
      <p>
        Finally, we are studying the impact of other tools and information sources
in Tagme to better relate and/or assign pages/senses to text anchors (e.g.
Natural Language Processing tools, other by-products of Wikipedia such as
DBpedia.org and YAGO, or the huge amount of structured information provided
by the W3C SWEO Linking Open Data project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Moreover, we are currently
setting up a much larger user study over Mechanical Turk, with the twofold goal
of creating a manually annotated dataset much larger than the one
offered by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and of supporting the applicability of Tagme to short web texts.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>C.</given-names> <surname>Bizer</surname></string-name>.
          <article-title>The Emerging Web of Linked Data</article-title>.
          <source>IEEE Intelligent Systems</source>,
          <volume>24</volume>(<issue>5</issue>):<fpage>87</fpage>-<lpage>92</lpage>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>C.</given-names> <surname>Carpineto</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Osinski</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Romano</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Weiss</surname></string-name>.
          <article-title>A survey of web clustering engines</article-title>.
          <source>ACM Comput. Surv.</source>,
          <volume>41</volume>(<issue>3</issue>):<fpage>1</fpage>-<lpage>38</lpage>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>P.</given-names> <surname>Ferragina</surname></string-name> and
          <string-name><given-names>U.</given-names> <surname>Scaiella</surname></string-name>.
          <article-title>TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)</article-title>.
          <source>In Proc. ACM CIKM</source>,
          <fpage>1625</fpage>-<lpage>1628</lpage>,
          <year>2010</year>.
          A detailed technical report is available at http://arxiv.org/abs/1006.3498.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>S.</given-names> <surname>Kulkarni</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Ramakrishnan</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Chakrabarti</surname></string-name>.
          <article-title>Collective annotation of Wikipedia entities in web text</article-title>.
          <source>In Proc. ACM KDD</source>,
          <fpage>457</fpage>-<lpage>466</lpage>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>O.</given-names> <surname>Medelyan</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Milne</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Legg</surname></string-name>, and
          <string-name><given-names>I. H.</given-names> <surname>Witten</surname></string-name>.
          <article-title>Mining meaning from Wikipedia</article-title>.
          <source>Int. J. Hum.-Comput. Stud.</source>,
          <volume>67</volume>(<issue>9</issue>):<fpage>716</fpage>-<lpage>754</lpage>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>D.</given-names> <surname>Milne</surname></string-name> and
          <string-name><given-names>I. H.</given-names> <surname>Witten</surname></string-name>.
          <article-title>Learning to link with Wikipedia</article-title>.
          <source>In Proc. ACM CIKM</source>,
          <fpage>509</fpage>-<lpage>518</lpage>,
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>U.</given-names> <surname>von Luxburg</surname></string-name>.
          <article-title>A tutorial on spectral clustering</article-title>.
          <source>Statistics and Computing</source>,
          <volume>17</volume>(<issue>4</issue>):<fpage>395</fpage>-<lpage>416</lpage>,
          <year>2007</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>