<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Similarity-Based Cross-Media Retrieval for Events</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piroska Lendvai</string-name>
          <email>piroska.r@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Declerck</string-name>
          <email>declerck@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Dept. of Computational Linguistics, Saarland University, Saarbrücken</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>369</fpage>
      <lpage>372</lpage>
      <abstract>
<p>Our goal is to link social media content to contextually relevant information in complementary media in the domain of daily news. Web links from tweets with user-included URLs are transferred to URL-less tweets, using manually annotated events. The new cross-media ties establish authoritative feedback documents for unsupported social media content, and enable extracting an improved set of event-denoting terms based on longest common subsequences between tweets and documents.</p>
      </abstract>
      <kwd-group>
        <kwd>social media</kwd>
        <kwd>information contextualization</kwd>
        <kwd>similarity-based retrieval</kwd>
        <kwd>cross-media feedback documents</kwd>
        <kwd>term extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We aim to create a cross-media (CM) linking algorithm in the PHEME project
to connect User-Generated Content (UGC) to topically relevant information in
complementary media. Media that is complementary to UGC (in our pilot study,
a tweet) is defined to be authoritative news releases on the web.</p>
      <p>
        Recent natural language processing studies present CM approaches
aimed at aligning UGC and authoritative content. The goal of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is
to collect information about emergency situations from tweets that are
complementary to mainstream media reports. First, relevant keywords are determined
from a centroid news article in a topically connected article cluster, and used
in various query constructions to retrieve event-related tweets. The direction
of linking is motivated by the need to boost retrieval precision on established
events, which is orthogonal to the mission of the PHEME project: our targeted
starting point is events that first emerge in social media and are covered in
mainstream news releases only later, or not at all. The algorithm of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is reused and
extended in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: based on a centroid article in an event cluster, related tweets
that contain URLs are mined, using custom-threshold-based term vector
similarity. Then, relevance ranking takes place on these tweets, using platform-specific
indicators (number of mentions, retweets, etc.). New, related articles on the web
are retrieved based on the URLs of top-ranked tweets. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] do not report on the
proportion of retrieved web articles that were already present in the query-originating
news cluster. Such information would make the evaluation of complementary-source
retrieval more transparent, and it forms an important part of our CM algorithm.
      </p>
      <p>
        To implement CM linking for PHEME, our core assumption was that URL
presence in tweets is a relevance-feedback signal analogous to landing-page
information in click data, which can be exploited to develop retrieval functions
from observed user behavior (see e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Referring to external sources is a multi-purpose activity in social
media practices that may combine, among other things, intents of content framing
(i.e., quoting authoritative sources) and content enrichment (i.e., pointing to
extended information). Based on URLs that are present in tweets and point to
web documents, we devised a method that transfers this explicit, user-provided
relevance signal to a collection of tweets that do not include explicit web links.
The transfer is based on manually annotated Events; each tweet is annotated with
exactly one Event. Events are manually annotated situations or stories that
describe smaller-scale episodes than hashtag-denoted topics.
      </p>
      <p>
        Our goal is to link URL-less tweets to a ranked list of web documents,
where topic relevance is bootstrapped from event-based similarity between
URL-including tweets and URL-less tweets, and ranking is based on aggregated n-gram
similarity between tweet text and web-document text. To this end, we extract
and rank key phrases based on document–tweet similarity, and associate them
with the Event that the referring tweet is annotated with. As we focus on
related-content discovery and its use for rumour verification (a rumour being
defined in PHEME as a circulating story of questionable veracity), our setup and
results are more specific than those of the INEX tweet contextualization tasks (see e.g.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), which aim to support a human reader.
      </p>
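      <p>The ranking step above can be sketched as a small function; this is an illustrative sketch under stated assumptions, not the authors' implementation: sim is any tweet-sentence similarity function supplied by the caller, the documents mapping with its doc_id keys is a hypothetical schema, and taking the maximum over document sentences is one plausible way to aggregate.</p>

```python
def rank_documents(tweet, documents, sim):
    """Rank candidate web documents for one tweet.

    documents: {doc_id: list of sentences} (hypothetical schema);
    sim: similarity function over (tweet, sentence) pairs.
    Aggregation: max over the document's sentences (an assumption).
    """
    scored = []
    for doc_id, sentences in documents.items():
        score = max((sim(tweet, s) for s in sentences), default=0.0)
        scored.append((score, doc_id))
    scored.sort(reverse=True)  # highest aggregated similarity first
    return [doc_id for _, doc_id in scored]
```

      <p>Any n-gram overlap measure can be plugged in as sim; the paper's concrete choice, LCS, is described in Section 2.</p>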
    </sec>
    <sec id="sec-2">
      <title>Data and Algorithm</title>
      <p>
        We worked with a dataset that consists of tweets relating to two broad events:
(G) the Gurlitt art collection (https://de.wikipedia.org/wiki/Schwabinger_Kunstfund)
and (O) the Ottawa shooting (https://en.wikipedia.org/wiki/2014_shootings_at_Parliament_Hill,_Ottawa).
Tweets were pre-collected by filtering on event-related keywords (e.g. 'gurlitt'),
selecting events that meet the characteristics of a rumour. Each tweet was manually
annotated for situations/stories (henceforth: Events) that correspond to specific
rumours (e.g. for (G): 'The Bern Museum will accept the Gurlitt collection', 'Gurlitt
was mentally unfit when he wrote his will'; for (O): 'There are snipers on the roof
of the National Art Gallery', 'Shooter is still on the loose'), as described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; for characteristics of the data, see the top section of Table 1.
      </p>
      <p>String similarity-based term extraction works as follows. For each
URL-containing tweet within each Event, a tweet–document similarity calculation
cycle is run.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Characteristics of the dataset (top section) and of the extracted terms (bottom section).</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>Gurlitt</th><th>Ottawa</th></tr>
          </thead>
          <tbody>
            <tr><td>languages</td><td>DE, FR, EN</td><td>EN</td></tr>
            <tr><td>events</td><td>3</td><td>51</td></tr>
            <tr><td>tweets without URL</td><td>43</td><td>182</td></tr>
            <tr><td>tweets with URL</td><td>147</td><td>341</td></tr>
            <tr><td>unique URLs</td><td>143</td><td>187</td></tr>
            <tr><td>fetchable web documents [by authoritative sources]</td><td>61 [61]</td><td>107 [107]</td></tr>
            <tr><td>terms extracted from URL-ed tweets</td><td>110</td><td>169</td></tr>
            <tr><td>terms extracted from URL-less tweets</td><td>96</td><td>190</td></tr>
            <tr><td>terms unseen in URL-ed tweets</td><td>83</td><td>143</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Similarity in the current implementation is based on
the Longest Common Subsequence (LCS) metric (cf. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). LCS is a language-independent, flexible-length skip-gram
matching method that we apply on the token level for each tweet–document
sentence pair (casing is normalized; the retweet token, screen names, and
punctuation are removed). No linguistic information is used, except for stopword
filtering with the NLTK toolkit (nltk.org). For all URL-providing tweets of a
given Event, the process produces a list of tweets ranked by LCS similarity with
their linked document (which is in effect a user-coded feedback document), and
outputs the longest common subsequence of tokens between tweet and document body.
      </p>
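      <p>The token-level LCS matching and scoring described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the stopword set is a tiny hardcoded stand-in for the NLTK lists, and normalizing the LCS length by the tweet's length is one plausible reading of the reported similarity scores.</p>

```python
import re

# Tiny hardcoded stand-in for NLTK's stopword lists (assumption).
STOPWORDS = {"das", "des", "an", "der", "die", "und"}

def normalize(text):
    """Lowercase; drop the retweet token, screen names, URLs and
    punctuation; filter stopwords (mirrors the paper's preprocessing)."""
    tokens = []
    for tok in text.lower().split():
        if tok == "rt" or tok.startswith("@") or tok.startswith("http"):
            continue
        tok = re.sub(r"[^\w]", "", tok)  # strip punctuation, keep word chars
        if tok and tok not in STOPWORDS:
            tokens.append(tok)
    return tokens

def lcs_tokens(a, b):
    """Longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Backtrack to recover the common subsequence itself.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def similarity(tweet, sentence):
    """Return the LCS phrase and its length normalized by tweet length."""
    t = normalize(tweet)
    lcs = lcs_tokens(t, normalize(sentence))
    return lcs, (len(lcs) / len(t) if t else 0.0)
```

      <p>The cycle runs this for every tweet–document sentence pair; the highest-scoring LCS phrases per pair are then kept as candidate terms.</p>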
      <p>In the second pass, the cycle is applied to the same set of feedback web
documents, now paired with tweets that did not link external documents but are
hand-labeled with the same Events as the tweets from which the web documents
are referred. This boosts the pool of linked authoritative documents (drawn from
a list of 25k authoritative news sources collected by PHEME) and tweets by 105%
for G and 294% for O; the extracted top-5 LCS phrases (we keep the 5 most similar
LCS phrases for each tweet–web document pair) grow qualitatively, i.e. in terms
of new phrases unseen in the pool of URL-ed tweets and their linked web documents,
by 75% for G and by 85% for O; cf. the bottom section of Table 1. An example
output is provided below for the focus Event 'The Bern Museum will accept the
Gurlitt collection'.</p>
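      <p>The Event-based link transfer of the second pass can be sketched as follows; an illustrative sketch, where the record schema with id, event and url fields is hypothetical:</p>

```python
from collections import defaultdict

def transfer_links(tweets):
    """Propagate feedback documents from URL-ed tweets to URL-less
    tweets that carry the same manually annotated Event label.

    tweets: list of dicts with keys 'id', 'event', 'url'
    ('url' is None for URL-less tweets; schema is hypothetical).
    Returns {tweet_id: sorted candidate feedback URLs} for URL-less tweets.
    """
    docs_by_event = defaultdict(set)
    for t in tweets:
        if t["url"]:
            docs_by_event[t["event"]].add(t["url"])
    return {t["id"]: sorted(docs_by_event[t["event"]])
            for t in tweets if not t["url"]}
```

      <p>The candidate documents returned for each URL-less tweet would then be ranked by LCS similarity, as in the first pass.</p>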
      <p>Focus document's headline: "Bestätigt: Kunstmuseum Bern nimmt das Erbe des
Kunstsammlers Cornelius Gurlitt an - KURIER.at" (English: "Confirmed: Kunstmuseum
Bern accepts the estate of the art collector Cornelius Gurlitt").
Top tweet with URL to the focus document: "Bestätigt: Sammlung Gurlitt geht nach
Bern http://t.co/FRCSHTU5hL" (English: "Confirmed: the Gurlitt collection goes to Bern").
LCS term of top URL-ed tweet and focus document: 'bestätigt sammlung gurlitt
geht bern'; similarity score: 1.00.
Top URL-less tweet labeled with the focus Event: "RT @SWRinfo: Das Kunstmuseum
Bern nimmt das Erbe des Kunstsammlers Cornelius #gurlitt an."
LCS term of top URL-less tweet and focus document: 'kunstmuseum bern nimmt
erbe kunstsammlers cornelius gurlitt'; similarity score: 0.79.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation and Outlook</title>
      <p>We presented a pilot study on transferring feedback-document relevance for
social media posts, based on manually annotated, fine-grained events. We used
the LCS similarity metric to extract descriptive phrases for each Event; the
obtained multi-word terms implicitly encode token proximity and word order,
which is valuable for query and document language modeling and for indexing. LCS
was also used to assign term and document weights to each Event, independently
of a fixed document collection. Tweets with unsupported claims could be linked
to authoritative web documents by utilizing hand-coded tweet–tweet similarity
information; work on obtaining this information automatically is ongoing.</p>
      <p>The findings suggest that LCS is advantageous when working with big data
across languages and domains, as foreseen in the PHEME project. In future work
we plan to compare LCS with other similarity metrics, as well as to evaluate the
obtained term and document rankings in a retrieval scenario for information
verification purposes. The major impact of Event-based bootstrapping of
cross-media links is that we obtain a much larger set of cross-media context
pairs, enabling the extraction of an improved list of event descriptors that can
be put to use in fact checking and contextual document ranking, on which we
plan to report in follow-up studies.</p>
      <p>Acknowledgments. We are grateful to two anonymous reviewers for their insightful
comments. Work presented in this paper has been supported by the PHEME FP7
project (grant No. 611233).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Overview of INEX tweet contextualization 2013 track</article-title>
          .
          <source>CLEF</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Balahur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanev</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Detecting Event-Related Links and Sentiments from Social Media Texts</article-title>
          .
          <source>ACL Conference System Demonstrations</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Optimizing search engines using clickthrough data</article-title>
          .
          <source>Proceedings of the ACM Conference on Knowledge Discovery and Data Mining</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.-Y.</given-names>
          </string-name>
          :
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          .
          <source>In: Text summarization branches out: Proceedings of the ACL-04 workshop</source>
          . Vol.
          <volume>8</volume>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Tanev</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piskorski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zavarella</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Enhancing Event Descriptions through Twitter Mining</article-title>
          .
          <source>In: Proceedings of ICWSM</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zubiaga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Procter</surname>
            ,
            <given-names>R. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tolmie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Towards detecting rumours in social media</article-title>
          .
          <source>In: AAAI Workshop on AI for Cities</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>