<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting News to Categorize Tweets: Quantifying The Impact of Different News Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Pavan</string-name>
          <email>marco.pavan@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Bernardon</string-name>
          <email>matteo.bernardon@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Scagnetto</string-name>
          <email>ivan.scagnetto@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR</institution>
          ,
          <addr-line>Padua, Italy, 20 March 2016, published at http://ceur-ws.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Short texts, due to their nature which makes them full of abbreviations and newly coined acronyms, are not easy to classify. Text enrichment is emerging in the literature as a potentially useful tool. This paper is part of a longer term research effort that aims at understanding the effectiveness of tweet enrichment by means of news, instead of the whole web, as a knowledge source. Since the choice of a news collection may contribute to produce very different outcomes in the enrichment process, we compare the impact of three features of such collections: volume, variety, and freshness. We show that all three features have a significant impact on categorization accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social Network contents are analyzed for several purposes: identifying trends [MK10], categorizing and filtering news [JG13, SSTW14], measuring their importance, spread, etc. [NGKA11]. Other researchers try to categorize short texts posted on social networks (e.g., tweets), using contents taken from the WWW, to understand user interests, to build user models, etc. However, platforms like Twitter limit the text length, and users tend to use abbreviations and acronyms to write even faster. In a lot of cases the posted texts have a very low number of characters (footnote 1); therefore, an automatic categorization process with topic extraction methodologies could be not reliable enough. In these cases, exploiting an additional source of information could help, providing additional text to analyze. Since short texts posted by users are often related to recent events (sharing their opinions and thoughts with friends), our approach is to use news collections instead of generic web contents in the categorization process.</p>
      <p>Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.</p>
      <p>On this basis, we study how the choice of the news collection affects the results: in particular, how different news collections with different properties impact the categorization effectiveness. More specifically, we analyze, by means of three experiments, three features of news collections: (i) Volume, to see how different numbers of news provide different sets of terms for the enrichment phase and, consequently, affect the categorizations; (ii) Variety, to see how news of different nature impact the enrichment process; and (iii) Freshness, to highlight the different effectiveness obtained by using news from different time windows (i.e., same temporal context, 1 year old, 2 years old, etc.). We exploit the methodology proposed in [MPSV14], based on a text enrichment with a new set of words, extracted from news on webpages of the same temporal context (footnote 2), and a categorization by querying the Wikipedia category tree as external knowledge base.</p>
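      <p>As a minimal illustrative sketch of the enrichment idea just described (not the actual [MPSV14] implementation; the term-selection heuristic and function names are our own assumptions), one can collect terms from temporally close news that share vocabulary with the tweet and append the most frequent ones:</p>
      <p>
```python
from collections import Counter

def enrich_tweet(tweet_tokens, news_articles, top_k=10):
    """Illustrative enrichment: gather terms from news articles that
    share vocabulary with the tweet, and keep the most frequent ones."""
    tweet_vocab = set(tweet_tokens)
    counts = Counter()
    for article_tokens in news_articles:
        # Only news overlapping the tweet's vocabulary contribute terms.
        if tweet_vocab.intersection(article_tokens):
            counts.update(t for t in article_tokens if t not in tweet_vocab)
    extra = [term for term, _ in counts.most_common(top_k)]
    return tweet_tokens + extra

enriched = enrich_tweet(
    ["cameron", "speech"],
    [["cameron", "parliament", "vote"], ["music", "awards"]],
)
```
      </p>
      <p>The enriched token list can then be handed to the categorization step; in the paper's pipeline this is a query against the Wikipedia category tree.</p>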
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>All the works in the literature addressing the problem of classifying tweets recognize that "data sparseness" and ambiguity represent a serious issue. For instance, in [HH15] the authors use the "bag-of-words" approach, adopting dimensionality reduction techniques, to reduce accuracy and performance problems.</p>
      <p>Footnote 1: Several surveys show that the mode of characters is 28 [twi16a]. Footnote 2: A set of news published in the same period of the short text.</p>
      <p>In [AGHT11] the authors introduce several enrichment strategies (i.e., entity-based, topic-based, tweet-based and news-based) to relate tweets and news articles belonging to the same temporal context, in order to assign a semantic meaning to short messages. In [YPF10] another enrichment-based approach is proposed to classify generic online text documents, by adding a semantic context and structure, using Wikipedia as a knowledge source. In [GLJD13] the authors define a framework to enrich and relate Twitter feeds to other tweets and news speaking about the same topics. Hashtags (for tweets) and named entities (for news) are used to achieve such goal. A cluster-based representation enrichment method (CREST) is introduced in [DSL13]: such system enriches short texts by incorporating a vector of topical relevances (besides the commonly adopted tf-idf representation). Finally, topics are extracted using a hierarchical clustering algorithm with purity control. Enrichment techniques can also be quite sophisticated like, e.g., in [WZX+14], where short texts are classified exploiting link analysis on topic-keyword graphs. In particular, after the initial topic modeling phase, each topic is associated with a set of related keywords. Afterwards, link analysis on a topic-keyword bipartite graph is carried out, to select the keywords most related to the analyzed short text.</p>
      <p>Machine learning can play a fundamental role in classifying short texts: for instance, in [DDZC13] supervised SVM (Support Vector Machine) techniques are used to classify tweets into 12 predefined groups tailored for the online community of Sri Lanka. In [ZCH15] a completely automated unsupervised Bayesian model is used. In particular, only tweets related to events are selected, exploiting a lexicon built from news articles published in the same period.</p>
      <p>So far, it is clear that the problem of classifying short texts (whatever the related semantic domain) must rely on some form of background knowledge, to fill the gaps and lack of information of the original messages. Such knowledge base can be found in external semantic platforms like, e.g., Wikipedia (as in some of the above mentioned works, and in the INEX Tweet Contextualization Track [ine13]), the WWW or other, possibly more focused, archives/structures. Hence, it is of utmost importance to study how the choice of the external collection influences the accuracy of the short text categorization process.</p>
    </sec>
    <sec id="sec-3">
      <title>Features of News Collections</title>
      <p>To run a set of experiments to analyze the collection features, we use two different open source document collections, which differ in number and kind of documents included, have different sizes, span from 2011 to 2013, and also have some temporal overlaps to allow several comparisons. They are shown in Table 1 and allow us to analyze the following three key features:
• Volume: we want to see the impact of news samples with different cardinality, extracted from the same collection in different percentages. With this test we aim to measure how the amount increment correlates with the final enrichment effectiveness.
• Variety: news often differ in nature, such as texts from blogs, forums, online newspapers, etc., and different varieties of texts could have a different impact on the text enrichment. We want to measure how the news variety affects the results.
• Freshness: short texts are often related to recent events; therefore, it is interesting to study how important it is to have the publishing time of the news close to the publishing time of the short text being enriched, and how the enrichment effectiveness changes using increasingly older news.</p>
      <p>Figure 1 shows a representation of the two collections distributed over time, with tweets as the short texts to analyze. The Volume test, highlighted in orange, aims to compare the categorization results with samples of news from the same collection but with different sizes; the Variety test, in green, compares results among news samples with the same cardinality but with different kinds of news; and the Freshness test, in purple, exploits news from the same collection but in different years. The figure shows only some examples; the details of all the experiments are described in the next section.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental design</title>
        <p>To evaluate the impact of each news collection on the categorization process we selected a set of 5 popular Twitter accounts, famous in different fields. In particular, David Cameron (@David_Cameron) for Politics, Harry Kane (@HKane) for Sport, Bill Gates (@BillGates) for Technology, Neil Patrick Harris (@ActuallyNPH) for Cinema and Rihanna (@rihanna) for Music. We extracted a set of tweets from each account in a specific time window, according to the test we planned to run, in order to have a sufficient amount of short texts to enrich and categorize. We used a Python wrapper [pyt16] around the official Twitter API [twi16b] to retrieve tweets. We repeated this process to have a sample of 1000 tweets for each test which involves a large temporal window (e.g., six months or one year). Instead, for tests focused on one month, we built samples of 250 tweets. We then defined the benchmarks as follows in the next sections.</p>
        <p>Table 1 notes - a: http://ntcirtemporalia.github.io/NTCIR-12/collection.html; b: http://trec-kba.org/; c: data extracted from the 3rd stream corpora http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html</p>
      </sec>
      <sec id="sec-4-2">
        <title>Volume test</title>
        <p>To measure the impact of collection volume we defined 2 tests, "Test 1a" based on Temporalia and "Test 1b" on KBA. We analyzed samples using news subsets with different cardinality. With these tests we can see how changing the amount of news affects the results, and also whether the results generalize across different collections. The 2 tests are defined as follows:</p>
        <p>Test 1a: Tweets posted in whole 2013, categorized with Temporalia 1%, Temporalia 10% and Temporalia 100%.</p>
        <p>Test 1b: Tweets posted in whole 2013, categorized with KBA 1%, KBA 10% and KBA 100%.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Variety test</title>
        <p>We defined "Test 2a" and "Test 2b" to measure how the variety of news inside a collection could impact the enrichment phase and consequently the categorization process. We selected news samples with the same cardinality from different collections and from different time windows, in order to see the effects of changing news varieties, and also whether on a wider time window of 6 months we have the same effects we get on only 1 month. The 2 tests are defined as follows:</p>
        <p>Test 2a: Tweets posted in January 2013, categorized with Temporalia Jan 2013 (60K news sample), KBA Jan 2013 (60K news sample) and Temporalia+KBA Jan 2013 (30K+30K news sample).</p>
        <p>Test 2b: Tweets posted in the second half of 2012, categorized with Temporalia Jul-Dec 2012 (400K news sample), KBA Jul-Dec 2012 (400K news sample) and Temporalia+KBA Jul-Dec 2012 (200K+200K news sample).</p>
      </sec>
      <sec id="sec-4-3a">
        <title>Freshness test</title>
        <p>To benchmark how important the news freshness is, we defined 3 tests: "Test 3a" and "Test 3b", based on different news "aging", and "Test 3c", based on a different collection. For the first test we want to see the difference between enriching the tweets with news extracted from the same temporal context (i.e., at most 1 month before the publishing date) and news in the same year of publishing (i.e., at most 1 year before the publishing date). In the second test we want to extend this analysis to more than 1 year before the publishing date; in particular we benchmark the results using news related to events of the same year of the tweets, 1 year old and 2 years old. The third test aims to compare the same "aging effect" with a different collection. The 3 tests are defined as follows:</p>
        <p>Test 3a: Tweets posted in whole 2013, categorized with Temporalia 2013 - contextualized (footnote 3) and Temporalia Jan 2013 (both samples are composed of 60K news).</p>
        <p>Test 3b: Tweets posted in whole 2013, categorized with Temporalia 2013, Temporalia 2012 and Temporalia 2011 (all samples are composed of 90K news).</p>
        <p>Test 3c: Tweets posted in whole 2012, categorized with KBA 2012 - contextualized, KBA Jan 2012 and KBA 2012 (all samples are composed of 100K news).</p>
      </sec>
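      <p>The sampling rule described in the experimental design (1000 tweets for large temporal windows, 250 for one-month tests) can be sketched as follows; this is our own illustration, not the actual retrieval code built on the Twitter API wrapper [pyt16]:</p>
      <p>
```python
import random

def sample_size(window_days):
    # Per the experimental design: 250 tweets for one-month tests,
    # 1000 for larger windows (e.g., six months or one year).
    return 250 if 31 >= window_days else 1000

def build_sample(tweets, window_days, seed=0):
    """Draw a random benchmark sample of the appropriate size."""
    rng = random.Random(seed)
    n = min(len(tweets), sample_size(window_days))
    return rng.sample(tweets, n)
```
      </p>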
      <sec id="sec-4-4">
        <title>Measures</title>
        <p>To evaluate the experiments and to benchmark the collections' effectiveness we carried out an expert evaluation to assess each analyzed feature over short text samples composed of either all tweets for one-month-based tests (250) or a set of 250 randomly extracted tweets for tests based on larger temporal windows.</p>
        <p>We used a categorization prototype system [MPSV14] for the categorization of short texts which provides, as final outcome, a list of labels extracted from the Wikipedia category tree. The system includes a module which analyzes text, searches related documents in a news collection, and extracts a set of words used to enrich the original short text.</p>
        <p>Footnote 3: Only news from the same month when the tweet has been posted.</p>
        <p>The texts have been submitted to the categorization system with different news collections according to the three tests described in Section 4.1. For each test, in order to assess the news impact over the enrichment process, the set of categories yielded by the system has been evaluated by expert users. The latter assigned a rating, i.e., a number between 1 and 5 (1=lowest value, 5=highest value), indicating how properly the categories represent the topic discussed in the tweet.</p>
        <p>In particular, for the Volume test, we ran the evaluation several times, with news samples randomly rebuilt each time, where we used only a portion of the entire collection. We kept the average ratings obtained with different sub-collections, avoiding bias due to the random set of news. Specifically, for samples with 10% or 1% of news we ran the evaluation 3 or 5 times respectively, approximating the average ratings to the nearest integer value.</p>
      </sec>
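      <p>The rating aggregation for the repeated Volume-test evaluations can be sketched as below (an illustrative reading of the protocol; the "round half up" tie rule is our assumption, since the paper only says ratings are approximated to the nearest integer):</p>
      <p>
```python
def aggregate_ratings(runs):
    """Average each tweet's rating across repeated evaluation runs
    (3 runs for 10% samples, 5 for 1% samples) and approximate the
    mean to the nearest integer."""
    n_runs = len(runs)
    averaged = []
    for per_tweet in zip(*runs):          # ratings for one tweet across runs
        mean = sum(per_tweet) / n_runs
        averaged.append(int(mean + 0.5))  # round half up to nearest integer
    return averaged
```
      </p>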
      <sec id="sec-4-5">
        <title>Results</title>
        <p>Results are reported in the following charts, which show distribution functions of the ratings obtained by each test with the different experiment settings. In particular, we display the cumulative distribution function (CDF), the inverted complementary cumulative distribution function (I-CCDF), and a table reporting the mean ratings. The I-CCDF is provided for easier reading, showing the data in ascending order and thus highlighting the better-performing news collection as the line at the top of the chart.</p>
      </sec>
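      <p>For concreteness, the two distribution functions used in the charts can be computed from a list of 1-5 ratings as follows (a standard empirical-distribution sketch, not the authors' plotting code):</p>
      <p>
```python
def cdf(ratings, scale=(1, 2, 3, 4, 5)):
    """Empirical CDF over the rating scale: fraction of ratings at most r."""
    n = len(ratings)
    return [sum(1 for x in ratings if r >= x) / n for r in scale]

def i_ccdf(ratings, scale=(1, 2, 3, 4, 5)):
    """Inverted complementary CDF: fraction of ratings at least r, so the
    better-performing collection plots as the line at the top of the chart."""
    n = len(ratings)
    return [sum(1 for x in ratings if x >= r) / n for r in scale]
```
      </p>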
      <sec id="sec-4-6">
        <title>Volume Test</title>
        <p>Figure 2 shows the results related to Tests 1a and 1b, highlighting how for both collections the number of news is an important feature to consider. We can observe a noticeable improvement with Temporalia 100% compared to smaller samples. Increasing the volume allows us to include a large number of both relevant and non-relevant news: the first ones yield a global improvement, while the second ones have a low overall impact. The general improvement is also confirmed by the Wilcoxon test. Then, we notice only a slight difference between Temporalia 1% and 10%, where the news increase in number from an order of magnitude of 10K to 100K. The Wilcoxon test, over the latter couple of rating distributions, confirmed a non statistically significant difference between those samples, with a p-value&gt;0.05. On the other hand, with KBA we already have a noticeable difference between KBA 1% and KBA 10%, due to the order of magnitude going from 100K to 1M, and even better results using KBA 100% (10M). This fact emphasizes how increasing the sample sizes has considerable effects on the results only when a certain amount of news is reached. The diverse impact of Temporalia and KBA is probably also due to factors other than the difference in size alone. Of course the same percentage, applied to collections with very different sizes, yields sets of extracted documents whose cardinality is very different; whence we can also expect a different variety of such sets. Moreover, for instance, KBA does not fully cover year 2013, whence the effectiveness could be affected by the publishing date of the analyzed short texts. Such aspects are taken into consideration in the remaining experiments.</p>
      </sec>
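      <p>The significance checks above use the paired Wilcoxon signed-rank test (the same tweets are rated under two collection settings); in practice this is typically a call to scipy.stats.wilcoxon. As a dependency-free sketch of the underlying computation (normal approximation, no zero/tie corrections, assuming some nonzero rating differences):</p>
      <p>
```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test: returns the W+ statistic and a
    two-sided p-value via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over tie groups.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while n > i:
        j = i
        while n > j and abs(diffs[ordered[j]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + 1 + j) / 2.0          # average rank for the tie group
        for k in range(i, j):
            ranks[ordered[k]] = avg
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    # Two-sided p-value from the standard normal CDF.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p
```
      </p>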
      <sec id="sec-4-7">
        <title>Variety Test</title>
        <p>Figure 3 shows how the variety of news inside the analyzed samples affects the enrichment effectiveness. Continuous lines represent the results over 1 month of news (Test 2a), and dotted lines over 6 months (Test 2b). For both experiments there is a noticeable difference among the samples, which highlights how increasing the variety of news allows to improve the final categorization also on different time windows. The Wilcoxon test over the sample pairs of each test confirms the statistically significant difference between all the rating distributions. This fact highlights how important it is to increase the variety of news in order to improve the set of words to use as text enrichment.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Freshness Test</title>
        <p>The chart in Figure 4 shows the results related to Tests 3a, 3b and 3c, and it is possible to notice how the news freshness affected the results, especially as the news get older. Collections with contextualized news got the best effectiveness due to the news publishing time being close to the tweets (same month); therefore they allow to have more relevant additional text to exploit. The system worsened the categorization process with tweets randomly selected from whole 2013 when using collections of news extracted from the same year, either equally distributed over all months or only in January. The effectiveness decreases drastically when the news get older, into previous years. In particular we can notice how we got the same lowest effectiveness with Temporalia 2012 and Temporalia 2011, highlighting how 1 (or more) year old news are poor in information for these purposes.</p>
        <p>Test 3a results, related to Temporalia 2013, show how large the difference is between news distant only some months in time, and Test 3b results, where we analyzed three years of Temporalia news, highlight how going back 1 year is crucial for the categorization.</p>
        <p>The experiments performed in this work have demonstrated that text enrichment is sensibly affected by the features of the news collections that we have analyzed. More precisely, there is a critical threshold for what concerns the collection Volume that allows to have a sufficient amount of news to reach a good level of effectiveness. Moreover, such threshold seems to be dependent on the whole size of the collection taken into consideration. Our benchmarks confirm the importance of news variety, highlighting how increasing the number of available kinds yields a better enrichment both for texts selected in one month and in a wider time window.</p>
        <p>For future work, we plan to refine and complete the experiments on the three focused features. For instance, it could be interesting to look at the impact of the number of documents extracted from the news collection and used to categorize short texts. As we pointed out in Section 4.3, a larger database will produce a higher number of elements (with the same percentage), and this fact can have subtle implications on the final outcomes. We also plan to carry on further experiments about the variety, investigating which kinds of news it is important to include in the collection, and which ones are marginal. As far as freshness is concerned, we could investigate more precisely, varying the granularity of the time windows, which is the temporal threshold causing a quick decrease of the effectiveness of the enrichment process. Moreover, we plan to carry on further experiments on</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[AGHT11] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of twitter posts for user profile construction on the social web. In The Semantic Web: Research and Applications, pages 375-389. Springer, 2011.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[DDZC13] Inoshika Dilrukshi, Kasun De Zoysa, and Amitha Caldera. Twitter news classification using SVM. In Proc. of ICCSE'13, pages 287-291. IEEE, 2013.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[DSL13] Zichao Dai, Aixin Sun, and Xu-Ying Liu. CREST: Cluster-based representation enrichment for short text classification. In Advances in Knowledge Discovery and Data Mining, pages 256-267. Springer, 2013.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[GLJD13] Weiwei Guo, Hao Li, Heng Ji, and Mona T. Diab. Linking tweets to news: A framework to enrich short text data in social media. In ACL (1), pages 239-249, 2013.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[HH15] Yin-Fu Huang and Chen-Ting Huang. Mining domain information from social contents based on news categories. In Proc. of IDEAS'15, pages 186-191. ACM, 2015.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[ine13] INEX 2013 Tweet Contextualization Track. http://inex.mmci.uni-saarland.de/tracks/qa/, 2013.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[JG13] Nirmal Jonnalagedda and Susan Gauch. Personalized News Recommendation Using Twitter. In Proc. of WI-IAT'13, pages 21-25. IEEE Computer Society, 2013.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[MK10] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the Twitter</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>