<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting News to Categorize Tweets: Quantifying The Impact of Different News Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Pavan</string-name>
          <email>marco.pavan@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Bernardon</string-name>
          <email>matteo.bernardon@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Scagnetto</string-name>
          <email>ivan.scagnetto@uniud.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR</institution>
          ,
          <addr-line>Padua, Italy, 20 March 2016, published at http://ceur-ws.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Udine</institution>
          ,
          <addr-line>Udine</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Short texts, due to their nature which makes them full of abbreviations and newly coined acronyms, are not easy to classify. Text enrichment is emerging in the literature as a potentially useful tool. This paper is part of a longer term research effort that aims at understanding the effectiveness of tweet enrichment by means of news, instead of the whole web, as a knowledge source. Since the choice of a news collection may contribute to produce very different outcomes in the enrichment process, we compare the impact of three features of such collections: volume, variety, and freshness. We show that all three features have a significant impact on categorization accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social Network contents are analyzed for several purposes: identifying trends [MK10], categorizing and filtering news [JG13, SSTW14], measuring their importance, spread, etc. [NGKA11]. Other researchers try to categorize short texts posted on social networks (e.g., tweets), using contents taken from the WWW, to understand user interests, to build user models, etc. However, platforms like Twitter limit the text length, and users tend to use abbreviations and acronyms to write even faster. In a lot of cases the posted texts have a very low number of characters (footnote 1); therefore, an automatic categorization process with topic extraction methodologies could be not reliable enough. In these cases, exploiting an additional source of information could help, providing additional text to analyze. Since short texts posted by users are often related to recent events (sharing their opinions and thoughts with friends), our approach is to use news collections instead of generic web contents in the categorization process.</p>
      <p>Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.</p>
      <p>On this basis, we study how the choice of the news collection affects the results: in particular, how different news collections with different properties impact the categorization effectiveness. More specifically, we analyze, by means of three experiments, three features of news collections: (i) Volume, to see how different numbers of news provide different sets of terms for the enrichment phase and, consequently, affect the categorizations; (ii) Variety, to see how news of different nature impact the enrichment process; and (iii) Freshness, to highlight the different effectiveness obtained by using news from different time windows (i.e., same temporal context, 1 year old, 2 years old, etc.). We exploit the methodology proposed in [MPSV14], based on a text enrichment with a new set of words, extracted from news on webpages of the same temporal context (footnote 2), and a categorization by querying the Wikipedia category tree as external knowledge base.</p>
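      <p>As a minimal illustrative sketch of the enrichment idea just described (not the actual [MPSV14] implementation; the term-selection heuristic and function names are our own assumptions), one can collect terms from temporally close news that share vocabulary with the tweet and append the most frequent ones:</p>
      <p>
```python
from collections import Counter

def enrich_tweet(tweet_tokens, news_articles, top_k=10):
    """Illustrative enrichment: gather terms from news articles that
    share vocabulary with the tweet, and keep the most frequent ones."""
    tweet_vocab = set(tweet_tokens)
    counts = Counter()
    for article_tokens in news_articles:
        # Only news overlapping the tweet's vocabulary contribute terms.
        if tweet_vocab.intersection(article_tokens):
            counts.update(t for t in article_tokens if t not in tweet_vocab)
    extra = [term for term, _ in counts.most_common(top_k)]
    return tweet_tokens + extra

enriched = enrich_tweet(
    ["cameron", "speech"],
    [["cameron", "parliament", "vote"], ["music", "awards"]],
)
```
      </p>
      <p>The enriched token list can then be handed to the categorization step; in the paper's pipeline this is a query against the Wikipedia category tree.</p>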
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>All the works in the literature addressing the problem of classifying tweets recognize that "data sparseness" and ambiguity represent a serious issue. For instance, in [HH15] the authors use the "bag-of-words" approach, adopting dimensionality reduction techniques, to reduce accuracy and performance problems.</p>
      <p>Footnote 1: Several surveys show that the mode of characters is 28 [twi16a]. Footnote 2: A set of news published in the same period of the short text.</p>
      <p>In [AGHT11] the authors introduce several enrichment strategies (i.e., entity-based, topic-based, tweet-based and news-based) to relate tweets and news articles belonging to the same temporal context, in order to assign a semantic meaning to short messages. In [YPF10] another enrichment-based approach is proposed to classify generic online text documents, by adding a semantic context and structure, using Wikipedia as a knowledge source. In [GLJD13] the authors define a framework to enrich and relate Twitter feeds to other tweets and news speaking about the same topics. Hashtags (for tweets) and named entities (for news) are used to achieve such goal. A cluster-based representation enrichment method (CREST) is introduced in [DSL13]: such system enriches short texts by incorporating a vector of topical relevances (besides the commonly adopted tf-idf representation). Finally, topics are extracted using a hierarchical clustering algorithm with purity control. Enrichment techniques can also be quite sophisticated like, e.g., in [WZX+14], where short texts are classified exploiting link analysis on topic-keyword graphs. In particular, after the initial topic modeling phase, each topic is associated with a set of related keywords. Afterwards, link analysis on a topic-keyword bipartite graph is carried out, to select the keywords most related to the analyzed short text.</p>
      <p>Machine learning can play a fundamental role in classifying short texts: for instance, in [DDZC13] supervised SVM (Support Vector Machine) techniques are used to classify tweets into 12 predefined groups tailored for the online community of Sri Lanka. In [ZCH15] a completely automated unsupervised Bayesian model is used. In particular, only tweets related to events are selected, exploiting a lexicon built from news articles published in the same period.</p>
      <p>So far, it is clear that the problem of classifying short texts (whatever the related semantic domain) must rely on some form of background knowledge, to fill the gaps and lack of information of the original messages. Such knowledge base can be found in external semantic platforms like, e.g., Wikipedia (as in some of the above mentioned works, and in the INEX Tweet Contextualization Track [ine13]), the WWW or other, possibly more focused, archives/structures. Hence, it is of utmost importance to study how the choice of the external collection influences the accuracy of the short text categorization process.</p>
    </sec>
    <sec id="sec-3">
      <title>Features of News Collections</title>
      <p>To run a set of experiments to analyze the collection features, we use two different open source document collections, which differ in number and kind of documents included, have different sizes, span from 2011 to 2013, and also have some temporal overlaps to allow several comparisons. They are shown in Table 1 and allow us to analyze the following three key features:
• Volume: we want to see the impact of news samples with different cardinality, extracted from the same collection in different percentages. With this test we aim to measure how the amount increment correlates with the final enrichment effectiveness.
• Variety: news often differ in nature, such as texts from blogs, forums, online newspapers, etc., and different varieties of texts could have a different impact on the text enrichment. We want to measure how the news variety affects the results.
• Freshness: short texts are often related to recent events; therefore, it is interesting to study how important it is to have the publishing time of the news close to the publishing time of the short text being enriched, and how the enrichment effectiveness changes using increasingly older news.</p>
      <p>Figure 1 shows a representation of the two collections distributed over time, with tweets as the short texts to analyze. The Volume test, highlighted in orange, aims to compare the categorization results with samples of news from the same collection but with different sizes; the Variety test, in green, compares results among news samples with the same cardinality but with different kinds of news; and the Freshness test, in purple, exploits news from the same collection but in different years. The figure shows only some examples; the details of all the experiments are described in the next section.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental design</title>
        <p>To evaluate the impact of each news collection on the categorization process we selected a set of 5 popular Twitter accounts, famous in different fields. In particular, David Cameron (@David_Cameron) for Politics, Harry Kane (@HKane) for Sport, Bill Gates (@BillGates) for Technology, Neil Patrick Harris (@ActuallyNPH) for Cinema and Rihanna (@rihanna) for Music. We extracted a set of tweets from each account in a specific time window, according to the test we planned to run, in order to have a sufficient amount of short texts to enrich and categorize. We used a Python wrapper [pyt16] around the official Twitter API [twi16b] to retrieve tweets. We repeated this process to have a sample of 1000 tweets for each test which involves a large temporal window (e.g., six months or one year). Instead, for tests focused on one month, we built samples of 250 tweets. We then defined the benchmarks as follows in the next sections.</p>
        <p>Table 1 notes - a: http://ntcirtemporalia.github.io/NTCIR-12/collection.html; b: http://trec-kba.org/; c: data extracted from the 3rd stream corpora http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html</p>
      </sec>
      <sec id="sec-4-2">
        <title>Volume test</title>
        <p>To measure the impact of collection volume we defined 2 tests, "Test 1a" based on Temporalia and "Test 1b" on KBA. We analyzed samples using news subsets with different cardinality. With these tests we can see how changing the amount of news affects the results, and also whether the results generalize across different collections. The 2 tests are defined as follows:</p>
        <p>Test 1a: Tweets posted in whole 2013, categorized with Temporalia 1%, Temporalia 10% and Temporalia 100%.</p>
        <p>Test 1b: Tweets posted in whole 2013, categorized with KBA 1%, KBA 10% and KBA 100%.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Variety test</title>
        <p>We defined "Test 2a" and "Test 2b" to measure how the variety of news inside a collection could impact the enrichment phase and consequently the categorization process. We selected news samples with the same cardinality from different collections and from different time windows, in order to see the effects of changing news varieties, and also whether on a wider time window of 6 months we have the same effects we get on only 1 month. The 2 tests are defined as follows:</p>
        <p>Test 2a: Tweets posted in January 2013, categorized with Temporalia Jan 2013 (60K news sample), KBA Jan 2013 (60K news sample) and Temporalia+KBA Jan 2013 (30K+30K news sample).</p>
        <p>Test 2b: Tweets posted in the second half of 2012, categorized with Temporalia Jul-Dec 2012 (400K news sample), KBA Jul-Dec 2012 (400K news sample) and Temporalia+KBA Jul-Dec 2012 (200K+200K news sample).</p>
      </sec>
      <sec id="sec-4-3a">
        <title>Freshness test</title>
        <p>To benchmark how important the news freshness is, we defined 3 tests: "Test 3a" and "Test 3b", based on different news "aging", and "Test 3c", based on a different collection. For the first test we want to see the difference between enriching the tweets with news extracted from the same temporal context (i.e., at most 1 month before the publishing date) and news in the same year of publishing (i.e., at most 1 year before the publishing date). In the second test we want to extend this analysis to more than 1 year before the publishing date; in particular we benchmark the results using news related to events of the same year of the tweets, 1 year old and 2 years old. The third test aims to compare the same "aging effect" with a different collection. The 3 tests are defined as follows:</p>
        <p>Test 3a: Tweets posted in whole 2013, categorized with Temporalia 2013 - contextualized (footnote 3) and Temporalia Jan 2013 (both samples are composed of 60K news).</p>
        <p>Test 3b: Tweets posted in whole 2013, categorized with Temporalia 2013, Temporalia 2012 and Temporalia 2011 (all samples are composed of 90K news).</p>
        <p>Test 3c: Tweets posted in whole 2012, categorized with KBA 2012 - contextualized, KBA Jan 2012 and KBA 2012 (all samples are composed of 100K news).</p>
      </sec>
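      <p>The sampling rule described in the experimental design (1000 tweets for large temporal windows, 250 for one-month tests) can be sketched as follows; this is our own illustration, not the actual retrieval code built on the Twitter API wrapper [pyt16]:</p>
      <p>
```python
import random

def sample_size(window_days):
    # Per the experimental design: 250 tweets for one-month tests,
    # 1000 for larger windows (e.g., six months or one year).
    return 250 if 31 >= window_days else 1000

def build_sample(tweets, window_days, seed=0):
    """Draw a random benchmark sample of the appropriate size."""
    rng = random.Random(seed)
    n = min(len(tweets), sample_size(window_days))
    return rng.sample(tweets, n)
```
      </p>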
      <sec id="sec-4-4">
        <title>Measures</title>
        <p>To evaluate the experiments and to benchmark the collections' effectiveness we carried out an expert evaluation to assess each analyzed feature over short text samples composed of either all tweets for one-month-based tests (250) or a set of 250 randomly extracted tweets for tests based on larger temporal windows.</p>
        <p>We used a categorization prototype system [MPSV14] for the categorization of short texts which provides, as final outcome, a list of labels extracted from the Wikipedia category tree. The system includes a module which analyzes text, searches related documents in a news collection, and extracts a set of words used to enrich the original short text.</p>
        <p>Footnote 3: Only news from the same month when the tweet has been posted.</p>
        <p>The texts have been submitted to the categorization system with different news collections according to the three tests described in Section 4.1. For each test, in order to assess the news impact over the enrichment process, the set of categories yielded by the system has been evaluated by expert users. The latter assigned a rating, i.e., a number between 1 and 5 (1=lowest value, 5=highest value), indicating how properly the categories represent the topic discussed in the tweet.</p>
        <p>In particular, for the Volume test, we ran the evaluation several times, with news samples randomly rebuilt each time, where we used only a portion of the entire collection. We kept the average ratings obtained with different sub-collections, avoiding bias due to the random set of news. Specifically, for samples with 10% or 1% of news we ran the evaluation 3 or 5 times respectively, approximating the average ratings to the nearest integer value.</p>
      </sec>
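      <p>The rating aggregation for the repeated Volume-test evaluations can be sketched as below (an illustrative reading of the protocol; the "round half up" tie rule is our assumption, since the paper only says ratings are approximated to the nearest integer):</p>
      <p>
```python
def aggregate_ratings(runs):
    """Average each tweet's rating across repeated evaluation runs
    (3 runs for 10% samples, 5 for 1% samples) and approximate the
    mean to the nearest integer."""
    n_runs = len(runs)
    averaged = []
    for per_tweet in zip(*runs):          # ratings for one tweet across runs
        mean = sum(per_tweet) / n_runs
        averaged.append(int(mean + 0.5))  # round half up to nearest integer
    return averaged
```
      </p>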
      <sec id="sec-4-5">
        <title>Results</title>
        <p>Results are reported in the following charts, which show distribution functions of the ratings obtained by each test with the different experiment settings. In particular, we display the cumulative distribution function (CDF), the inverted complementary cumulative distribution function (I-CCDF), and a table reporting the mean ratings. The I-CCDF is provided for easier reading, showing the data in ascending order and thus highlighting the better-performing news collection as the line at the top of the chart.</p>
      </sec>
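      <p>For concreteness, the two distribution functions used in the charts can be computed from a list of 1-5 ratings as follows (a standard empirical-distribution sketch, not the authors' plotting code):</p>
      <p>
```python
def cdf(ratings, scale=(1, 2, 3, 4, 5)):
    """Empirical CDF over the rating scale: fraction of ratings at most r."""
    n = len(ratings)
    return [sum(1 for x in ratings if r >= x) / n for r in scale]

def i_ccdf(ratings, scale=(1, 2, 3, 4, 5)):
    """Inverted complementary CDF: fraction of ratings at least r, so the
    better-performing collection plots as the line at the top of the chart."""
    n = len(ratings)
    return [sum(1 for x in ratings if x >= r) / n for r in scale]
```
      </p>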
      <sec id="sec-4-6">
        <title>Volume Test</title>
        <p>Figure 2 shows the results related to Tests 1a and 1b, highlighting how for both collections the number of news is an important feature to consider. We can observe a noticeable improvement with Temporalia 100% compared to smaller samples. Increasing the volume allows us to include a large number of both relevant and non-relevant news: the first ones yield a global improvement, while the second ones have a low overall impact. The general improvement is also confirmed by the Wilcoxon test. Then, we notice only a slight difference between Temporalia 1% and 10%, where the news increase in number from an order of magnitude of 10K to 100K. The Wilcoxon test, over the latter couple of rating distributions, confirmed a non statistically significant difference between those samples, with a p-value&gt;0.05. On the other hand, with KBA we already have a noticeable difference between KBA 1% and KBA 10%, due to the order of magnitude going from 100K to 1M, and even better results using KBA 100% (10M). This fact emphasizes how increasing the sample sizes has considerable effects on the results only when a certain amount of news is reached. The diverse impact of Temporalia and KBA is probably also due to factors other than the difference in size alone. Of course the same percentage, applied to collections with very different sizes, yields sets of extracted documents whose cardinality is very different; whence we can also expect a different variety of such sets. Moreover, for instance, KBA does not fully cover year 2013, whence the effectiveness could be affected by the publishing date of the analyzed short texts. Such aspects are taken into consideration in the remaining experiments.</p>
      </sec>
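      <p>The significance checks above use the paired Wilcoxon signed-rank test (the same tweets are rated under two collection settings); in practice this is typically a call to scipy.stats.wilcoxon. As a dependency-free sketch of the underlying computation (normal approximation, no zero/tie corrections, assuming some nonzero rating differences):</p>
      <p>
```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test: returns the W+ statistic and a
    two-sided p-value via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over tie groups.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while n > i:
        j = i
        while n > j and abs(diffs[ordered[j]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + 1 + j) / 2.0          # average rank for the tie group
        for k in range(i, j):
            ranks[ordered[k]] = avg
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    # Two-sided p-value from the standard normal CDF.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return w_plus, p
```
      </p>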
      <sec id="sec-4-7">
        <title>Variety Test</title>
        <p>Figure 3 shows how the variety of news inside the analyzed samples affects the enrichment effectiveness. Continuous lines represent the results over 1 month of news (Test 2a), and dotted lines over 6 months (Test 2b). For both experiments there is a noticeable difference among the samples, which highlights how increasing the variety of news allows to improve the final categorization also on different time windows. The Wilcoxon test over the sample pairs of each test confirms the statistically significant difference between all the rating distributions. This fact highlights how important it is to increase the variety of news in order to improve the set of words to use as text enrichment.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Freshness Test</title>
        <p>The chart in Figure 4 shows the results related to Tests 3a, 3b and 3c, and it is possible to notice how the news freshness affected the results, especially as the news get older. Collections with contextualized news got the best effectiveness due to the news publishing time being close to the tweets (same month); therefore they allow to have more relevant additional text to exploit. The system worsened the categorization process with tweets randomly selected from whole 2013 when using collections of news extracted from the same year, either equally distributed over all months or only in January. The effectiveness decreases drastically when the news get older, into previous years. In particular we can notice how we got the same lowest effectiveness with Temporalia 2012 and Temporalia 2011, highlighting how 1 (or more) year old news are poor in information for these purposes.</p>
        <p>Test 3a results, related to Temporalia 2013, show how large the difference is between news distant only some months in time, and Test 3b results, where we analyzed three years of Temporalia news, highlight how going back 1 year is crucial for the categorization.</p>
        <p>The experiments performed in this work have demonstrated that text enrichment is sensibly affected by the features of the news collections that we have analyzed. More precisely, there is a critical threshold for what concerns the collection Volume that allows to have a sufficient amount of news to reach a good level of effectiveness. Moreover, such threshold seems to be dependent on the whole size of the collection taken into consideration. Our benchmarks confirm the importance of news variety, highlighting how increasing the number of available kinds yields a better enrichment both for texts selected in one month and in a wider time window.</p>
        <p>For future work, we plan to refine and complete the experiments on the three focused features. For instance, it could be interesting to look at the impact of the number of documents extracted from the news collection and used to categorize short texts. As we pointed out in Section 4.3, a larger database will produce a higher number of elements (with the same percentage), and this fact can have subtle implications on the final outcomes. We also plan to carry on further experiments about the variety, investigating which kinds of news it is important to include in the collection, and which ones are marginal. As far as freshness is concerned, we could investigate more precisely, varying the granularity of the time windows, which is the temporal threshold causing a quick decrease of the effectiveness of the enrichment process. Moreover, we plan to carry on further experiments on</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[AGHT11] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of twitter posts for user profile construction on the social web. In The Semantic Web: Research and Applications, pages 375-389. Springer, 2011.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[DDZC13] Inoshika Dilrukshi, Kasun De Zoysa, and Amitha Caldera. Twitter news classification using SVM. In Proc. of ICCSE'13, pages 287-291. IEEE, 2013.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[DSL13] Zichao Dai, Aixin Sun, and Xu-Ying Liu. CREST: Cluster-based representation enrichment for short text classification. In Advances in Knowledge Discovery and Data Mining, pages 256-267. Springer, 2013.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[GLJD13] Weiwei Guo, Hao Li, Heng Ji, and Mona T. Diab. Linking tweets to news: A framework to enrich short text data in social media. In ACL (1), pages 239-249, 2013.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[HH15] Yin-Fu Huang and Chen-Ting Huang. Mining domain information from social contents based on news categories. In Proc. of IDEAS'15, pages 186-191. ACM, 2015.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[ine13] INEX 2013 Tweet Contextualization Track. http://inex.mmci.uni-saarland.de/tracks/qa/, 2013.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[JG13] Nirmal Jonnalagedda and Susan Gauch. Personalized News Recommendation Using Twitter. In Proc. of WI-IAT'13, pages 21-25. IEEE Computer Society, 2013.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[MK10] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the Twitter</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>