Exploiting News to Categorize Tweets: Quantifying The Impact of Different News Collections

Marco Pavan, University of Udine, Udine, Italy, marco.pavan@uniud.it
Stefano Mizzaro, University of Udine, Udine, Italy, mizzaro@uniud.it
Matteo Bernardon, University of Udine, Udine, Italy, matteo.bernardon@gmail.com
Ivan Scagnetto, University of Udine, Udine, Italy, ivan.scagnetto@uniud.it

Copyright © 2016 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: M. Martinez, U. Kruschwitz, G. Kazai, D. Corney, F. Hopfgartner, R. Campos and D. Albakour (eds.): Proceedings of the NewsIR'16 Workshop at ECIR, Padua, Italy, 20-March-2016, published at http://ceur-ws.org

Abstract

Short texts are full of abbreviations and newly coined acronyms, which makes them hard to classify. Text enrichment is emerging in the literature as a potentially useful tool. This paper is part of a longer-term research effort that aims at understanding the effectiveness of tweet enrichment by means of news, instead of the whole web, as a knowledge source. Since the choice of a news collection may lead to very different outcomes in the enrichment process, we compare the impact of three features of such collections: volume, variety, and freshness. We show that all three features have a significant impact on categorization accuracy.

1 Introduction

Social network contents are analyzed for several purposes: identifying trends [MK10], categorizing and filtering news [JG13, SSTW14], measuring their importance, spread etc. [NGKA11]. Other researchers try to categorize short texts posted on social networks (e.g., tweets), using contents taken from the WWW, to understand user interests, to build user models etc. However, platforms like Twitter limit the text length, and users tend to use abbreviations and acronyms to write even faster. In many cases the posted texts have a very low number of characters (several surveys show that the mode is 28 characters [twi16a]); therefore, an automatic categorization process based on topic extraction methodologies may not be reliable enough. In these cases, exploiting an additional source of information could help, by providing additional text to analyze. Since short texts posted by users are often related to recent events (users share their opinions and thoughts with friends), our approach is to use news collections instead of generic web contents in the categorization process.

On this basis, we study how the choice of the news collection affects the results: in particular, how news collections with different properties impact the categorization effectiveness. More specifically, we analyze, by means of three experiments, three features of news collections: (i) Volume, to see how different numbers of news provide different sets of terms for the enrichment phase and, consequently, affect the categorizations; (ii) Variety, to see how news of different nature impact the enrichment process; and (iii) Freshness, to highlight how the effectiveness changes when using news from different time windows (i.e., same temporal context, 1 year old, 2 years old etc.). We exploit the methodology proposed in [MPSV14], based on enriching the text with a new set of words, extracted from news on webpages of the same temporal context (i.e., a set of news published in the same period as the short text), and on a categorization obtained by querying the Wikipedia category tree as an external knowledge base.

2 Related work

All the works in the literature addressing the problem of classifying tweets recognize that "data sparseness" and ambiguity represent a serious issue. For instance, in [HH15] the authors use the "bag-of-words" approach, adopting dimensionality reduction techniques, to mitigate accuracy and performance problems.
In [AGHT11] the authors introduce several enrichment strategies (i.e., entity-based, topic-based, tweet-based and news-based) to relate tweets and news articles belonging to the same temporal context, in order to assign a semantic meaning to short messages. In [YPF10] another enrichment-based approach is proposed to classify generic online text documents, by adding a semantic context and structure, using Wikipedia as a knowledge source. In [GLJD13] the authors define a framework to enrich and relate Twitter feeds to other tweets and news speaking about the same topics. Hashtags (for tweets) and named entities (for news) are used to achieve this goal. A cluster-based representation enrichment method (CREST) is introduced in [DSL13]: this system enriches short texts by incorporating a vector of topical relevances (besides the commonly adopted tf-idf representation). Finally, topics are extracted using a hierarchical clustering algorithm with purity control. Enrichment techniques can also be quite sophisticated as, e.g., in [WZX+14], where short texts are classified by exploiting link analysis on topic-keyword graphs. In particular, after the initial topic modeling phase, each topic is associated with a set of related keywords. Afterwards, link analysis on a subsequent topic-keyword bipartite graph is carried out, to select the keywords most related to the analyzed short text.

Machine learning can play a fundamental role in classifying short texts: for instance, in [DDZC13] supervised SVM (Support Vector Machine) techniques are used to classify tweets into 12 predefined groups tailored for the online community of Sri Lanka. In [ZCH15] a fully automated unsupervised Bayesian model is used. In particular, only tweets related to events are selected, exploiting a lexicon built from news articles published in the same period.

So far, it is clear that the classification of short texts (whatever the related semantic domain) must rely on some form of background knowledge, to fill the gaps and the lack of information of the original messages. Such a knowledge base can be found in external semantic platforms like, e.g., Wikipedia (as in some of the above mentioned works, and in the INEX Tweet Contextualization Track [ine13]), the WWW or other, possibly more focused, archives/structures. Hence, it is of utmost importance to study how the choice of the external collection influences the accuracy of the short text categorization process.

3 Features of News Collections

To run a set of experiments analyzing the features of the collections, we use two different open source document collections, which differ in number and kind of documents included, have different sizes, span from 2011 to 2013, and also have some temporal overlaps to allow several comparisons. They are shown in Table 1 and allow us to analyze the following three key features:

• Volume: we want to see the impact of news samples with different cardinality, extracted from the same collection in different percentages. With this test we aim to measure how an increase in the amount of news correlates with the final enrichment effectiveness.

• Variety: news are often different in nature, such as texts from blogs, forums, online newspapers etc., and different varieties of texts could have a different impact on the text enrichment. We want to measure how the news variety affects the results.

• Freshness: short texts are often related to recent events; therefore, it is interesting to study how important it is to have the publishing time of the news close to the publishing time of the short text being enriched, and how the enrichment effectiveness changes using increasingly older news.

Table 1: The two news collections used in the experiments

Acronym     Name                                         # of docs / size     Kind of docs                   Timespan
Temporalia  NTCIR Temporal Information Access 2012 (a)   ~2M / ~20GB          blogs, news                    Jan2011 – Dec2013
KBA         Knowledge Base Acceleration 2012 (b)         ~20M / ~930GB (c)    blogs, news, forums, social    Oct2011 – May2013

(a) http://ntcirtemporalia.github.io/NTCIR-12/collection.html
(b) http://trec-kba.org/
(c) Data extracted from the 3rd stream corpora http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html

Figure 1: News collections distribution with features based tests

Figure 1 shows a representation of the two collections distributed over time, together with the tweets used as short texts to analyze. The Volume test, highlighted in orange, aims to compare the categorization results with samples of news from the same collection but with different sizes; the Variety test, in green, compares results among news samples with the same cardinality but with different kinds of news; and the Freshness test, in purple, exploits news from the same collection but from different years. The figure shows only some examples; the details of all the experiments are described in the next section.

4 Experimental evaluation

4.1 Experimental design

To evaluate the impact of each news collection on the categorization process we selected a set of 5 popular Twitter accounts, famous in different fields: David Cameron (@David_Cameron) for Politics, Harry Kane (@HKane) for Sport, Bill Gates (@BillGates) for Technology, Neil Patrick Harris (@ActuallyNPH) for Cinema and Rihanna (@rihanna) for Music. We extracted a set of tweets from each account in a specific time window, according to the test we planned to run, in order to have a sufficient amount of short texts to enrich and categorize. We used a Python wrapper [pyt16] around the official Twitter API [twi16b] to retrieve tweets. We repeated this process to obtain a sample of 1000 tweets for each test which involves a large temporal window (e.g., six months or one year). For tests focused on one month, instead, we built samples of 250 tweets. We then defined the benchmarks as described in the next sections.

4.1.1 Volume test

To measure the impact of the collection volume we defined 2 tests, "Test 1a" based on Temporalia and "Test 1b" on KBA. We analyzed samples using news subsets with different cardinality. With these tests we can see how changing the amount of news affects the results, and also whether the results generalize across different collections. The 2 tests are defined as follows:

Test 1a: Tweets posted in whole 2013, categorized with Temporalia 1%, Temporalia 10% and Temporalia 100%.

Test 1b: Tweets posted in whole 2013, categorized with KBA 1%, KBA 10% and KBA 100%.

4.1.2 Variety test

We defined "Test 2a" and "Test 2b" to measure how the variety of news inside a collection could impact the enrichment phase and consequently the categorization process. We selected news samples with the same cardinality from different collections and from different time windows, in order to see the effects of changing news varieties, and also whether a wider time window of 6 months shows the same effects we get on only 1 month. The 2 tests are defined as follows:

Test 2a: Tweets posted in January 2013, categorized with Temporalia Jan 2013 (60K news sample), KBA Jan 2013 (60K news sample) and Temporalia+KBA Jan 2013 (30K+30K news sample).

Test 2b: Tweets posted in the second half of 2012, categorized with Temporalia Jul-Dec 2012 (400K news sample), KBA Jul-Dec 2012 (400K news sample) and Temporalia+KBA Jul-Dec 2012 (200K+200K news sample).

4.1.3 Freshness test

To benchmark how important the news freshness is we defined 3 tests: "Test 3a" and "Test 3b", based on different news "aging", and "Test 3c", based on a different collection. In the first test we want to see the difference between enriching the tweets with news extracted from the same temporal context (i.e., at most 1 month before the publishing date) and with news from the same year of publishing (i.e., at most 1 year before the publishing date). In the second test we extend this analysis to more than 1 year before the publishing date; in particular, we benchmark the results using news related to events of the same year as the tweets, 1 year old and 2 years old. The third test aims to compare the same "aging effect" with a different collection. The 3 tests are defined as follows:

Test 3a: Tweets posted in whole 2013, categorized with Temporalia 2013 - contextualized (i.e., only news from the same month in which the tweet was posted) and Temporalia Jan 2013 (both samples are composed of 60K news).

Test 3b: Tweets posted in whole 2013, categorized with Temporalia 2013, Temporalia 2012 and Temporalia 2011 (all three samples are composed of 90K news).

Test 3c: Tweets posted in whole 2012, categorized with KBA 2012 - contextualized, KBA Jan 2012 and KBA 2012 (all three samples are composed of 100K news).

4.2 Measures

To evaluate the experiments and to benchmark the effectiveness of the collections we carried out an expert evaluation of each analyzed feature over samples of short texts, composed of either all the tweets (250) for the one-month tests or a set of 250 randomly extracted tweets for the tests based on larger temporal windows.

We used a categorization prototype system [MPSV14] which provides, as its final outcome, a list of labels extracted from the Wikipedia category tree. The system includes a module which analyzes the text, searches for related documents in a news collection, and extracts a set of words used to enrich the original short text.

The texts have been submitted to the categorization system with different news collections according to the three tests described in Section 4.1. For each test, in order to assess the impact of the news on the enrichment process, the set of categories yielded by the system has been evaluated by expert users. They assigned a rating, i.e., a number between 1 and 5 (1=lowest value, 5=highest value), indicating how well the categories represent the topic discussed in the tweet. For the Volume test in particular, where we used only a portion of the entire collection, we ran the evaluation several times, with news samples randomly rebuilt each time. We kept the average ratings obtained with the different sub-collections, to avoid bias due to the random set of news. Specifically, for samples with 10% or 1% of the news we ran the evaluation 3 or 5 times, respectively, rounding the average ratings to the nearest integer value.

4.3 Results

Results are reported in the following charts, which show distribution functions of the ratings obtained by each test with the different experiment settings. In particular, we display the cumulative distribution function (CDF), the inverted complementary cumulative distribution function (I-CCDF), and a table reporting the mean ratings. The I-CCDF is provided for easier reading: it shows the data in ascending order, so that the best performing news collection is the line at the top of the chart.

4.3.1 Volume Test

Figure 2: Volume impact CDF, I-CCDF, and mean ratings

Figure 2 shows the results related to Test 1a and 1b, highlighting how for both collections the number of news is an important feature to consider. We can observe a noticeable improvement with Temporalia 100% compared to smaller samples. Increasing the volume allows us to include a large number of both relevant and non-relevant news: the former yield a global improvement, while the latter have a low overall impact. The general improvement is also confirmed by the Wilcoxon test. We then notice only a slight difference between Temporalia 1% and 10%, where the number of news increases from the order of magnitude of 10K to 100K. The Wilcoxon test over this pair of rating distributions confirmed that the difference between those samples is not statistically significant, with a p-value>0.05. On the other hand, with KBA we already have a noticeable difference between KBA 1% and KBA 10%, due to the order of magnitude increasing from 100K to 1M, and an even better result using KBA 100% (10M). This fact emphasizes how increasing the sample size has considerable effects on the results only when a certain amount of news is reached. The different impact of Temporalia and KBA is probably also due to factors other than the difference in size alone. Of course the same percentage, applied to collections with very different sizes, yields sets of extracted documents whose cardinality is very different; hence we can also expect a different variety in such sets. Moreover, for instance, KBA does not fully cover year 2013, hence the effectiveness could be affected by the publishing date of the analyzed short texts. Such aspects are taken into consideration in the remaining experiments.

4.3.2 Variety Test

Figure 3: Variety impact CDF, I-CCDF, and mean ratings

Figure 3 shows how the variety of news inside the analyzed samples affects the enrichment effectiveness. Continuous lines represent the results over 1 month of news (Test 2a), and dotted lines over 6 months (Test 2b). For both experiments there is a noticeable difference among the samples, which highlights how increasing the variety of news improves the final categorization on different time windows as well. The Wilcoxon test over the sample pairs of each test confirms the statistically significant difference between all the rating distributions. This fact highlights how important it is to increase the variety of news in order to improve the set of words used for text enrichment.

4.3.3 Freshness Test

Figure 4: Freshness impact CDF, I-CCDF, and mean ratings

The chart in Figure 4 shows the results related to Test 3a, 3b and 3c; it is possible to notice how the news freshness affects the results, especially when the news get older. Collections with contextualized news achieve the best effectiveness because the news publishing time is close to that of the tweets (same month), and therefore they provide more relevant additional text to exploit. The categorization worsens for tweets randomly selected from whole 2013 when using collections of news extracted from the same year, either equally distributed over all months or only from January. The effectiveness decreases drastically when the news come from previous years. In particular we can notice that we got the same lowest effectiveness with Temporalia 2012 and Temporalia 2011, highlighting how news that are 1 (or more) years old carry little useful information for these purposes.

Test 3a results, related to Temporalia 2013, show how large the difference is between news only some months apart in time, and Test 3b results, where we analyzed three years of Temporalia news, highlight how going back 1 year already severely affects the categorization process. With the KBA collections we can notice that the results are similar, and the rating distributions, represented by dotted lines, highlight better effectiveness with higher news freshness. Wilcoxon tests confirm that there is a statistically significant difference among the rating distributions in both Temporalia and KBA, except for Temporalia '11 and '12, which have equal values. This is a further confirmation that even news that are only a few months old affect the results strongly, as news from previous years do.

5 Discussion and Conclusions

The experiments performed in this work have demonstrated that text enrichment is noticeably affected by the features of the news collections that we have analyzed. More precisely, there is a critical threshold for the collection Volume, above which the amount of news is sufficient to reach a good level of effectiveness. Moreover, this threshold seems to depend on the overall size of the collection taken into consideration. Our benchmarks confirm the importance of news Variety, highlighting how increasing the number of available kinds yields a better enrichment both for texts selected in one month and for those in the larger time window. The news Freshness appears to be a sensitive feature, since news published close to the period of the short text provide a better set of terms to use in the enrichment phase. Indeed, as soon as the news begin to age (even by just a few months) the effectiveness of the categorization drastically decreases.

For future work, we plan to refine and complete the experiments on the three features we focused on. For instance, it could be interesting to look at the impact of the number of documents extracted from the news collection and used to categorize short texts. As we pointed out in Section 4.3, a larger database will produce a higher number of elements (with the same percentage), and this fact can have subtle implications on the final outcomes. We also plan to carry out further experiments on variety, investigating which kinds of news are important to include in the collection, and which ones are marginal. As far as freshness is concerned, we could investigate more precisely, by varying the granularity of the time windows, which temporal threshold causes a quick decrease of the effectiveness of the enrichment process. Moreover, we plan to carry out further experiments on different news collections and new kinds of short texts (e.g., instant chat messages, online comments). Unfortunately we could not use the Signal Media collection available at http://research.signalmedia.co/newsir16/signal-dataset.html; indeed, a collection covering a one-month period is not sufficient for the kind of experiments we described in this paper (think, e.g., of the freshness test).

References

[AGHT11] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of twitter posts for user profile construction on the social web. In The Semanic Web: Research and Applications, pages 375–389. Springer, 2011.

[DDZC13] Inoshika Dilrukshi, Kasun De Zoysa, and Amitha Caldera. Twitter news classification using SVM. In Proc. of ICCSE'13, pages 287–291. IEEE, 2013.

[DSL13] Zichao Dai, Aixin Sun, and Xu-Ying Liu. Crest: Cluster-based representation enrichment for short text classification. In Advances in Knowledge Discovery and Data Mining, pages 256–267. Springer, 2013.

[GLJD13] Weiwei Guo, Hao Li, Heng Ji, and Mona T Diab. Linking tweets to news: A framework to enrich short text data in social media. In ACL (1), pages 239–249, 2013.

[HH15] Yin-Fu Huang and Chen-Ting Huang. Mining domain information from social contents based on news categories. In Proc. of IDEAS'15, pages 186–191. ACM, 2015.

[ine13] INEX 2013 Tweet Contextualization Track. http://inex.mmci.uni-saarland.de/tracks/qa/, 2013.

[JG13] Nirmal Jonnalagedda and Susan Gauch. Personalized News Recommendation Using Twitter. In Proc. of WI-IAT'13, pages 21–25. IEEE Computer Society, 2013.

[MK10] Michael Mathioudakis and Nick Koudas. Twittermonitor: trend detection over the Twitter stream. In Proc. of ACM SIGMOD'10, pages 1155–1158. ACM, 2010.

[MPSV14] S. Mizzaro, M. Pavan, I. Scagnetto, and M. Valenti. Short text categorization exploiting contextual enrichment and external knowledge. In SIGIR '14 Proceedings. SoMeRA, SIGIR, July 2014.

[NGKA11] Nasir Naveed, Thomas Gottron, Jérôme Kunegis, and Arifah Che Alhadi. Bad news travel fast: A content-based analysis of interestingness on Twitter. In Proc. of WebSci'11, page 8. ACM, 2011.

[pyt16] Python wrapper around the Twitter API. https://dev.twitter.com/rest/public, 2016.

[SSTW14] Timm O Sprenger, Philipp G Sandner, Andranik Tumasjan, and Isabell M Welpe. News or noise? using twitter to identify and understand company-specific news flow. Journal of Business Finance & Accounting, 41(7-8):791–830, 2014.

[twi16a] The Next Web. http://thenextweb.com/twitter/2012/01/07/interesting-fact-most-tweets-posted-are-approximately-30-characters-long/#gref, 2016. [Online, visited Feb-2016].

[twi16b] Twitter REST APIs. https://dev.twitter.com/rest/public, 2016.

[WZX+14] Peng Wang, Heng Zhang, Bo Xu, Chenglin Liu, and Hongwei Hao. Short text feature enrichment using link analysis on topic-keyword graph. In Natural Language Processing and Chinese Computing, pages 79–90. Springer, 2014.

[YPF10] Hiroki Yamakawa, Jing Peng, and Anna Feldman. Semantic enrichment of text representation with wikipedia for text classification. In Proc. of SMC'10, pages 4333–4340. IEEE, 2010.

[ZCH15] Deyu Zhou, Liangyu Chen, and Yulan He. An unsupervised framework of exploring events on twitter: Filtering, extraction and categorization. In Proc. of AAAI'15, 2015.