Extending Jensen-Shannon Divergence to Compare Multiple Corpora

Jinghui Lu1, Maeve Henchion2, and Brian Mac Namee1
1 School of Computer Science, University College Dublin, Ireland
2 Teagasc Food Research Centre, Ireland

Abstract. Investigating public discourse on social media platforms has proven a viable way to reflect the impacts of political issues. In this paper we frame this as a corpus comparison problem in which the online discussions of different groups are treated as different corpora to be compared. We propose an extended version of the Jensen-Shannon divergence measure to compare multiple corpora and use the FP-growth algorithm to mix unigrams and bigrams in this comparison. We also propose a set of visualizations that can illustrate the results of this analysis. To demonstrate these approaches we compare the Twitter discourse surrounding Brexit in Ireland and Great Britain across a 14-week time period.

1 Introduction

Social media platforms—such as Twitter, Reddit and Facebook—have dramatically changed the way that people communicate and form their opinions on issues that are important to them [23]. The massive volume of relatively easily accessible digital content generated on these platforms (Twitter alone, for example, has 320 million monthly active users3) presents a compelling opportunity to harvest and analyse the opinions of the public on important issues [1].

Many interesting questions that can be answered by analysing data from social media platforms amount to comparing how the opinions of specific groups differ, and can be framed as a corpus comparison. Jensen-Shannon divergence (JSD) [13] is a popular mechanism for performing corpus comparison but is limited to comparing pairs of corpora and considering only bigrams or unigrams.
We extend the Jensen-Shannon divergence approach to allow comparison of multiple corpora and enable simultaneous analysis of both unigrams and bigrams through the use of the FP-growth algorithm, a popular approach for frequent itemset mining. We demonstrate the effectiveness of this approach through an analysis of the differences in Twitter data relating to Brexit arising from Ireland (including Northern Ireland) and Great Britain (excluding Northern Ireland), and across different time periods.

The remainder of the paper proceeds as follows. In Section 2 we survey relevant existing work; Section 3 describes Jensen-Shannon divergence and how we have extended it; Section 4 is a case study of the application of our approach to analysing the Twitter discussion of Brexit; and, finally, Section 5 summarizes the work and suggests directions for future explorations.

3 Data retrieved on July 22, 2017 from https://about.twitter.com/company

2 Related Work

There are many examples in the literature of researchers harvesting content posted on social media platforms and analysing it to understand public opinion. For example, Conover et al. [5] proposed several approaches to monitoring the political opinions of the general public from Twitter data. Similarly, Bollen et al. [2] analysed sentiments extracted from tweets to reveal how events in the social, political, cultural and economic fields impact on the public mood. Twitter data has also been analysed to reveal the distinctive phrases used by people of different genders [19], and the differences between social protest and counter-protest movements [6]. Twitter has also been used for tracking the levels of disease activity and public concern in the US during the influenza H1N1 pandemic of 2009 [21]. Aramaki et al. [1] addressed the similar issue of detecting influenza epidemics using Twitter data.
Although there are some recognized limitations of the effectiveness of using data from social media platforms such as Twitter for analysing public opinion (for example the narrow demographics of these platforms' users, or the tendency to communicate extreme opinions), this has been shown to be an effective approach to revealing insights [15].

Many of the interesting questions that can be answered by analysing data from social media platforms amount to comparing how the opinions of specific groups differ (for example [19] and [6]). This can be framed as a corpus comparison problem [10] in which the posts of the different groups are treated as different text corpora to be compared. Typical approaches to corpus comparison are statistical in nature. For example, the TF-IDF measure [18] can be used to reflect how important a word is to a document in a collection of corpora. It is also possible to apply statistical significance tests across the distribution of words in different corpora. For instance, Leech and Fallon [12] used a χ2-test to identify whether words are more common in British or American English, and Church and Hanks proposed the Mutual Information (MI) measure [4], which has been employed to identify the characteristic vocabulary of corpora [10]. Frequency profiling was later used by Rayson and Garside [17] to extract distinctive words across corpora from different domains.

Rather than applying statistical corpus comparison methods simply to tokenized words in a corpus, it can be useful to apply linguistic pre-processing using techniques such as part-of-speech tagging [3], stemming [14], and lemmatization [7]. Weber and Buitelaar [22] adopted a hybrid method that computes a χ2 value for each term after linguistic processing. Terms with a χ2 value above a certain threshold are deemed relevant to an individual domain. Another widely used hybrid corpus comparison method is Jensen-Shannon divergence (JSD) [13].
For example, Pechenick et al. [16] used JSD to weight the importance of words involved in language evolution. Gallagher et al. [6] used JSD to quantify the divergence between tweets containing the hashtag #BlackLivesMatter and other tweets including #AllLivesMatter to investigate the differing opinions of protest and counter-protest movements. Typical approaches to JSD work across pairs of corpora and are based on unigram tokens. We extend these to an approach that can compare multiple corpora and mixtures of unigram and bigram tokens.

3 Extending Jensen-Shannon Divergence

In this section we describe how Jensen-Shannon divergence (JSD) can be used for corpus comparison, and how we have extended the standard approach to allow for comparison of multiple corpora and the use of a combination of unigram and bigram tokens.

3.1 Jensen-Shannon Divergence

Broadly, entropy refers to uncertainty or disorder [9]. Shannon's entropy [20] is a measure of the unpredictability of a state and can be written as:

H = -\sum_{i=1}^{n} p_i \log p_i    (1)

In the text analysis context, Shannon's entropy describes the uncertainty of a text which has n unique words, where the ith word has probability p_i of appearing. In this case, we can use Shannon's entropy as a diversity measure called the Shannon index, where higher entropy implies higher diversity (the text is more unpredictable) and vice versa.

Kullback and Leibler [11] proposed a statistical measure which estimates the differences between two probability distributions. Given two probability distributions P and Q, the Kullback-Leibler (KL) divergence is defined as:

D_{KL}(P \| Q) = \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i}    (2)

where n is the size of the sample space. In the context of text analysis, n can be regarded as the number of unique words, and p_i and q_i are the probabilities of observing word i in corpora P and Q respectively. Applying KL divergence directly to two Twitter corpora, however, is likely to raise issues [6].
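As a concrete illustration, Shannon's entropy (Equation 1) and the KL divergence (Equation 2) can be computed over empirical word distributions in a few lines of Python; the toy corpora below are hypothetical stand-ins, not data from the paper.

```python
import math
from collections import Counter

def word_probs(tokens):
    """Empirical word probability distribution from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def shannon_entropy(probs):
    """Equation 1: H = -sum_i p_i log2 p_i (in bits)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def kl_divergence(p, q):
    """Equation 2: D_KL(P||Q) = sum_i p_i log2(p_i / q_i).
    Blows up to infinity if a word in P never occurs in Q."""
    div = 0.0
    for w, pw in p.items():
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf  # the issue with direct KL on tweet corpora
        div += pw * math.log2(pw / qw)
    return div

p = word_probs("brexit deal brexit vote".split())
q = word_probs("brexit vote border vote".split())
print(shannon_entropy(p))   # 1.5 bits
print(kl_divergence(p, q))  # inf, since "deal" is absent from Q
```

The infinite result when a word appears in only one corpus is exactly what motivates the smoothed JSD measure described next in the text.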
If a word appears in only one corpus, this divergence will be infinitely large. To avoid this, Gallagher et al. [6] suggested using the Jensen-Shannon divergence instead, which is a smoothed version of the KL divergence. JSD was originally proposed by Lin [13] as:

D_{JS}(P \| Q) = H(\pi_1 P + \pi_2 Q) - \pi_1 H(P) - \pi_2 H(Q)    (3)

where H(x) is Shannon's entropy as described in Equation 1, and \pi_1 and \pi_2 are weights associated with the two probability distributions P and Q respectively. Gallagher et al. [6] rephrased JSD as:

D_{JS}(P \| Q) = \pi_1 D_{KL}(P \| M) + \pi_2 D_{KL}(Q \| M)    (4)

This solves the issue of infinite divergence by introducing the mixed distribution M = \pi_1 P + \pi_2 Q, where \pi_1 and \pi_2 are weights proportional to the sizes of P and Q, with \pi_1 + \pi_2 = 1.

JSD has the useful property that it is bounded between 0 and 1. When comparing two texts, a JSD score of 0 indicates that the word probability distributions in both texts are equal, while a JSD score of 1 indicates that no word appears in both distributions [6]. Another advantage is that, by the linearity of JSD, we can measure the contribution of individual words to the divergence. The contribution of word i to JSD can be calculated as:

D_{JS,i}(P \| Q) = -m_i \log_2 m_i + \pi_1 p_i \log_2 p_i + \pi_2 q_i \log_2 q_i    (5)

where m_i is the probability of seeing word i in M. Through Equation 5, we can easily find the most indicative words of each corpus by sorting the JSD contributions of each word.

JSD has previously been used to compare two corpora [6, 16]. We extend this idea so that we can not only compare tweets from two countries but also tweets from different time periods. We can extend Equation 5 so that it can be applied across multiple probability distributions, which allows more than two corpora to be compared. The extension of Equation 5 that computes word i's contribution to the divergence over multiple corpora is given as:

D_{JS,i}(P_1 \| P_2 \| \ldots \| P_n) = -m_i \log_2 m_i + \sum_{j=1}^{n} \pi_j p_{ji} \log_2 p_{ji}    (6)

where p_{ji} is the probability of observing word i in corpus P_j, and m_i is the probability of seeing word i in M. Here, M is a mixed distribution of the n corpora:

M = \sum_{i=1}^{n} \pi_i P_i    (7)

where, again, \pi_1, \pi_2, ..., \pi_n are weights proportional to the sizes of P_1 to P_n, with \pi_1 + \pi_2 + ... + \pi_n = 1. Using Equation 6, we can compute the contributions of individual words to the JSD value over many corpora. In this study, we apply the extended JSD equation to discover the distinguishing words of tweets from different time periods. We also extend previous approaches to allow unigrams and bigrams to be analysed in parallel, and describe this in the next section.

Fig. 1. Scatter plot of top terms from Great Britain (a) and scatter plot of top terms from Ireland (b)

3.2 The FP-growth Algorithm

JSD can be easily applied at both the unigram and bigram levels by considering unigram or bigram tokens in separate applications of the calculations described in the previous section and combining the results. This naive approach, however, leads to an unsatisfactory result in which the component unigrams of each bigram also appear in any list of the most divergent terms. To address this issue we apply the FP-growth algorithm4 [8] to discover all frequent sets of tokens which satisfy a minimal support level (in our implementation a frequency equal to at least the square root of the number of words in the corpus). If a unigram and a bigram are included in the same frequent set of tokens, the unigram can be recognized as redundant and removed. By using FP-growth to filter out redundant information, we can analyse bigrams together with unigrams to give a better analysis.
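The extended per-word contribution of Equation 6 and the unigram-filtering step can be sketched in Python. This is a minimal sketch under stated assumptions: the paper uses the pyfpgrowth package for full frequent-itemset mining, whereas here a direct bigram-frequency check stands in for FP-growth, and all corpus contents are hypothetical.

```python
import math
from collections import Counter

def jsd_contributions(corpora):
    """Per-word contributions to JSD over multiple corpora (Equation 6).

    corpora: a list of token lists; the weights pi_j are proportional
    to corpus size, as in the text. Returns {word: contribution}."""
    sizes = [len(c) for c in corpora]
    total = sum(sizes)
    weights = [s / total for s in sizes]
    dists = [{w: n / len(c) for w, n in Counter(c).items()} for c in corpora]

    # Mixed distribution M = sum_j pi_j P_j (Equation 7).
    mixed = Counter()
    for pi, dist in zip(weights, dists):
        for w, p in dist.items():
            mixed[w] += pi * p

    contrib = {}
    for w, m in mixed.items():
        c = -m * math.log2(m)
        for pi, dist in zip(weights, dists):
            p = dist.get(w, 0.0)
            if p > 0:
                c += pi * p * math.log2(p)
        contrib[w] = c
    return contrib

def drop_redundant_unigrams(terms, corpus, min_support):
    """If a bigram occurs at least min_support times, its component
    unigrams carry duplicate information and are removed. A direct
    co-occurrence count stands in for FP-growth here."""
    bigram_counts = Counter(zip(corpus, corpus[1:]))
    redundant = set()
    for (w1, w2), n in bigram_counts.items():
        if n >= min_support and f"{w1} {w2}" in terms:
            redundant.update([w1, w2])
    return [t for t in terms if t not in redundant]
```

Sorting the dictionary returned by `jsd_contributions` in descending order yields the most distinguishing terms across the corpora; summing all contributions recovers the total JSD, which is what the linearity property guarantees.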
4 Case Study

In this section we present a case study of using our extended JSD approach to compare the Twitter discussion relating to Brexit in Ireland (including Northern Ireland) and Great Britain (excluding Northern Ireland) across different time periods. We are concerned with two questions: (1) how did people's concerns over Brexit change over the time period, and (2) what are the different concerns in relation to Brexit in Great Britain and Ireland? We first describe how we collected a dataset, then describe a set of visualizations used to present the analysis, and finally the insights which are extracted from the analysis.

4.1 Data Collection

Our dataset was obtained from Twitter using the Twitter Get Search API5. We collected tweets relating to Brexit from Ireland and Great Britain over the time period from 15/01/2017 to 23/04/2017. To collect tweets relating to Brexit we specified that a tweet must contain at least one of the search terms "brexit", "post-brexit", "hard-brexit", "soft-brexit", "postbrexit", "softbrexit", or "hardbrexit". To separate tweets from Ireland and Great Britain we specified spatial regions through a centre and radius (as allowed through the API). The details are:

– Ireland: latitude: 53.413940, longitude: -7.940989, radius: 300
– GB (south): latitude: 52.674554, longitude: -1.761640, radius: 220
– GB (north): latitude: 56.268001, longitude: -5.185579, radius: 300

Great Britain is divided into two regions, GB (north) and GB (south), with tweets from both combined into a single corpus. After collecting tweets using these criteria we dropped all duplicate tweets and retweets. Our final dataset contained 1,129,754 tweets from Great Britain and 72,148 tweets from Ireland. Before beginning analysis of this dataset we removed all punctuation (except for # and @ symbols), converted all text to lowercase, and removed stop words.

4 Using the pyfpgrowth package in Python
5 https://dev.twitter.com/rest/reference/get/search/tweets
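The cleaning steps described above (lowercasing, stripping punctuation while preserving # and @, and removing stop words), plus the subsequent unigram and bigram tokenisation, can be sketched as below. The stop-word list here is a tiny hypothetical stand-in for whichever list was actually used.

```python
import re

# Tiny illustrative stop-word list; the real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "rt"}

def preprocess(tweet):
    """Lowercase, drop punctuation except # and @, remove stop words."""
    text = tweet.lower()
    text = re.sub(r"[^\w\s#@]", " ", text)
    return [t for t in text.split() if t not in STOP_WORDS]

def bigrams(tokens):
    """Adjacent-pair bigram tokens, joined with a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = preprocess("The #Brexit deal, explained: a hard border?")
print(tokens)           # ['#brexit', 'deal', 'explained', 'hard', 'border']
print(bigrams(tokens))  # ['#brexit deal', 'deal explained', ...]
```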
Following this we tokenised separately into unigrams and bigrams.

4.2 The Evolution of Attention

Figure 1(a) illustrates how the top concerns over Brexit of British people changed over the period from 15/01/2017 to 23/04/2017, according to an analysis of the collected tweets. We divided this time period into 14 periods of 7 days, each of which defines a corpus. We apply Equation 6 to compute how much individual unigrams and bigrams contribute to the divergence across these 14 time periods, and then rank the unigrams and bigrams according to their contribution scores. In Figure 1(a), each term is represented by a horizontal bar with a width indicating its JSD contribution score. Each term appears only once in the graph, in the time period with the highest probability of seeing that term. The vertical position of each term represents the rank of its JSD contribution score within a given period. For example, the term "supreme court" is located at rank 1 for the week starting on January 22. The JSD contribution score for this term is high, denoting that it is the most distinctive phrase between Jan 22 and Jan 29. To produce Figure 1(a) we select the top 50 bigrams and top 50 unigrams and then use the FP-growth algorithm to remove unigrams that carry duplicate information.

From Figure 1(a), we can see how British people's concerns changed over different time periods. In general, before February, British people were concerned with topics surrounding British Prime Minister Theresa May's speech, Article 50, and the Supreme Court. In contrast, during February and March, the British people's attention seemed to be drawn to many other issues when considering Brexit, such as Budget 2017 and the Scottish independence referendum, as evidenced by the presence of the terms "#budget2017", "#scotref", and "#indyref2". However, at the end of March, topics around Article 50 returned to public attention.
Theresa May signed the letter to trigger Article 50 and instigate Brexit on March 29th, which explains the high ranks of the phrases "#brexitday", "may trigger", and "#article50 #brexit" at that time. Finally, by mid-April, people's focus appears to be dominated by topics relating to the 2017 British general election.

Figure 1(b) shows the most distinctive phrases from Irish tweets over the same time periods. The result shows an extremely similar situation to the British one. Overall, from January 15 to April 23, the focus of Twitter attention to Brexit in Ireland surrounds Theresa May's speech, the triggering of Article 50, the Scottish independence referendum, and the British general election. There are some differences, however, evidenced by the appearance of terms like "united ireland" and "hard border". In the next section we focus explicitly on analysing these differences.

4.3 Comparing Brexit in Ireland and Great Britain

To compare the differences between the Twitter discussions of Brexit in Ireland and Great Britain across the full time period covered by our dataset, we apply Equation 5 to determine the most divergent unigrams and bigrams. We present the results in Figure 2. We list the top 20 unigrams and top 20 bigrams from each of Ireland and Great Britain (again, we remove the unigrams which are included in high-ranking bigrams according to the result of the FP-growth algorithm). The length of the bars corresponds to the JSD contribution scores, with higher values indicating more distinguishing terms. A bar to the left (shaded red) indicates that a term is more common in British tweets, while a bar to the right (shaded blue) indicates that a term is more common in Irish tweets.

Fig. 2. The most divergent unigrams and bigrams (according to JSD contribution) in British (red and to the left) and Irish (blue and to the right) tweets during the period from 15/01/2017 to 23/04/2017.
From Figure 2 we can see that the top two terms spoken about in Irish tweets, but not in British tweets, are "ni" (an abbreviation for Northern Ireland) and "northern ireland". This illustrates that the key difference between the concerns regarding Brexit expressed on Twitter by people from Ireland and those expressed by people in Great Britain is a focus on the impact on Northern Ireland and, in particular, its border with the Republic of Ireland. We see this echoed in terms like "stormont" (the seat of parliament in Northern Ireland), "hard border", "sinn fein" (an Irish republican political party), "united ireland", "good friday", "friday agreement" (the Good Friday Agreement was a key instrument in peace talks between the Republic of Ireland and Northern Ireland), "common travel", and "enda kenny" (the Irish prime minister at the time that the tweets were collected).

Fig. 3. (a) 30 terms with the highest frequency from British tweets and (b) 30 terms with the highest frequency from Irish tweets

Conversely, the British tweets seem focused on local issues such as "corbyn" (the British Labour Party leader Jeremy Corbyn), "#ukip" (the Eurosceptic United Kingdom Independence Party), and the "nhs" (a shortened form of the "National Health Service"); and potential impacts of Brexit such as "eu citizens", "hard brexit", and "london". We can see from Figure 2, however, that the JSD contribution scores differ across the British and Irish corpora—Irish tweets tend to give rise to unigrams and bigrams with higher JSD contributions than British tweets. A possible explanation is that the JSD method looks for unigrams or bigrams that consistently appear in one corpus but rarely appear in the other. Figure 3 shows the 30 most frequent terms in each corpus. These graphs show that the main topics surrounding Brexit (e.g.
Article 50, the Scottish independence referendum, and Theresa May) are discussed in both British and Irish tweets, as evidenced by the high frequency of terms such as "theresa may" and "article 50". Irish tweets alone, however, have a set of frequently mentioned border-related topics, as evidenced by the high frequency of terms such as "northern ireland". In contrast, British tweets do not seem to have a set of frequently mentioned topics that do not appear in Irish tweets. The corpus of British tweets is also much larger than the corpus of Irish tweets, and this might also contribute to the relatively low JSD contribution scores for unigrams and bigrams from British tweets.

5 Conclusions

In this paper we have proposed an approach to analysing differences in the Twitter discourse of different groups around the same topics using corpus comparison techniques. Specifically, we have used an extended version of Jensen-Shannon divergence coupled with the application of the FP-growth algorithm to merge unigram and bigram analyses. We have also demonstrated how this analysis can be visualized. We demonstrate this approach through a case study that analyses the Twitter discussion of Brexit from Ireland and Great Britain, across different time periods. Through our analysis we can see how concerns over Brexit evolved over the period studied, as well as extracting the main differences between the concerns in the two countries—primarily a focus in Irish tweets on the impact on the border with Northern Ireland.

This case study also exposes some of the drawbacks of this approach. For example, it appears that the results of JSD are vulnerable to the effects of spam tweets. The appearance of the phrases "bridging loan", "#brexit bridging", "#Manchester #Capital", etc. reveals that our approach is sensitive to the specific phrases in spam tweets. The reason is simple: if many spam tweets that only appear in one corpus contain very specific terms (e.g.
“bridging loan”), then those specific terms will be recognized by the JSD approach as distinguishing. For example, we see in our British corpus (but not in our Irish corpus) various business promotion tweets like "How much can I borrow? - #Manchester #Capital #Bridging Loans #Brexit https://t.co/nDg5ZNKVdf bridging loan, uk, Manchester" that include the distinguishing terms "bridging loan", "#Manchester #Capital", and so on. Though these tweets contain the hashtag "#Brexit" they are actually unrelated to Brexit issues. The JSD approach can easily be hijacked when this happens. There is also a tension in successfully displaying divergence scores along with frequency in a way that is easy for readers to comprehend. We will address these issues in future work. In future work we will also address how similar techniques can be used to compare corpora that arise from quite different sources—for example online news sources and Twitter.

Acknowledgement. This research was kindly supported by a Teagasc Walsh Fellowship award (2016053).

References

1. Aramaki, E., Maskawa, S., Morita, M.: Twitter catches the flu: detecting influenza epidemics using Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1568-1576 (2011)
2. Bollen, J., Mao, H., Pepe, A.: Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In: ICWSM, 11, pp. 450-453 (2011)
3. Brill, E.: A simple rule-based part of speech tagger. In: Proceedings of the Workshop on Speech and Natural Language, pp. 112-116. Association for Computational Linguistics (1992)
4. Church, K., Hanks, P.: Word association norms, mutual information, and lexicography. In: Computational Linguistics, 16(1), pp. 22-29 (1990)
5. Conover, M., Gonçalves, B., Ratkiewicz, J., Flammini, A., Menczer, F.: Predicting the political alignment of Twitter users.
In: Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), pp. 192-199. IEEE (2011)
6. Gallagher, R., Reagan, A., Danforth, C., Dodds, P.: Divergent discourse between protests and counter-protests: #blacklivesmatter and #alllivesmatter. In: arXiv preprint arXiv:1606.06820 (2016)
7. Green, N., Breimyer, P., Kumar, V., Samatova, N.: WebBANC: Building semantically-rich annotated corpora from web user annotations of minority languages (2009)
8. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: ACM SIGMOD Record, 29(2), pp. 1-12 (2000)
9. Kelleher, J.D., Mac Namee, B., D'Arcy, A.: Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press (2015)
10. Kilgarriff, A.: Comparing corpora. In: International Journal of Corpus Linguistics, 6(1), pp. 97-133 (2001)
11. Kullback, S., Leibler, R.: On information and sufficiency. In: The Annals of Mathematical Statistics, 22(1), pp. 79-86 (1951)
12. Leech, G., Fallon, R.: Computer corpora: what do they tell us about culture. In: ICAME Journal, 16 (1992)
13. Lin, J.: Divergence measures based on the Shannon entropy. In: IEEE Transactions on Information Theory, 37(1), pp. 145-151 (1991)
14. Lovins, J.: Development of a stemming algorithm. In: Mechanical Translation and Computational Linguistics, 11(1-2), pp. 22-31 (1968)
15. O'Callaghan, D., Prucha, N., Greene, D., Conway, M., Carthy, J., Cunningham, P.: Online social media in the Syria conflict: Encompassing the extremes and the in-betweens. In: Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pp. 409-416. IEEE (2014)
16. Pechenick, E., Danforth, C., Dodds, P.: Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not.
In: Journal of Computational Science, 21, pp. 24-37 (2017)
17. Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: Proceedings of the Workshop on Comparing Corpora, pp. 1-6. Association for Computational Linguistics (2000)
18. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing & Management, 24(5), pp. 513-523 (1988)
19. Schwartz, H., Eichstaedt, J., Kern, M., Dziurzynski, L., Ramones, S., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M., Ungar, L.: Personality, gender, and age in the language of social media: The open-vocabulary approach. In: PLoS ONE, 8(9), e73791 (2013)
20. Shannon, C.: A mathematical theory of communication. In: ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), pp. 3-55 (2001)
21. Signorini, A., Segre, A., Polgreen, P.: The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. In: PLoS ONE, 6(5), e19467 (2011)
22. Weber, N., Buitelaar, P.: Web-based ontology learning with ISOLDE. In: Proc. of the Workshop on Web Content Mining with Human Language at the International Semantic Web Conference, Athens GA, USA (Vol. 11) (2006)
23. Zappavigna, M.: Discourse of Twitter and Social Media: How We Use Language to Create Affiliation on the Web (2015)