<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extending Jensen Shannon Divergence to Compare Multiple Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinghui Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maeve Henchion</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Mac Namee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Teagasc Food Research Centre</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Investigating public discourse on social media platforms has proven a viable way to reflect the impacts of political issues. In this paper we frame this as a corpus comparison problem in which the online discussions of different groups are treated as different corpora to be compared. We propose an extended version of the Jensen-Shannon divergence measure to compare multiple corpora and use the FP-growth algorithm to mix unigrams and bigrams in this comparison. We also propose a set of visualizations that can illustrate the results of this analysis. To demonstrate these approaches we compare the Twitter discourse surrounding Brexit in Ireland and Great Britain across a 14 week time period.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social media platforms, such as Twitter, Reddit and Facebook, have
dramatically changed the way that people communicate and form their opinions on
issues that are important to them [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The massive volume of relatively easily
accessible digital content generated on these platforms (Twitter alone, for
example, has 320 million monthly active users) presents a compelling opportunity to
harvest and analyse the opinions of the public on important issues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Many interesting questions that can be answered by analysing data from
social media platforms amount to comparing how the opinions of specific groups
differ, and can be framed as a corpus comparison. Jensen-Shannon divergence
(JSD) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is a popular mechanism for performing corpus comparison, but it is
limited to comparing pairs of corpora and to considering only bigrams or unigrams.
We extend the Jensen-Shannon divergence approach to allow comparison of
multiple corpora and to enable simultaneous analysis of both unigrams and bigrams
through the use of the FP-growth algorithm, a popular approach for frequent
itemset mining. We demonstrate the effectiveness of this approach through an
analysis of the differences in Twitter data relating to Brexit arising from Ireland
(including Northern Ireland) and Great Britain (excluding Northern Ireland),
and across different time periods.
      </p>
      <p>The remainder of the paper proceeds as follows. In Section 2 we survey
relevant existing work; Section 3 describes Jensen-Shannon divergence and how we
have extended it; Section 4 is a case study of the application of our approach
to analysing the Twitter discussion of Brexit; and, finally, Section 5 summarizes
the work and suggests directions for future explorations. (Twitter user figure
retrieved on July 22, 2017 from https://about.twitter.com/company.)</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        There are many examples in the literature of researchers harvesting content
posted on social media platforms and analysing it to understand public opinion.
For example, Conover et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed several approaches to monitoring the
political opinions of the general public from Twitter data. Similarly, Bollen et al.
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] analysed sentiments extracted from tweets to reveal how events in the social,
political, cultural and economic fields impact on the public mood. Twitter data
has also been analysed to reveal the distinctive phrases used by people of
different genders [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and the differences between social protest and counter-protest
movements [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Twitter has also been used for tracking the levels of disease
activity and public concern in the US during the influenza H1N1 pandemic of 2009
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Aramaki et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] addressed the similar issue of detecting influenza epidemics
using Twitter data. Although there are some recognized limitations of the
effectiveness of using data from social media platforms such as Twitter for analysing
public opinion (for example the narrow demographics of these platforms' users,
or the tendency to communicate extreme opinions), this has been shown to be
an effective approach to revealing insights [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Many of the interesting questions that can be answered by analysing data
from social media platforms amount to comparing how the opinions of specific
groups differ (for example [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). This can be framed as a corpus
comparison problem [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] in which the posts of the different groups are treated as
different text corpora to be compared. Typical approaches to corpus comparison
are statistical in nature. For example, the TF-IDF measure [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] can be used
to reflect how important a word is to a document in a collection of corpora. It
is also possible to apply statistical significance tests across the distribution of
words in different corpora. For instance, Leech and Fallon [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used a χ2-test to
identify whether words are more common in British or American English, and
Church and Hanks proposed the Mutual Information (MI) measure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which was
employed to identify the characteristic vocabulary of corpora [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Meanwhile,
frequency profiling was later used by Rayson and Garside [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to extract distinct
words over corpora of different domains.
      </p>
      <p>
        Rather than applying statistical corpus comparison methods simply to
tokenized words in a corpus, it can be useful to apply linguistic pre-processing using
techniques such as part-of-speech tagging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], stemming [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and lemmatization
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Weber and Buitelaar [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] adopted a hybrid method that computes a χ2
value for each term after linguistic processing. Terms with a χ2 value above a
certain threshold are deemed relevant to an individual domain. Another
widely used hybrid corpus comparison method is Jensen-Shannon divergence
(JSD) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For example, Pechenick [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] used JSD to weight the importance of
words involved in language evolution. Gallagher et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used JSD to quantify
the divergence between tweets containing the hashtag #BlackLivesMatter and
other tweets including #AllLivesMatter to investigate the differing opinions of
protest and counter-protest movements.
      </p>
      <p>Typical approaches to JSD work across pairs of corpora and are based on
unigram tokens. We extend these to an approach that can compare multiple
corpora and mixtures of unigram and bigram tokens.</p>
    </sec>
    <sec id="sec-3">
      <title>Extending Jensen-Shannon Divergence</title>
      <p>In this section we describe how Jensen-Shannon divergence (JSD) can be used
for corpus comparison, and how we have extended the standard approach to allow
for comparison of multiple corpora and the use of a combination of unigram and
bigram tokens.</p>
      <sec id="sec-3-1">
        <title>Jensen-Shannon Divergence</title>
        <p>
          Broadly entropy refers to uncertainty or disorder [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Shannon's entropy [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is
a measure of the unpredictability of a state and can be written as:
H = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)
        </p>
        <p>In the text analysis context, Shannon's entropy describes the uncertainty
of a text which has n unique words, where the ith word has probability pi of
appearing. In this case, we can use Shannon's entropy as a diversity measure
called the Shannon index, where higher entropy implies higher diversity (text is
more unpredictable) and vice versa.</p>
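<p>The Shannon index can be sketched in a few lines of Python (the token lists below are hypothetical, purely for illustration); a more repetitive text yields a lower entropy, matching the reading of entropy as lexical diversity:</p>

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    # H = -sum_i p_i * log2(p_i) over the word probability distribution.
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A varied text is less predictable, so its Shannon index is higher.
varied = "brexit vote border trade court ruling".split()
repetitive = "brexit brexit brexit vote vote brexit".split()
```

<p>Here shannon_entropy(varied) exceeds shannon_entropy(repetitive), since the varied text spreads probability mass evenly over more unique words.</p>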
        <p>
          Kullback and Leibler [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed a statistical measure which estimates the difference between two probability distributions. Given two probability distributions P and Q, the Kullback-Leibler (KL) divergence is defined as:
D_{KL}(P \| Q) = \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i} \qquad (2)
where n is the size of the sample space. In the context of text analysis, n can be
regarded as the number of unique words, and p_i and q_i are the probabilities of
observing word i in corpora P and Q respectively.</p>
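<p>A minimal sketch of KL divergence over word distributions (hypothetical distributions, chosen for illustration) makes the definition concrete; it also exposes the problem discussed next, since a word with zero probability in Q makes the divergence infinite:</p>

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log2(p_i / q_i).
    # A term is infinite whenever p_i > 0 but q_i = 0.
    total = 0.0
    for word, pi in p.items():
        if pi == 0:
            continue
        qi = q.get(word, 0.0)
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

P = {"brexit": 0.5, "border": 0.5}
Q = {"brexit": 0.5, "trade": 0.5}
# "border" never appears in Q, so kl_divergence(P, Q) is infinite.
```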
        <p>
          Applying KL divergence directly to two Twitter corpora, however, is likely
to raise issues [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. If a word appears in only one corpus, this divergence will be
infinitely large. To avoid this, Gallagher et al. suggested implementing the
Jensen-Shannon divergence instead, which is a smoothed version of the KL divergence.
The JSD was originally proposed by Lin [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] as:
        </p>
        <p>
          D_{JS}(P \| Q) = H(\pi_1 P + \pi_2 Q) - \pi_1 H(P) - \pi_2 H(Q) \qquad (3)
where H(\cdot) is Shannon's entropy as described in Equation 1, and \pi_1 and \pi_2 are
weights associated with the two probability distributions P and Q, respectively.
Gallagher et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] rephrased JSD as:
        </p>
        <p>
          D_{JS}(P \| Q) = \pi_1 D_{KL}(P \| M) + \pi_2 D_{KL}(Q \| M) \qquad (4)
This solves the issue of infinite divergence by introducing the mixed distribution
M = \pi_1 P + \pi_2 Q, where \pi_1 and \pi_2 are weights proportional to the sizes of P and
Q, with \pi_1 + \pi_2 = 1. JSD has the useful property that it is bounded between 0
and 1. When comparing two texts, a JSD score of 0 indicates that the
word probability distributions in both texts are equal, while a JSD score of 1 indicates
that there is no word that appears in both distributions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
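<p>The boundedness property can be checked with a small sketch of Equation 4 (hypothetical distributions, equal weights): identical distributions give a JSD of 0, and fully disjoint vocabularies give 1:</p>

```python
import math

def jsd(p, q, w1=0.5, w2=0.5):
    # D_JS = w1 * D_KL(P || M) + w2 * D_KL(Q || M), with M = w1*P + w2*Q.
    vocab = set(p) | set(q)
    m = {t: w1 * p.get(t, 0.0) + w2 * q.get(t, 0.0) for t in vocab}
    def kl_to_m(dist):
        return sum(v * math.log2(v / m[t]) for t, v in dist.items() if v > 0)
    return w1 * kl_to_m(p) + w2 * kl_to_m(q)

identical = {"brexit": 0.5, "vote": 0.5}
disjoint_a = {"border": 1.0}
disjoint_b = {"trade": 1.0}
```

<p>Because the mixed distribution M assigns positive probability to every word in either corpus, no term is ever infinite, unlike the raw KL divergence.</p>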
        <p>Another advantage is that, by the linearity of JSD, we can measure the
contribution of individual words to the divergence. The contribution of word i to the JSD
can be calculated as:</p>
        <p>
          D_{JS,i}(P \| Q) = -m_i \log_2 m_i + \pi_1 p_i \log_2 p_i + \pi_2 q_i \log_2 q_i \qquad (5)
where m_i is the probability of observing word i in M. Through Equation 5, we
can easily find the most indicative words of each corpus by sorting the JSD
contributions of every word. JSD has been previously used to compare
two corpora [
          <xref ref-type="bibr" rid="ref16 ref6">6, 16</xref>
          ]. We extend this idea so that we can compare not only tweets
from two countries but also tweets from different time periods.
        </p>
        <p>We can extend Equation 5 so that it can be applied across multiple
probability distributions, which allows more than two corpora to be compared.
The extension of Equation 5 that computes word i's contribution to the divergence
over multiple corpora is given as:
D_{JS,i}(P_1 \| P_2 \| \ldots \| P_n) = -m_i \log_2 m_i + \sum_{j=1}^{n} \pi_j p_{ji} \log_2 p_{ji} \qquad (6)
where p_{ji} is the probability of observing word i in corpus P_j, and m_i is the
probability of observing word i in M. Here, M is a mixed distribution of the n corpora:
M = \sum_{j=1}^{n} \pi_j P_j \qquad (7)
where, again, \pi_1, \pi_2, \ldots, \pi_n are weights proportional to the sizes of P_1 to P_n, with
\pi_1 + \pi_2 + \ldots + \pi_n = 1.</p>
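<p>The extended per-word contribution of Equation 6 can be sketched as follows (toy corpora, hypothetical words; weights taken proportional to corpus sizes as in the text). A word shared equally by all corpora contributes nothing, while a word concentrated in one corpus contributes strongly:</p>

```python
import math
from collections import Counter

def jsd_contributions(corpora):
    # Per-word contribution: -m_i*log2(m_i) + sum_j pi_j * p_ji * log2(p_ji),
    # where pi_j is proportional to the size of corpus j (Equation 6).
    sizes = [sum(c.values()) for c in corpora]
    total = sum(sizes)
    weights = [s / total for s in sizes]
    probs = [{w: n / s for w, n in c.items()} for c, s in zip(corpora, sizes)]
    scores = {}
    for word in set().union(*corpora):
        m = sum(wt * p.get(word, 0.0) for wt, p in zip(weights, probs))
        scores[word] = -m * math.log2(m) + sum(
            wt * p[word] * math.log2(p[word])
            for wt, p in zip(weights, probs) if word in p)
    return scores

corpora = [Counter("brexit border border vote".split()),
           Counter("brexit trade trade vote".split())]
scores = jsd_contributions(corpora)
# "border" and "trade" distinguish the corpora; "brexit" and "vote" do not.
```

<p>Sorting scores in descending order then yields the most indicative terms of each corpus, which is how the rankings in Section 4 are produced.</p>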
        <p>By Equation 6, we can compute the contributions of individual words to the
JSD divergence value over many corpora. In this study, we apply the extended
JSD equation to discover the distinguishing words of tweets from different time
periods. We also extend previous approaches to allow unigrams and bigrams to
be analysed in parallel, and describe our approach to this in the next section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>The FP-growth Algorithm</title>
        <p>
          JSD can be easily applied at both the unigram and bigram levels by considering
unigram or bigram tokens in separate applications of the calculations described
in the previous section and combining the results. This naive approach, however,
leads to an unsatisfactory result in which the component unigrams of each bigram
will also appear in any list of the most divergent terms. To address this issue
we apply the FP-growth algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to discover all frequent sets of tokens
which satisfy a minimal support level (in our implementation, a frequency equal
to at least the square root of the number of words in the corpus). If a unigram
and a bigram are included in the same frequent set of tokens, the unigram
can be recognized as redundant and removed. By using FP-growth to filter out
redundant information, we can analyse bigrams together with unigrams to give
a better analysis.
        </p>
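<p>The redundancy-filtering step can be sketched as below (hypothetical term lists and itemsets; in practice the frequent itemsets would be mined by an FP-growth implementation such as the pyfpgrowth package):</p>

```python
def filter_redundant_unigrams(unigrams, bigrams, frequent_itemsets):
    # If a unigram and a selected bigram occur in the same frequent itemset,
    # the unigram carries duplicate information and is dropped.
    redundant = set()
    for itemset in frequent_itemsets:
        if any(b in itemset for b in bigrams):
            redundant |= {u for u in unigrams if u in itemset}
    return [u for u in unigrams if u not in redundant]

unigrams = ["court", "supreme", "border"]
bigrams = ["supreme court"]
frequent_itemsets = [{"supreme", "court", "supreme court"}]
# "supreme" and "court" are filtered out; "border" is kept.
```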
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Case Study</title>
      <p>In this section we present a case study of using our extended JSD approach to
compare the Twitter discussion relating to Brexit in Ireland (including Northern
Ireland) and Great Britain (excluding Northern Ireland) across different time
periods. We are concerned with two questions: (1) how did people's concerns
over Brexit change over the time period, and (2) what are the different concerns
in relation to Brexit in Great Britain and Ireland? We first describe how
we collected a dataset, then describe a set of visualizations used to present the
analysis, and finally the insights extracted from the analysis.</p>
      <sec id="sec-4-1">
        <title>Data Collection</title>
        <p>Our dataset was obtained from Twitter using the Twitter Get Search API
(https://dev.twitter.com/rest/reference/get/search/tweets). We
collected tweets relating to Brexit from Ireland and Great Britain over the time
period from 15/01/2017 to 23/04/2017. To collect tweets relating to Brexit we
specify that a tweet must contain at least one of the search terms "brexit",
"post-brexit", "hard-brexit", "soft-brexit", "postbrexit", "softbrexit", or "hardbrexit".
To separate tweets from Ireland and Great Britain we specify spatial regions
through a centre and radius (as allowed through the API). The details are:
- Ireland: latitude: 53.413940, longitude: -7.940989, radius: 300
- GB (south): latitude: 52.674554, longitude: -1.761640, radius: 220
- GB (north): latitude: 56.268001, longitude: -5.185579, radius: 300
Great Britain is divided into two regions, GB (north) and GB (south), with
tweets from both combined into a single corpus.</p>
        <p>After collecting tweets using these criteria we drop all duplicate tweets and
retweets. Our final dataset contained 1,129,754 tweets from Great Britain and
72,148 tweets from Ireland. Before beginning analysis of this dataset we removed
all punctuation (except for the # and @ symbols), converted all text to lowercase,
and removed stop words. Following this we tokenised the text separately into unigrams and
bigrams. (FP-growth, described in Section 3.2, was implemented using the
pyfpgrowth package in Python.)</p>
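<p>The preprocessing pipeline described above can be sketched as follows (the stop-word list shown is an illustrative subset only; the paper does not specify which list was used):</p>

```python
import re

STOP_WORDS = {"the", "a", "to", "of", "and", "in", "is"}  # illustrative subset

def preprocess(tweet):
    # Lowercase, strip punctuation except # and @, and drop stop words.
    text = re.sub(r"[^\w\s#@]", "", tweet.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

def to_bigrams(tokens):
    # Adjacent token pairs, joined as "w1 w2" strings.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = preprocess("Theresa May to trigger Article 50! #brexit")
# tokens -> ['theresa', 'may', 'trigger', 'article', '50', '#brexit']
```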
      </sec>
      <sec id="sec-4-2">
        <title>The Evolution of Attention</title>
        <p>Figure 1(a) illustrates how the top Brexit concerns of British
people changed over the period from 15/01/2017 to 23/04/2017, according to an
analysis of the collected tweets. We have divided this time period into 14 periods of
7 days, each of which defines a corpus. We apply Equation 6 to compute how
much individual unigrams and bigrams contribute to the divergence across these
14 time periods. Then we rank the unigrams and bigrams according to their
contribution scores.</p>
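<p>Slicing the collection period into weekly corpora can be sketched as follows (dates are those given above; the function name is illustrative). The 98-day range from 15/01/2017 to 23/04/2017 divides exactly into 14 seven-day periods:</p>

```python
from datetime import date, timedelta

def weekly_windows(start, end, days=7):
    # Split [start, end) into consecutive fixed-length windows; the tweets
    # falling in each window form one corpus for the multi-corpus JSD.
    windows, cur = [], start
    while cur < end:
        windows.append((cur, min(cur + timedelta(days=days), end)))
        cur += timedelta(days=days)
    return windows

periods = weekly_windows(date(2017, 1, 15), date(2017, 4, 23))
# 98 days / 7 -> 14 weekly periods
```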
        <p>In Figure 1(a), each term is represented by a horizontal bar whose
width indicates the JSD contribution score. Each term appears only once in
the graph, in the time period with the highest probability of seeing that term.
The vertical position of each term represents the rank of the term's JSD contribution
score within a given period. For example, the term "supreme court"
is located at rank 1 for the week starting on January 22. The JSD contribution
score for this term is high, denoting that it is the most distinctive phrase between
Jan 22 and Jan 29. To produce Figure 1(a) we select the top 50 bigrams and top
50 unigrams and then use the FP-growth algorithm to remove unigrams that
carry duplicate information.</p>
        <p>From Figure 1(a), we can see the change in British people's concerns across
different time periods. In general, before February, British people were concerned
with topics surrounding British Prime Minister Theresa May's speech, Article
50, and the Supreme Court. In contrast, during February and March, the British
people's concerns seemed to become distracted by many other issues when
considering Brexit, such as Budget 2017 and the Scottish independence referendum,
as evidenced by the presence of the terms "#budget2017", "#scotref", and
"#indyref2". However, at the end of March, the topics around Article 50 returned
to public attention: Theresa May signed the letter to trigger Article 50 and
instigate Brexit on March 29th, which also explains the high ranks of the phrases
"#brexitday", "may trigger", and "#article50 #brexit" at that time. Finally,
in mid-April, people's focus appears to be dominated by topics relating to the
2017 British general election.</p>
        <p>Figure 1(b) shows the most distinctive phrases from Irish tweets over the
same time periods. The result shows an extremely similar situation to the British
one. Overall, from January 15 to April 23, the focus of Twitter attention to
Brexit in Ireland surrounds Theresa May's speech, the triggering of Article 50,
the Scottish independence referendum, and the British general election. There
are some differences, however, evidenced by the appearance of terms like "united
ireland" and "hard border". In the next section we focus explicitly on analysing
these differences.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Comparing Brexit in Ireland and Great Britain</title>
        <p>To compare the differences between the Twitter discussions of Brexit in Ireland
and Great Britain across the full time period covered by our dataset, we apply
Equation 5 to determine the most divergent unigrams and bigrams. We present
the results in Figure 2. We list the top 20 unigrams and top 20 bigrams from each
of Ireland and Great Britain (again, we remove the unigrams that are included
in high-ranking bigrams according to the result of the FP-growth algorithm).
The length of the bars corresponds to JSD contribution scores with higher values
indicating more distinguishing words. A bar to the left (shaded red) indicates
that a term is more common in British tweets, while a bar to the right (shaded
blue) indicates that a term is more common in Irish tweets.</p>
        <p>From Figure 2 we can see that the top two terms spoken about in Irish tweets,
but not in British tweets, are "ni" (an abbreviation for Northern Ireland) and
"northern ireland". This illustrates that the key difference between the concerns
regarding Brexit expressed on Twitter by people from Ireland and those
expressed by people in Great Britain is a focus on the impact on Northern Ireland
and, in particular, its border with the Republic of Ireland. We see this echoed in
terms like "stormont" (the seat of parliament in Northern Ireland), "hard
border", "sinn fein" (an Irish republican political party), "united ireland", "good
friday", "friday agreement" (the Good Friday Agreement was a key instrument
in peace talks between the Republic of Ireland and Northern Ireland), "common
travel", and "enda kenny" (the Irish prime minister at the time that the tweets
were collected).</p>
        <p>Fig. 3. (a) 30 terms with the highest frequency from British tweets and (b) 30 terms with
the highest frequency from Irish tweets.</p>
        <p>Conversely, the British tweets seem focused on local issues such as "corbyn"
(the British Labour Party leader Jeremy Corbyn), "#ukip" (the Eurosceptic
United Kingdom Independence Party), and the "nhs" (a shortened form of
"National Health Service"); and on potential impacts of Brexit such as "eu citizens",
"hard brexit", and "london".</p>
        <p>We can see from Figure 2, however, that the JSD contribution scores differ
across the British and Irish corpora: Irish tweets tend to give rise to
unigrams and bigrams with higher JSD contributions than British tweets. A
possible explanation is that the JSD method looks for
unigrams or bigrams that consistently appear in one corpus but rarely appear
in the other. Figure 3 shows the 30 most frequent terms in each corpus. These
graphs show that the main topics surrounding Brexit (e.g. Article 50, the Scottish
independence referendum, and Theresa May) are discussed in both British and
Irish tweets, as evidenced by the high frequency of terms such as "theresa may" and
"article 50". But Irish tweets alone have a set of frequently mentioned
border-related topics, as evidenced by the high frequency of terms such as "northern ireland".
In contrast, British tweets do not seem to have a set of frequently
mentioned topics that do not appear in Irish tweets. The corpus of British tweets
is also much larger than the corpus of Irish tweets, and this might also contribute
to the relatively low JSD contribution scores for unigrams and bigrams from
British tweets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we have proposed an approach to analysing differences in the Twitter
discourse of different groups around the same topics using corpus comparison
techniques. Specifically, we have used an extended version of Jensen-Shannon
divergence coupled with the application of the FP-growth algorithm to merge
unigram and bigram analyses. We have also demonstrated how this analysis can
be visualized.</p>
      <p>We demonstrate this approach through a case study that analyses Twitter
discussion of Brexit from Ireland and Great Britain across different time periods.
Through our analysis we can see how concerns over Brexit evolved over the period
studied, as well as extracting the main differences between the concerns in the
two countries: primarily a focus, in Irish tweets, on the impact on the border with Northern
Ireland.</p>
      <p>This case study also exposes some of the drawbacks of this approach. For
example, it appears that the results of JSD are vulnerable to the effects of
spam tweets. The appearance of the phrases "bridging loan", "#brexit
bridging", "#Manchester #Capital", etc. reveals that our approach is sensitive to the
specific phrases in spam tweets. The reason is simple: if many spam tweets that
appear in only one corpus contain very specific terms (e.g. "bridging loan"),
then those specific terms will be recognized by the JSD approach as
distinguishing. For example, we see in our British corpus (but not in our Irish corpus)
various business promotion tweets like "How much can I borrow? -
#Manchester #Capital #Bridging Loans #Brexit https://t.co/nDg5ZNKVdf bridging
loan, uk, Manchester" that include the distinguishing terms "bridging loan",
"#Manchester #Capital", and so on. Though these tweets contain the hashtag
"#Brexit", they are actually unrelated to Brexit issues. The JSD approach can
be easily hijacked when this happens.</p>
      <p>There is also a tension between successfully displaying divergence scores along
with frequency in a way that is easy for readers to comprehend. We will address
these issues in future work. In future work we will also address how similar
techniques can be used to compare corpora that arise from quite different sources,
for example online news sources and Twitter.</p>
      <p>Acknowledgement. This research was kindly supported by a Teagasc Walsh
Fellowship award (2016053).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aramaki</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maskawa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Twitter catches the flu: detecting influenza epidemics using twitter</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          (pp.
          <fpage>1568</fpage>
          -
          <lpage>1576</lpage>
          ) (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bollen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pepe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena</article-title>
          .
          <source>In: ICWSM, 11</source>
          , pp.
          <fpage>450</fpage>
          -
          <lpage>453</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brill</surname>
          </string-name>
          , E.:
          <article-title>A simple rule-based part of speech tagger</article-title>
          .
          <source>In: Proceedings of the workshop on Speech and Natural Language</source>
          (pp.
          <fpage>112</fpage>
          -
          <lpage>116</lpage>
          ).
          Association for Computational Linguistics (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Church</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanks</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Word association norms, mutual information, and lexicography</article-title>
          .
          <source>In: Computational linguistics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>22</fpage>
          -
          <lpage>29</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Conover</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonalves</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratkiewicz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Predicting the political alignment of twitter users</article-title>
          .
          <source>In: Privacy, Security, Risk and Trust (PASSAT)</source>
          and
          <source>2011 IEEE Third Inernational Conference on Social Computing (SocialCom)</source>
          ,
          <year>2011</year>
          IEEE Third International Conference on (pp.
          <fpage>192</fpage>
          -
          <lpage>199</lpage>
          ). IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gallagher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reagan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Divergent discourse between protests and counter-protests:# blacklivesmatter and# alllivesmatter</article-title>
          .
          <source>In: arXiv preprint arXiv:1606.06820</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breimyer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samatova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>WebBANC: Building semantically-rich annotated corpora from web user annotations of minority languages</article-title>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Mining frequent patterns without candidate generation</article-title>
          .
          <source>In: ACM sigmod record</source>
          (Vol.
          <volume>29</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          ) (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kelleher</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mac Namee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Arcy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies</article-title>
          . MIT Press (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Comparing corpora</article-title>
          .
          <source>In: International journal of corpus linguistics</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>97</fpage>
          -
          <lpage>133</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>On information and sufficiency</article-title>
          .
          <source>In: The annals of mathematical statistics</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          (
          <year>1951</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Leech</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fallon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Computer corpora: what do they tell us about culture</article-title>
          .
          <source>In: ICAME journal</source>
          ,
          <volume>16</volume>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Divergence measures based on the Shannon entropy</article-title>
          .
          <source>In: IEEE Transactions on Information theory</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lovins</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Development of a stemming algorithm</article-title>
          .
          <source>In: Mech. Translat. &amp; Comp. Linguistics</source>
          ,
          <volume>11</volume>
          (
          <issue>1-2</issue>
          ), pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>O'Callaghan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prucha</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greene</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conway</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carthy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Online social media in the Syria conflict: Encompassing the extremes and the in-betweens</article-title>
          .
          <source>In: Advances in Social Networks Analysis and Mining (ASONAM)</source>
          ,
          <year>2014</year>
          IEEE/ACM International Conference on (pp.
          <fpage>409</fpage>
          -
          <lpage>416</lpage>
          ). IEEE (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pechenick</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Is language evolution grinding to a halt? The scaling of lexical turbulence in English fiction suggests it is not</article-title>
          .
          <source>In: Journal of Computational Science</source>
          ,
          <volume>21</volume>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>37</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rayson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garside</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Comparing corpora using frequency profiling</article-title>
          .
          <source>In: Proceedings of the workshop on Comparing Corpora</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ). Association for Computational Linguistics (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>In: Information processing &amp; management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          )
          , pp.
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eichstaedt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dziurzynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stillwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seligman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Personality, gender, and age in the language of social media: The open-vocabulary approach</article-title>
          .
          <source>In: PloS one</source>
          ,
          <volume>8</volume>
          (
          <issue>9</issue>
          ), p.
          <fpage>e73791</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A mathematical theory of communication</article-title>
          .
          <source>In: ACM SIGMOBILE Mobile Computing and Communications Review</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>3</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Signorini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Segre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polgreen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic</article-title>
          .
          <source>In: PloS one</source>
          ,
          <volume>6</volume>
          (
          <issue>5</issue>
          ), p.
          <fpage>e19467</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Web-based ontology learning with isolde</article-title>
          .
          <source>In: Proc. of the Workshop on Web Content Mining with Human Language at the International Semantic Web Conference</source>
          ,
          Athens, GA, USA (Vol.
          <volume>11</volume>
          ) (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zappavigna</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Discourse of Twitter and social media: How we use language to create affiliation on the web</article-title>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>