=Paper= {{Paper |id=Vol-1743/paper2 |storemode=property |title=Measuring Citizen Participation in South African Public Debates using Twitter: An Exploratory Study |pdfUrl=https://ceur-ws.org/Vol-1743/paper2.pdf |volume=Vol-1743 |authors=Selvas Mwanza,Hussein Suleman |dblpUrl=https://dblp.org/rec/conf/simbig/MwanzaS16 }} ==Measuring Citizen Participation in South African Public Debates using Twitter: An Exploratory Study== https://ceur-ws.org/Vol-1743/paper2.pdf
    Measuring citizen participation in South African public debates using
                       Twitter: An exploratory study

                   Selvas Mwanza                                    Hussein Suleman
               ICT4D Research Centre                          Department of Computer Science
               University of Cape Town                           University of Cape Town
               Cape Town, South Africa                           Cape Town, South Africa
             smwanza@cs.uct.ac.za                              hussein@cs.uct.ac.za


Abstract

This paper addresses the task of measuring Twitter social attributes that can be used for detecting patterns that show user participation in public debates in South Africa. We propose a method that leverages observable information on Twitter such as use of language, retweeting user behaviour, and the relationship between topics and the user social network graph. Our experimental results suggest high degrees of citizen participation: people in an otherwise multilingual country tweet in a dominant language; there is more original commentary and interactive discussion; and topics often span natural online communities.

1   Introduction

With its large user base and the ease of publishing content, Twitter has become an ideal platform for many people to communicate and also serves as a platform for expressing opinions on different topics like politics, sports and socio-economic issues. Users on Twitter can converse and interact in different ways. A user can follow another user. A user who follows another user subscribes to receive Twitter messages posted by the followed user. Users can reference each other in messages using the @ symbol followed by the username (e.g., I miss @cindy my best friend). Users can also forward a message to others. Twitter adds the keyword RT @username at the beginning of all forwarded tweets. The username after the @ symbol is the name of the user who originally posted the message. In addition, Twitter users can use a # symbol to indicate what the message is about.

In 2015, university students in South Africa protested against the increase in school fees (Griffin, 2015). This was mirrored on Twitter when the #FeesMustFall hash tag created for the protests trended on Twitter worldwide. This provides evidence of the adoption of Twitter by citizens in South Africa as a platform to participate in socio-economic issues.

Social media mining is the process of representing, analysing, and extracting actionable patterns from social media data (Zafarani et al., 2014). Twitter data has been mined by different researchers around the world. Examples of Twitter mining include: financial prediction (Mao et al., 2012), extracting market and business insights (Park and Chung, 2012), political analysis (Monti et al., 2013), mass movement analysis (Borge-Holthoefer et al., 2015) and monitoring of natural disasters and crises (Takeshi et al., 2010). Although a lot of research has been done, little attention has been given to Twitter data produced in Africa.

In this paper, we address the task of measuring citizen participation in public debates on Twitter. We use standard methods like language detection in text, graph partitioning and graph centrality measures to detect patterns of use of language, retweeting user behaviour, and the relationship between topics and user communities to measure user participation in public debates in South Africa.

The paper is organized as follows. Section 2 introduces the literature review on social media analysis. Section 3 describes in detail our methodology for measuring citizen participation in South Africa using Twitter data, while Section 4 reports on the experiment design and the results. Finally, in Section 5 we discuss the conclusions and outline future work.

2   Literature Review

This section looks at previous work that is related to our work.

2.1 Graph partitioning

Graph partitioning or community detection aims to identify groups in a graph by only using the information encoded in the graph topology (Lancichinetti and Fortunato, 2009). Lancichinetti and Fortunato (2009) reviewed various disjoint community detection algorithms. Disjoint community detection algorithms partition a graph into disjoint groups and have wide application. Recently, with the introduction of social media mining, attention has been given to overlapping community detection algorithms. Overlapping community detection algorithms identify a set of partitions that are not necessarily disjoint (Xie et al., 2013). A node in the graph can be found in more than one partition. People in social media usually have connections to several social groups like family, friends, and colleagues. Java (2007) used an overlapping detection algorithm called the clique percolation method (CPM) to detect overlapping communities in a Twitter network. CPM was used to find how communities connect to each other by overlapped components. Overlapping community detection has also been used to explain how information cascades through Twitter communities (Barbieri et al., 2013). The authors used a community detection algorithm to find the level of authority and passive interest of a node in each community it belongs to.

2.2 Graph centrality measures

Node centrality measures node involvement in the walk structure of a network (Freeman, 1978). Freeman defined three centrality measures, namely, degree, closeness and betweenness. Degree centrality is a count of the number of edges incident upon a given node. Closeness is based on the total geodesic distance from a given node to all other nodes, where the length of a walk is the number of edges it contains and the shortest path between two nodes is known as a geodesic. Betweenness measures how many geodesics pass through a given vertex. Centrality measures have been used in ranking and understanding nodes in social networks. Ediger (2010) used betweenness centrality to rank nodes in clusters of conversations on Twitter data. The betweenness centrality score has also been used to detect spammers in Twitter (Yang et al., 2011). The authors used the betweenness centrality to rank users in a graph and then used the ranking score to identify spammers.

2.3 Use of language detection in Twitter

Language detection is the task of detecting the natural language in which a document is written (Lui et al., 2004). Hong and Convertino (2011) used language detection in Twitter data to discover cross-language differences in adoption of features such as URLs, hashtags, mentions, replies, and retweets. The authors used a combination of the LingPipe text classifier and the Google language API to classify 62,556,331 tweets into languages. The data was downloaded for a period of four weeks. The authors then analyzed how each cluster uses URLs, hashtags, mentions, replies, and retweets. Use of language has also been used as a primary tool for detecting spam in tweets (Martinez-Romo and Araujo, 2013). The authors examine the use of language in the topic, a tweet, and the page linked from the tweet. They make an assumption that the language model for a spam tweet will be substantially different: the spammer is usually trying to divert traffic to sites that have no semantic relation. They exploit this divergence between the language models to effectively classify tweets as spam or non-spam.

3   Methodology

In this section we describe in detail three types of social attributes that can help in measuring citizen participation: use of language, retweeting user behaviour, and relationship between topics and the user network graph. We use these three metrics to detect patterns that measure citizen participation in public debates in South Africa.

3.1 Use of Language

South Africa is a multilingual country with eleven official languages, namely: English, Afrikaans, Zulu, Xhosa, Ndebele, Northern Sotho, Southern Sotho, Swati, Tsonga, Tswana and Venda. English and Afrikaans are high resource languages while the other languages, which are Bantu languages, are low resource languages. In our work, we are interested in detecting English and Afrikaans in tweets. Tweets that cannot be detected as English or Afrikaans are categorized as other.

Tweets are informal. They contain special tokens
such as @ for usernames, # for trending topics and they have http links for related content. They also contain slang, misspellings and grammatical errors. We implemented a program called SATwitterCleaner that cleans the dataset before language detection. Cleaning involved doing the following:

 1. Removing usernames: The program removes all usernames in the dataset by searching for words that start with the @ symbol. This follows the convention that all usernames in Twitter messages are prefixed with the @ symbol.

 2. Removing the hash tag (#) symbol in the messages: The program removes all hash tag symbols by searching for the # symbol.

 3. Removing URLs in the messages: Twitter users reference external sources by inserting URLs in their messages. SATwitterCleaner implements a string pattern that identifies URLs in Twitter messages and removes them.

 4. Removing emoticons from the message: An emoticon is a representation of a facial expression used in electronic communication to convey the writer’s feelings. The online community uses different types of emoticons for different expressions. We compiled 15 emoticons used for happy expressions and 11 emoticons used for sad expressions. The program used this list to identify and remove emoticons from messages.

 5. Expanding slang words into their actual meaning: Slang is the use of informal words and expressions that are not considered standard in the speaker’s language or dialect but are considered acceptable in certain social settings. Example: 2b means to be. We created a slang dictionary of 5,364 slang words. Each slang word in the dictionary was mapped to its actual meaning. The slang dictionary was used by SATwitterCleaner to expand all slang words found in the dataset.

 6. Correcting spelling and grammatical errors in English tweets: The program employed the LanguageTool library to correct the grammar in tweets. LanguageTool (LT) is based on surface text processing, without deep parsing, yet it manages to get significantly better results for some languages than commercially available products (Mikowski, 2010). Spelling check and correction was done by using the jazzy spell checker (Idzelis, 2005). The jazzy spell checker integrates the DoubleMetaphone phonetic matching algorithm and the Levenshtein distance using the near-miss strategy. The jazzy spell checker was chosen because it gives suggestions if the word is not properly spelled. SATwitterCleaner employs the spell checker to pick the first option in the suggestion list as a replacement for the misspelled word. The method used in our work for grammar and spell checking is limited to English text as we could not find equivalent library tools for Afrikaans. Hence, only English text was corrected for grammar and spelling.

 7. Replacing repeated characters in words with the correct number of characters: We developed a method for English text that can remove repeated characters in words. English seldom uses words in which a character is repeated more than twice. However, there are words with three character repetition. We compiled a list of 21 English words with three character repetition. The program ignores all the words with repeated characters that are found in the compiled list. Otherwise, if a word has repeated characters, the program first reduces the repeated characters to two. Then, using the jazzy spell checker (Idzelis, 2005), the program checks if the word is a correct English word. If not, the spell checker is used to get the suggested close word. The program then computes the cosine similarity distance between the suggested word and the original word. If the distance is below a threshold, the suggested word is taken as a replacement, otherwise the program skips the replacement. We chose a similarity distance threshold of 1.

After data cleaning, we used a combination of the Naive Bayesian method and simple word statistics to detect the English and Afrikaans tweets. We used LangDetect, which implements a Naive Bayes classifier, using a character n-gram based representation without feature selection, with a set of normalization heuristics to improve accuracy (Nakatani, 2010). Lui and Baldwin (2014) compared the performance of eight off-the-shelf language detection systems to determine which
would be the most suitable for Twitter data. They compared langid.py (Baldwin et al., 2013), CLD2 (McCandless, 2010), LangDetect (Nakatani, 2010), LDIG (Nakatani, 2012), whatlang (Brown, 2013), YALI (Majlis, 2012), TextCat (Scheelen, 2003) and MSR-LID (Goldszmidt et al., 2013). They compared the systems on four different Twitter datasets. They found that LDIG outperforms all the other systems, though it supports a limited number of languages and Afrikaans is not one of them. Overall, they concluded that, in their off-the-shelf configuration, only three systems (LangDetect, langid.py, CLD2) perform consistently well on language detection of Twitter messages. Our Twitter message cleaner, SATwitterCleaner, and the language detection program were developed in Java, hence we chose LangDetect because it has Java support. Simple word statistics classify tweets by counting the number of words in a tweet that are English or Afrikaans. If the proportion of such words is higher than or equal to 50%, the tweet is classified as English or Afrikaans respectively. All the tweets that were not detected as English by LangDetect were classified by the simple word statistics. This allowed us to compensate for the inaccuracy of the LangDetect system. Only tweets with more than three words were considered for language detection.

3.2 Retweeting user behaviour

Twitter adds the keyword RT @username to all forwarded tweets. RT means retweet and @username refers to the name of the user who originally made the tweet.

In our work, we want to measure how many tweets and retweets are present in the dataset. To find the number of original tweets, we counted all tweets that do not start with RT @. To find the number of retweets in the dataset, we counted all the tweets that start with the RT @ keyword.

3.3 Relationship between topics and user network graph

We created a social graph using retweets. Galuba (2010) showed that retweets are the most powerful mechanism to diffuse information and a strong indication of the direction of information flow in Twitter. We created a graph using retweets because we wanted to see and measure how a graph forms around the tweets. Users form vertices in the graph. We add an edge from user @A to user @B whenever @A retweets a tweet from @B. We treat the graph as undirected, so an edge from @A to @B also connects @B back to @A. All loops are discarded from the graph. Loops are formed when a user retweets his/her own tweet. We also ignore duplicate user interactions so that only unique user interactions are represented in the graph. Our graph had 30,114 vertices and 55,578 edges. In this paper we analyse the graph at two different levels, network level and group level. Network level is the view of the entire graph. Group level is the view of sub-graphs/communities in the graph.

Network level

At a network level we calculated the betweenness centrality of all the nodes in the graph. Freeman (1978) defines betweenness centrality as follows: let g_{ij} denote the number of geodesic paths from node i to node j, and let g_{ikj} denote the number of geodesic paths from i to j that pass through intermediary k. Then the betweenness centrality of node k is defined as:

    C_B(k) = \sum_{i < j,\; i \neq k,\; j \neq k} \frac{g_{ikj}}{g_{ij}}

Betweenness centrality measures the influence/centrality of a node in a graph. According to the definition, a node with high betweenness centrality sits at a connection point of subgraphs. Such a node plays a major role in the movement of data from one subgraph to the other. Freeman applied the betweenness to connected and undirected graphs. Social networks often share common characteristics. Natural clusters form, but the clusters do not partition the graph (Mislove et al., 2007). We use this characteristic to make an assumption that our graph will be largely connected and hence the betweenness can be applied.

We also performed another measurement at the network level, which we called the resourceful measure. We calculated how many tweets from each node in the graph have been retweeted at least once. A node with a high resourceful measure has a high number of tweets retweeted at least once by other users. The resourceful measure indicates how many tweets each node has contributed in the graph. In our work, we compared the resourceful measure with the betweenness centrality measure of nodes to find the relationship between the top producers of tweets in the graph and the top users who propagate tweets to subgraphs. The Jaccard similarity coefficient (Jaccard, 1902) is a common
index for binary variables. It is defined as the quotient between the intersection and the union of the pairwise compared variables among two objects. Jaccard is calculated as follows: given two groups B and C, the similarity is J(B, C) = a / (a + b + c), where a = number of elements present in both groups, b = number of elements present only in group B, and c = number of elements present only in group C. The Jaccard coefficient is a number between 0 and 1. If the coefficient is 0, the two groups have no elements in common. If the coefficient is 1, then the two groups are completely identical. We used the Jaccard coefficient to measure the similarity between the nodes with high betweenness centrality and the nodes with high resourceful measure.

Group level

We partitioned the graph into communities. Xie, Kelley and Szymanski (2013) did a review of the state of the art in overlapping community detection algorithms. They reviewed a total of fourteen algorithms and concluded that, for low overlapping density networks, SLPA (Xie et al., 2011), OSLOM (Lancichinetti et al., 2011), Game (Chena et al., 2010) and COPRA (Gregory, 2010) offer better performance than the other tested algorithms. For networks with high overlapping density and high overlapping diversity, both SLPA and Game provide relatively stable performance. We evaluated two algorithms, namely, COPRA and SLPA. We observed that SLPA performed better than COPRA on our graph both in computing time and modularity. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities (Girvan and Newman, 2002). After evaluation, we used the SLPA overlapping algorithm for community detection in the graph.

 Topic Categories        Topics
 Controversial topics    #OscarPistorius
                         Kim Martin
                         #FeesMustFall
                         #Sarafina
                         BCCSA
                         Esethu
                         #TaxiStrike
                         Durban protests
                         Mbuyisa
 Developmental topics    #ProjectKhanya
                         Wastestopswithme
                         Cleanerjoburg
                         TeamUpToCleanUp
                         #JobSeekersWednesday
                         Durban protests
 Entertainment topics    Pearl thusi
                         #ExpressoShow
                         #BangOut
                         #FreshAT5
                         #KentPhonikFridays
                         #GenNext2016
                         #iGazi
                         #DateMyFamily
                         #FridayStandIn
                         Jessica Nkosi
                         Ertugral
                         #AskAMan
 Political topics        #NandosDMgathering
                         #SpyTapes
                         #ANCGPManifestoLaunch
                         #ANCFriday
                         #FillUpFNBStadium
                         #FillUpFNB
                         FNB Stadium
                         Luthando Mbinda
                         Mavuso
 Road accident topics    Bellville
                         N1 North
                         #PTATraffic
 National Event topics   #YouthDay
                         Soweto
 Other                   #WomenMustKnowThat
                         Shoprite
                         #TNABizBrief

Table 1: Keywords used for downloading tweets. Keywords were determined by following trending topics in South Africa from 4th June 2016 to 19th June 2016.
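
The cleaning steps and the simple word statistics of Section 3.1 can be illustrated with a minimal Python sketch. This is not the SATwitterCleaner program, which was written in Java; the function names, the tiny emoticon, slang and vocabulary lists, and the choice to test English before Afrikaans are all assumptions made for illustration. Grammar correction, spell checking and the LangDetect classifier are left out.

```python
import re

# Hypothetical, tiny stand-ins for the resources described in Section 3.1.
EMOTICONS = [":)", ":-)", ":(", ":-("]          # the paper used 15 happy + 11 sad
SLANG = {"2b": "to be", "u": "you", "gr8": "great"}  # the paper used 5,364 entries
TRIPLE_LETTER_WORDS = set()  # placeholder: the paper's 21-word list is not given

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # a character repeated three or more times

# Hypothetical mini vocabularies for the word-statistics classifier.
ENGLISH_WORDS = {"i", "miss", "my", "best", "friend", "the", "is", "to", "be"}
AFRIKAANS_WORDS = {"ek", "is", "die", "my", "vriend", "nie", "baie", "dankie"}


def clean_tweet(text):
    """Apply cleaning steps 1-5 and the first half of step 7."""
    text = URL_RE.sub(" ", text)        # step 3: remove URLs
    text = USER_RE.sub(" ", text)       # step 1: remove @usernames
    text = text.replace("#", "")        # step 2: remove the # symbol
    for emoticon in EMOTICONS:          # step 4: remove emoticons
        text = text.replace(emoticon, " ")
    words = []
    for word in text.split():
        lower = word.lower()
        if lower in SLANG:              # step 5: expand slang
            words.append(SLANG[lower])
        elif lower in TRIPLE_LETTER_WORDS:
            words.append(word)          # legitimate triple-letter word
        else:                           # step 7: reduce repeats to two
            words.append(REPEAT_RE.sub(r"\1\1", word))
    return " ".join(words)


def classify_by_word_statistics(text):
    """Label a cleaned tweet by the share of known words (threshold 50%)."""
    words = [w.lower() for w in text.split()]
    if len(words) <= 3:                 # only tweets with more than three words
        return None
    for label, vocab in (("english", ENGLISH_WORDS),
                         ("afrikaans", AFRIKAANS_WORDS)):
        hits = sum(1 for w in words if w in vocab)
        if hits / len(words) >= 0.5:
            return label
    return "other"


print(clean_tweet("I miss @cindy sooooo much :) #FeesMustFall http://t.co/abc 2b honest"))
print(classify_by_word_statistics("ek is baie dankie vriend"))
```

In this sketch a word such as "sooooo" is only reduced to "soo"; in the method described above, the spell checker would then be consulted for a suggested replacement.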




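
The graph measures of Sections 3.2 and 3.3 can be sketched in a few lines of standard-library Python. The sketch assumes tweets arrive as (author, text) pairs and that a retweet's text begins with RT @username; all function names are hypothetical. Betweenness is computed with Brandes' algorithm for unweighted graphs, which is one common way to evaluate Freeman's definition and is not necessarily the implementation used in this study.

```python
import re
from collections import defaultdict, deque

RT_RE = re.compile(r"^RT @(\w+)")


def build_retweet_graph(tweets):
    """Return (adjacency, n_original, n_retweets) from (author, text) pairs."""
    adj = defaultdict(set)
    n_original = n_retweets = 0
    for author, text in tweets:
        match = RT_RE.match(text)
        if match is None:
            n_original += 1
            continue
        n_retweets += 1
        source = match.group(1)
        if source != author:           # discard loops (self-retweets)
            adj[author].add(source)    # sets ignore duplicate interactions
            adj[source].add(author)    # the graph is treated as undirected
    return dict(adj), n_original, n_retweets


def resourceful_measure(tweets):
    """Count, per user, the distinct tweets retweeted at least once by others."""
    retweeted = defaultdict(set)
    for author, text in tweets:
        match = RT_RE.match(text)
        if match and match.group(1) != author:
            retweeted[match.group(1)].add(text[match.end():].lstrip(": "))
    return {user: len(texts) for user, texts in retweeted.items()}


def betweenness(adj):
    """Freeman betweenness for an unweighted, undirected graph (Brandes)."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of geodesics from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                   # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2 for v, c in score.items()}  # each pair was counted twice


def jaccard(group_b, group_c):
    """a / (a + b + c): shared elements over all distinct elements."""
    b, c = set(group_b), set(group_c)
    union = b | c
    return len(b & c) / len(union) if union else 0.0


tweets = [
    ("alice", "Students march today"),
    ("bob", "RT @alice: Students march today"),
    ("carol", "RT @bob: Traffic on the N1"),
    ("bob", "RT @alice: Students march today"),   # duplicate interaction
    ("dave", "RT @dave: my own tweet"),            # loop, discarded
    ("bob", "Traffic on the N1"),
]
graph, n_original, n_retweets = build_retweet_graph(tweets)
print(n_original, n_retweets, betweenness(graph), resourceful_measure(tweets))
```

For the comparison described in Section 3.3, one would pass the top-ranked users under each measure to `jaccard`.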
4   Results and Discussion                                     with high betweenness centrality play a major role
                                                               in the movement of tweets in the graph. Users with
This section discusses our experimental set up and             high resourceful measure have a high number of
results.                                                       tweets retweeted at least once by other users. We
4.1 Experimental Settings                                      took the top 50 users with the highest resource-
                                                               ful measure and top 50 users with the highest be-
To do this experiment, trending topics in South                tweenness centrality and computed the Jaccard co-
Africa shown in Table 1 were used to download                  efficient. The coefficient is the number between 0
tweets from 4th June 2016 to 19th June 2016.                   and 1. A coefficient of 0 means the two groups
Twitter implements a proprietary algorithm that                are completely unidentical. If the coefficient is 1,
shows the trending topics in Twitter data. Trend-              then the two groups are completely identical. Our
ing topics can either be hash tagged words or non              calculation yielded a coefficient of 0.23. This re-
hash tagged words. We manually observed trend-                 sult concludes that, top users who provide infor-
ing topics in South Africa from the Twitter web-               mation in the graph are not the top users who prop-
site for 16 days and used the Web API to down-                 agate the tweets through communities. Finally, we
load 131,790 tweets from 37,876 Twitter accounts.              compared topics in the communities to find over-
The topics were categorized into seven (7) groups,             laps. SLPA (Xie et al., 2011) was used to parti-
namely: controversial topics, developmental top-               tion the graph into communities. SLPA is a non-
ics, entertainment topics, political topics, road ac-          determistic algorithm, so we ran the algorithm 11
cident topics, national events topics and other.               times and recorded the average performance. The
                                                               algorithm produced 2,200 communities with an
4.2 Experimental Results
                                                               overlap of 7.3%. This shows that our graph had
We first start with the results of the language                a low overlapping density. Table 2 shows that
detection. Our experiments show that 94.64% of tweets were in English, 2.61% were in Afrikaans, and 2.75% were detected as other, meaning the tweet was neither English nor Afrikaans. During the experiment, we noticed that tweets were repeated in the dataset, because several users can retweet the same tweet. So, before detecting the language, we filtered out all repeated tweets, which reduced the dataset to 66,378 tweets. The results show that, despite the country having many languages, South Africans tweet predominantly in one common language. This pattern suggests that people tweet so that their message can be read across a larger spectrum of the population.

The next result describes the tweet-retweet behaviour. The downloaded dataset consisted of 58.88% original tweets and 41.12% retweets. This pattern suggests that original contributions outweigh re-sharing in public debates.

The last set of results shows the analysis of the social graph. Our results show that 79.5% of users in our dataset participate in conversation. To measure participation in conversation, we counted all the users in our dataset who retweeted other users' tweets or whose tweets were retweeted by others. We used the Jaccard coefficient to measure the similarity between the set of users with high betweenness centrality and the set of users with a high resourceful measure. Users

all communities tweeted about Oscar Pistorius and there is no clear-cut division among communities with regard to topics. Although communities focus on certain topics (groups 5 and 10 talk more about political topics, group 3 about entertainment topics), all communities talk about other issues too. These graph patterns suggest that citizens participate in public debates on a variety of topics.

5   Conclusions and Future Work

We presented social attributes that help identify patterns that measure citizen participation in public debates in South Africa. Africa is highly multilingual, hence we chose the use of language as an attribute that can indicate participation in online public discussions. We also considered user retweeting behaviour and how topics relate to online communities. This exploratory study provides the first step in Twitter analysis on South African online data. This paper considers only a snapshot of the South African Twitter data. In future, we aim to consider the temporal aspects of the graph.
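
The two social-graph measurements described in the results, the share of users who take part in a conversation and the Jaccard coefficient between two sets of users, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the `(retweeter, original_author)` edge representation, the function names, and the toy data are all hypothetical, and the way the high-betweenness and high-resourceful user sets were selected is not reproduced here.

```python
def participants(retweet_edges):
    """Users who retweeted someone or whose tweets were retweeted."""
    active = set()
    for retweeter, original_author in retweet_edges:
        active.add(retweeter)
        active.add(original_author)
    return active


def participation_share(retweet_edges, all_users):
    """Percentage of users in the dataset that appear in at least one retweet edge."""
    all_users = set(all_users)
    if not all_users:
        raise ValueError("all_users must not be empty")
    return 100.0 * len(participants(retweet_edges) & all_users) / len(all_users)


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two user sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Toy data (hypothetical): five users, three retweet edges.
users = {"a", "b", "c", "d", "e"}
edges = [("a", "b"), ("c", "b"), ("d", "a")]

print(participation_share(edges, users))          # 80.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

In this toy example user `e` never retweets and is never retweeted, so four of the five users count as participating.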



 Group No.   No. Members   Top 4 topics mentioned in each community   % of topics mentioned from all the topics in Table 1
 1           10564         #OscarPistorius (4043)
                           #NandosDMgathering (175)                   72.09%
                           #Sarafina (89)
                           #SpyTapes (71)
 2           1704          #BangOut (419)
                           #OscarPistorius (18)                       37.21%
                           #Sarafina (8)
                           #AskAMan (8)
 3           1330          #AskAMan (567)
                           #OscarPistorius (88)                       65.12%
                           Shoprite (21)
                           #BangOut (20)
 4           333           #BangOut (87)
                           #Sarafina (14)                             30.23%
                           #AskAMan (6)
                           #OscarPistorius (9)
 5           300           #NandosDMgathering (94)
                           #ANCGPManifestoLaunch (35)                 30.23%
                           #OscarPistorius (16)
                           #YouthDay (5)
 6           288           #FreshAT5 (96)
                           #KentPhonikFridays (10)                    13.95%
                           #BangOut (4)
                           #OscarPistorius (4)
 7           273           #BangOut (81)
                           #OscarPistorius (14)                       23.26%
                           #GenNext2016 (11)
                           #AskAMan (4)
 8           147           #NandosDMgathering (25)
                           #JobSeekersWednesday (22)                  30.23%
                           #ANCGPManifestoLaunch (13)
                           #OscarPistorius (3)
 9           126           #BangOut (46)
                           Shoprite (3)                               20.93%
                           #OscarPistorius (2)
                           Soweto (2)
 10          124           #OscarPistorius (130)
                           #NandosDMgathering (21)                    13.95%
                           #ANCGPManifestoLaunch (3)
                           #FeesMustFall (2)
 11          114           #OscarPistorius (26)
                           #ANCGPManifestoLaunch (23)                 23.26%
                           #NandosDMgathering (17)
                           #AskAMan (15)
 12          102           #OscarPistorius (46)
                           Ertugral (6)                               13.95%
                           Soweto (4)
                           #YouthDay (1)

Table 2: Relationship between topics and communities. The table shows communities with more than
100 members. Column three lists the four most-mentioned topics in each community, with the number of
mentions in brackets. Column four gives the percentage of the topics in Table 1 (the topics used for
downloading the data) that are mentioned in each community.
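
As a worked illustration of column four, the reported percentages are all consistent with whole-number topic counts out of 43 (for example, 31/43 = 72.09% and 6/43 = 13.95%). The total of 43 topics is inferred from the percentages and is not stated in this section, so it should be treated as an assumption. A minimal Python sketch under that assumption, with a hypothetical function name and placeholder topic labels:

```python
def topic_coverage(topics_in_community, all_topics):
    """Percentage of the download topics mentioned at least once in a community."""
    all_topics = set(all_topics)
    if not all_topics:
        raise ValueError("all_topics must not be empty")
    mentioned = set(topics_in_community) & all_topics
    return round(100.0 * len(mentioned) / len(all_topics), 2)


# Placeholder labels standing in for the topics of Table 1 (assumed to be 43).
ALL_TOPICS = [f"topic_{i}" for i in range(43)]

# A community that mentions 31 of the 43 topics, matching the value reported for group 1.
print(topic_coverage(ALL_TOPICS[:31], ALL_TOPICS))  # 72.09

# A community that mentions 6 of the 43 topics, matching groups 6, 10 and 12.
print(topic_coverage(ALL_TOPICS[:6], ALL_TOPICS))   # 13.95
```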




References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media
  text, how different social media sources? In Proceedings of the 6th International Joint Conference
  on Natural Language Processing.
Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2013. Cascade-based community detection.
  Sixth ACM international conference on Web search and data mining, pages 33–42.
Javier Borge-Holthoefer, Walid Magdy, Kareem Darwish, and Ingmar Weber. 2015. Content and network
  dynamics behind Egyptian political polarization on Twitter. ACM Conference on Computer Supported
  Cooperative Work and Social Computing, (18):700–711.
Ralf Brown. 2013. Selecting and weighting n-grams to identify 1100 languages. 16th international
  conference on text, speech and dialogue.
Duanbing Chen, Mingsheng Shang, Zehua Lv, and Yan Fu. 2010. Detecting overlapping communities of
  weighted networks via a local algorithm. Physica A: Statistical Mechanics and its Applications,
  pages 4177–4187.
David Ediger, Karl Jiang, Jason Riedy, David A. Bader, and Courtney Corley. 2010. Massive social
  network analysis: Mining Twitter for social good. 39th International Conference on Parallel
  Processing, pages 583–593.
Linton C. Freeman. 1978. Centrality in social networks conceptual clarification. Social Networks,
  pages 215–239.
Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic, and Wolfgang Kellerer. 2010.
  Outtweeting the twitterers - predicting information cascades in microblogs. WOSN'10 Proceedings of
  the 3rd Conference on Online social networks, pages 3–3.
Michelle Girvan and Mark E. J. Newman. 2002. Community structure in social and biological networks.
  Proceedings of the National Academy of Sciences 99, pages 7821–7826.
Moises Goldszmidt, Marc Najork, and Stelios Paparizos. 2013. Boot-strapping language identifiers for
  short colloquial postings. European Conference on Machine Learning and Principles and Practice of
  Knowledge Discovery in Databases.
Steve Gregory. 2010. Finding overlapping communities in networks by label propagation.
  arXiv:0910.5516 [physics.soc-ph].
Nosakhere Griffin. 2015. Fees must fall and the possibility of a new African university.
  www.Face2FaceAfrica.com.
Lichan Hong and Gregorio Convertino. 2011. Language matters in Twitter: A large scale study. Fifth
  International AAAI Conference on Weblogs and Social Media.
Mindaugas Idzelis. 2005. Jazzy: The Java open source spell checker. http://jazzy.sourceforge.net/.
Paul Jaccard. 1902. Lois de distribution florale. Bulletin de la Société Vaudoise des Sciences
  Naturelles, pages 67–130.
Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. 2007. Why we twitter: understanding
  microblogging usage and communities. 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and
  social network analysis, pages 56–65.
Andrea Lancichinetti and Santo Fortunato. 2009. Community detection algorithms: a comparative
  analysis. arXiv:0908.1062.
Andrea Lancichinetti, Filippo Radicchi, Jose Javier Ramasco, and Santo Fortunato. 2011. Finding
  statistically significant communities in networks. PLoS One, 6(4).
Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. NICTA VRL.
Marco Lui, Jey Han Lau, and Timothy Baldwin. 2004. Automatic detection and language identification
  of multilingual documents. Proceedings of the 2004 ACM Symposium on Applied Computing.
Martin Majlis. 2012. Yet another language identifier. Student Research Workshop at the 13th
  Conference of the European Chapter of the Association for Computational Linguistics, pages 46–54.
Yuexin Mao, Wei Wei, and Bing Wang. 2012. Correlating S&P 500 stocks with Twitter data. ACM
  International Workshop on Hot Topics on Interdisciplinary Social Networks Research, (1):69–72.
Juan Martinez-Romo and Lourdes Araujo. 2013. Detecting malicious tweets in trending topics using a
  statistical analysis of language. Expert Systems with Applications: An International Journal,
  40:2992–3000.
Michael McCandless. 2010. Accuracy and performance of Google's compact language detector.
  http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html.
Marcin Miłkowski. 2010. Developing an open source, rule based proofreading tool. Software: Practice
  and Experience, 40:543–566.
Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee.
  2007. Measurement and analysis of online social networks. Proceedings of the 7th ACM SIGCOMM
  conference on Internet measurement, pages 29–42.




Corrado Monti, Alessandro Rozza, Giovanni Zappella, Matteo Zignani, Adam Arvidsson, and Elanor
  Colleoni. 2013. Modelling political disaffection from Twitter data. International Workshop on
  Issues of Sentiment Discovery and Opinion Mining, (2).
Shuyo Nakatani. 2010. Language detection library (slides).
  http://www.slideshare.net/shuyo/language-detection-library-for-java.
Shuyo Nakatani. 2012. Short text language detection with infinity-gram.
  http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/.
Jaimie Y. Park and Chin-Wan Chung. 2012. When daily deal services meet Twitter: understanding
  Twitter as a daily deal marketing platform. Annual ACM Web Science Conference, (4):233–242.
Frank Scheelen. 2003. libtextcat. http://software.wise-guys.nl/libtextcat/.
Sakaki Takeshi, Okazaki Makoto, and Matsuo Yutaka. 2010. Earthquake shakes Twitter users: Real-time
  event detection by social sensors. International conference on World wide web, (19):851–860.
Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. 2011. SLPA: Uncovering overlapping communities
  in social networks via a speaker-listener interaction dynamic process. IEEE ICDM workshop on
  DMCCI.
Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. 2013. Overlapping community detection in
  networks: The state-of-the-art and comparative study. ACM Computing Surveys (CSUR), 45(43).
Chao Yang, Robert Chandler Harkreader, and Guofei Gu. 2011. Die free or live hard? Empirical
  evaluation and new design for fighting evolving Twitter spammers. 14th international conference on
  Recent Advances in Intrusion Detection, 6961:318–337.
Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. 2014. Social Media Mining. Cambridge University
  Press, NY, USA.



