=Paper=
{{Paper
|id=Vol-1743/paper2
|storemode=property
|title=Measuring Citizen Participation in South African Public Debates using Twitter: An Exploratory Study
|pdfUrl=https://ceur-ws.org/Vol-1743/paper2.pdf
|volume=Vol-1743
|authors=Selvas Mwanza,Hussein Suleman
|dblpUrl=https://dblp.org/rec/conf/simbig/MwanzaS16
}}
==Measuring Citizen Participation in South African Public Debates using Twitter: An Exploratory Study==
Selvas Mwanza, ICT4D Research Centre, University of Cape Town, Cape Town, South Africa (smwanza@cs.uct.ac.za)
Hussein Suleman, Department of Computer Science, University of Cape Town, Cape Town, South Africa (hussein@cs.uct.ac.za)

Abstract

This paper addresses the task of measuring Twitter social attributes that can be used for detecting patterns that show user participation in public debates in South Africa. We propose a method that leverages observable information on Twitter such as use of language, retweeting user behaviour, and the relationship between topics and the user social network graph. Our experimental results suggest high degrees of citizen participation: people in an otherwise multilingual country tweet in a dominant language; there is more original commentary and interactive discussion; and topics often span natural online communities.

1 Introduction

With its large user base and the ease of publishing content, Twitter has become an ideal platform for many people to communicate, and it also serves as a platform for expressing opinions on different topics like politics, sports and socio-economic issues. Users on Twitter can converse and interact in different ways. A user can follow another user; a user who follows another user subscribes to receive the Twitter messages posted by the followed user. Users can reference each other in messages using the @ symbol followed by the username (e.g., I miss @cindy my best friend). Users can also forward a message to others. Twitter adds the keyword RT @username at the beginning of all forwarded tweets; the username after the @ symbol is the name of the user who originally posted the message. In addition, Twitter users can use a # symbol to indicate what the message is about.

In 2015, university students in South Africa protested against the increase in school fees (Griffin, 2015). This was mirrored on Twitter when the #FeesMustFall hashtag created for the protests trended on Twitter worldwide. This provides evidence of the adoption of Twitter by citizens in South Africa as a platform to participate in socio-economic issues.

Social media mining is the process of representing, analysing, and extracting actionable patterns from social media data (Zafarani et al., 2014). Twitter data has been mined by different researchers around the world. Examples of Twitter mining include financial prediction (Mao et al., 2012), extracting market and business insights (Park and Chung, 2012), political analysis (Monti et al., 2013), mass movement analysis (Borge-Holthoefer et al., 2015) and monitoring of natural disasters and crises (Takeshi et al., 2010). Although a lot of research has been done, little attention has been given to Twitter data produced in Africa.

In this paper, we address the task of measuring citizen participation in public debates on Twitter. We use standard methods such as language detection in text, graph partitioning and graph centrality measures to detect patterns in use of language, retweeting user behaviour, and the relationship between topics and user communities, and we use these patterns to measure user participation in public debates in South Africa.

The paper is organized as follows. Section 2 introduces the literature review on social media analysis. Section 3 describes in detail our methodology for measuring citizen participation in South Africa using Twitter data, while Section 4 reports on the experiment design and the results. Finally, in Section 5 we discuss the conclusions and outline future work.
2 Literature Review

This section looks at previous work that is related to our work.

2.1 Graph partitioning

Graph partitioning or community detection aims to identify groups in a graph by only using the information encoded in the graph topology (Lancichinetti and Fortunato, 2009). Lancichinetti and Fortunato (2009) reviewed various disjoint community detection algorithms. Disjoint community detection algorithms partition a graph into disjoint groups and have wide application. Recently, with the introduction of social media mining, attention has been given to overlapping community detection algorithms. Overlapping community detection algorithms identify a set of partitions that are not necessarily disjoint (Xie et al., 2013); a node in the graph can be found in more than one partition. People in social media usually have connections to several social groups like family, friends, and colleagues. Java (2007) used an overlapping detection algorithm called the clique percolation method (CPM) to detect overlapping communities in a Twitter network. CPM was used to find how communities connect to each other through overlapping components. Overlapping community detection has also been used to explain how information cascades through Twitter communities (Barbieri et al., 2013). The authors used a community detection algorithm to find the level of authority and passive interest of a node in each community it belongs to.

2.2 Graph centrality measures

Node centrality measures node involvement in the walk structure of a network (Freeman, 1978). Freeman defined three centrality measures, namely degree, closeness and betweenness.
Degree centrality is a count of the number of edges incident upon a given node. Closeness is the total geodesic distance from a given node to all other nodes (the length of a walk is the number of edges it contains, and the shortest path between two nodes is known as a geodesic). Betweenness measures the geodesics that pass through a given vertex. Centrality measures have been used in ranking and understanding nodes in social networks. Ediger (2010) used betweenness centrality to rank nodes in clusters of conversations on Twitter data. The betweenness centrality score has also been used to detect spammers in Twitter (Yang et al., 2011): the authors used betweenness centrality to rank users in a graph and then used the ranking score to identify spammers.

2.3 Use of language detection in Twitter

Language detection is the task of detecting the natural language in which a document is written (Lui et al., 2004). Hong and Convertino (2011) used language detection in Twitter data to discover cross-language differences in the adoption of features such as URLs, hashtags, mentions, replies, and retweets. The authors used a combination of the LingPipe text classifier and the Google language API to classify 62,556,331 tweets into languages. The data was downloaded over a period of four weeks. The authors then analyzed how each language cluster uses URLs, hashtags, mentions, replies, and retweets. Use of language has also been used as a primary tool for detecting spam in tweets (Martinez-Romo and Araujo, 2013). The authors examine the use of language in the topic, the tweet, and the page linked from the tweet. They make the assumption that the language model of a spam tweet will be substantially different, since the spammer is usually trying to divert traffic to sites that have no semantic relation to the topic. They exploit this divergence between the language models to effectively classify tweets as spam or non-spam.

3 Methodology

In this section we describe in detail three types of social attributes that can help in measuring citizen participation: use of language, retweeting user behaviour, and the relationship between topics and the user network graph. We use these three metrics to detect patterns that measure citizen participation in public debates in South Africa.

3.1 Use of Language

South Africa is a multilingual country with nine official languages, namely: English, Afrikaans, Zulu, Xhosa, Ndebele, Northern Sotho, Tsonga, Tswana and Venda. English and Afrikaans are high resource languages, while the other languages, which are Bantu languages, are low resource languages. In our work, we are interested in detecting English and Afrikaans in tweets. Tweets that cannot be detected as English or Afrikaans are categorized as other.

Tweets are informal. They contain special tokens such as @ for usernames and # for trending topics, and they contain http links to related content. They also contain slang, misspellings and grammatical errors. We implemented a program called SATwitterCleaner that cleans the dataset before language detection. Cleaning involved the following steps (a simplified sketch of steps 1 to 5 appears after the list):

1. Removing usernames: The program removes all usernames in the dataset by searching for words that start with the @ symbol. This follows the convention that all usernames in Twitter messages are prefixed with the @ symbol.

2. Removing the hash tag (#) symbol in the messages: The program removes all hash tag symbols by searching for the # symbol.

3. Removing URLs in the messages: Twitter users reference external sources by inserting URLs in their messages. SATwitterCleaner implements a string pattern that identifies URLs in Twitter messages and removes them.

4. Removing emoticons from the messages: An emoticon is a representation of a facial expression used in electronic communication to convey the writer's feelings. The online community uses different types of emoticons for different expressions. We compiled 15 emoticons used for happy expressions and 11 emoticons used for sad expressions. The program used this list to identify and remove emoticons from messages.

5. Expanding slang words into their actual meaning: Slang is the use of informal words and expressions that are not considered standard in the speaker's language or dialect but are considered acceptable in certain social settings; for example, 2b means to be. We created a slang dictionary of 5,364 slang words. Each slang word in the dictionary was mapped to its actual meaning. The slang dictionary was used by SATwitterCleaner to expand all slang words found in the dataset.

6. Correcting spelling and grammatical errors in English tweets: The program employed the LanguageTool library to correct the grammar in tweets. LanguageTool (LT) is based on surface text processing, without deep parsing, yet it manages to get significantly better results for some languages than commercially available products (Mikowski, 2010). Spelling check and correction was done using the Jazzy spell checker (Idzelis, 2005). The Jazzy spell checker integrates the DoubleMetaphone phonetic matching algorithm and the Levenshtein distance using the near-miss strategy, and it was chosen because it gives suggestions if a word is not properly spelled. SATwitterCleaner employs the spell checker to pick the first option in the suggestion list as a replacement for the misspelled word. The method used in our work for grammar and spell checking is limited to English text, as we could not find equivalent library tools for Afrikaans. Hence, only English text was corrected for grammar and spelling.

7. Replacing repeated characters in words with the correct number of characters: We developed a method for English text that can remove repeated characters in words. English seldom uses words with more than two-character repetition; however, there are words with three-character repetition. We compiled a list of 21 English words with three-character repetition. The program ignores all words with repeated characters that are found in the compiled list. Otherwise, if a word has repeated characters, the program first reduces the repeated characters to two. Then, using the Jazzy spell checker (Idzelis, 2005), the program checks if the word is a correct English word. If not, the spell checker is used to get the closest suggested word. The program then computes the cosine similarity distance between the suggested word and the original word. If the distance is below a threshold, the suggested word is taken as a replacement, otherwise the program skips the replacement. We chose a similarity distance threshold of 1.
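The following is a minimal Python sketch of cleaning steps 1 to 5 described above. The authors' SATwitterCleaner was implemented in Java, so this is an illustrative approximation rather than their code; the slang dictionary and emoticon list here are tiny stand-ins for the resources described in the paper (5,364 slang entries, 26 emoticons).

```python
import re

# Illustrative stand-ins for the resources described above.
SLANG = {"2b": "to be", "u": "you", "gr8": "great"}
EMOTICONS = {":)", ":-)", ":D", ":(", ":-(", ";)"}

def clean_tweet(text):
    """Apply cleaning steps 1-5: usernames, # symbols, URLs, emoticons, slang."""
    text = re.sub(r"@\w+", "", text)            # step 1: remove @usernames
    text = text.replace("#", "")                # step 2: drop the hash tag symbol
    text = re.sub(r"https?://\S+", "", text)    # step 3: remove URLs
    tokens = []
    for tok in text.split():
        if tok in EMOTICONS:                    # step 4: drop emoticons
            continue
        tokens.append(SLANG.get(tok.lower(), tok))  # step 5: expand slang
    return " ".join(tokens)

print(clean_tweet("2b or not 2b :) see https://example.com #FeesMustFall @cindy"))
# -> "to be or not to be see FeesMustFall"
```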
After data cleaning, we used a combination of a Naive Bayes method and simple word statistics to detect the English and Afrikaans tweets. We used LangDetect, which implements a Naive Bayes classifier over a character n-gram based representation without feature selection, with a set of normalization heuristics to improve accuracy (Nakatani, 2010). Lui and Baldwin (2014) compared the performance of eight off-the-shelf language detection systems to determine which would be the most suitable for Twitter data. They compared langid.py (Baldwin et al., 2013), CLD2 (McCandless, 2010), LangDetect (Lui and Baldwin, 2014), LDIG (Nakatani, 2012), whatlang (Brown, 2013), YALI (Majlis, 2012), TextCat (Scheelen, 2003) and MSR-LID (Goldszmidt et al., 2013) on four different Twitter datasets. They found that LDIG outperforms all the other algorithms, although it supports a limited number of languages and Afrikaans is not one of them. Overall, they concluded that, in their off-the-shelf configuration, only three systems (LangDetect, langid.py, CLD2) perform consistently well on language detection of Twitter messages. Our Twitter message cleaner, SATwitterCleaner, and the language detection program were developed in Java, hence we chose LangDetect because it has Java support. Simple word statistics classify tweets by counting the number of words in a tweet that are English or Afrikaans; if the proportion is higher than or equal to 50%, a tweet is classified as English or Afrikaans respectively. All tweets that were not detected as English by LangDetect were classified by the simple word statistics. This allowed us to compensate for the inaccuracy of the LangDetect system. Only tweets with more than three words were considered for language detection.
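A minimal sketch of this two-stage classification, using the Python port of the LangDetect library (the authors used the original Java version). The toy word lists and the exact fall-back logic shown here are illustrative assumptions, not the authors' implementation.

```python
# pip install langdetect  (Python port of the LangDetect library cited above)
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection repeatable

# Toy lexicons; a real run would use full English and Afrikaans word lists.
ENGLISH_WORDS = {"the", "and", "students", "fees", "must", "fall"}
AFRIKAANS_WORDS = {"die", "en", "nie", "is", "het", "van"}

def word_statistics(text):
    """Classify by the share of tokens found in each lexicon (>= 50% rule)."""
    tokens = [t.lower() for t in text.split()]
    en = sum(t in ENGLISH_WORDS for t in tokens) / len(tokens)
    af = sum(t in AFRIKAANS_WORDS for t in tokens) / len(tokens)
    if en >= 0.5:
        return "en"
    if af >= 0.5:
        return "af"
    return "other"

def classify(text):
    """LangDetect first; anything not detected as English goes to word statistics."""
    if len(text.split()) <= 3:          # only tweets with more than three words
        return "other"
    try:
        lang = detect(text)
    except LangDetectException:
        lang = "other"
    return "en" if lang == "en" else word_statistics(text)

print(classify("the students say fees must fall"))
```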
3.2 Retweeting user behaviour

Twitter adds the keyword RT @username to all forwarded tweets. RT means retweet, and @username refers to the name of the user who originally made the tweet. In our work, we want to measure how many tweets and retweets are present in the dataset. To find the number of original tweets, we counted all tweets that do not start with RT @. To find the number of retweets in the dataset, we counted all tweets that start with the RT @ keyword.

3.3 Relationship between topics and user network graph

We created a social graph using retweets. Galuba (2010) showed that retweets are the most powerful mechanism for diffusing information and a strong indication of the direction of information flow in Twitter. We created a graph using retweets because we wanted to see and measure how a graph forms around the tweets. Users form the vertices of the graph. We add an edge from user @A to user @B whenever @A retweets a tweet from @B. We treat the graph as undirected, so an edge from @A to @B also connects @B back to @A. All loops are discarded from the graph; loops are formed when a user retweets his/her own tweet. We also ignore duplicate user interactions so that only unique user interactions are represented in the graph. Our graph had 30,114 vertices and 55,578 edges. In this paper we analyse the graph at two different levels, the network level and the group level. The network level is the view of the entire graph. The group level is the view of sub-graphs/communities in the graph.

Network level

At the network level we calculated the betweenness centrality of all the nodes in the graph. Freeman (1978) defines betweenness centrality as follows: let $g_{ij}$ denote the number of geodesic paths from node $i$ to node $j$, and let $g_{ikj}$ denote the number of geodesic paths from $i$ to $j$ that pass through intermediary $k$. Then the betweenness centrality of node $k$ is

$C_B(k) = \sum_{i < j,\; i \neq k \neq j} \frac{g_{ikj}}{g_{ij}}$

Betweenness centrality measures the influence/centrality of a node in a graph. According to the definition, a node with high betweenness centrality sits at a connection point of subgraphs and plays a major role in the movement of data from one subgraph to the other. Freeman applied betweenness to connected and undirected graphs. Social networks often share common characteristics: natural clusters form, but the clusters do not partition the graph (Mislove et al., 2007). We use this characteristic to make the assumption that our graph will be largely connected and hence betweenness can be applied.
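A minimal Python sketch of the retweet-graph construction and betweenness computation described above, using networkx as a stand-in for the authors' implementation; the `tweets` list of (author, text) pairs is a hypothetical input format.

```python
# pip install networkx
import re
import networkx as nx

# Hypothetical (author, text) pairs; the real dataset had 131,790 tweets.
tweets = [
    ("userB", "RT @userA Fees protest trending #FeesMustFall"),
    ("userC", "RT @userA Fees protest trending #FeesMustFall"),
    ("userC", "RT @userB another comment"),
    ("userA", "RT @userA self-retweet, ignored"),
]

G = nx.Graph()  # undirected, as in the paper
for author, text in tweets:
    match = re.match(r"RT @(\w+)", text)
    if match and match.group(1) != author:     # discard loops (self-retweets)
        G.add_edge(author, match.group(1))     # duplicate edges collapse automatically

betweenness = nx.betweenness_centrality(G)     # Freeman betweenness per node
top_betweenness = sorted(betweenness, key=betweenness.get, reverse=True)[:50]
print(top_betweenness)
```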
We also performed another measurement at the network level, which we call the resourceful measure. We calculated how many tweets from each node in the graph have been retweeted at least once. A node with a high resourceful measure has a high number of tweets retweeted at least once by other users; the resourceful measure thus captures how many tweets each node has contributed to the graph. In our work, we compared the resourceful measure with the betweenness centrality of nodes to find the relationship between the top producers of tweets in the graph and the top users who propagate tweets to subgraphs. The Jaccard similarity coefficient (Jaccard, 1902) is a common index for binary variables, defined as the quotient between the intersection and the union of the two compared sets. Jaccard is calculated as follows: given two groups b and c, the similarity is $a/(a + b + c)$, where a is the number of elements present in both groups, b is the number of elements present only in group b, and c is the number of elements present only in group c. The Jaccard coefficient is a number between 0 and 1: if the coefficient is 0, the two groups have no elements in common; if the coefficient is 1, the two groups are identical. We used the Jaccard coefficient to measure the similarity between the nodes with high betweenness centrality and the nodes with high resourceful measure.
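A short sketch of this comparison, assuming two hypothetical top-50 user sets; the function simply implements the $a/(a + b + c)$ definition given above.

```python
def jaccard(group_b, group_c):
    """Jaccard coefficient a / (a + b + c) for two sets, as defined above."""
    a = len(group_b & group_c)          # elements present in both groups
    union = len(group_b | group_c)      # a + b + c
    return a / union if union else 0.0

# Hypothetical top user sets; the paper compares the top 50 users by resourceful
# measure with the top 50 users by betweenness centrality.
top_resourceful = {"user1", "user2", "user3", "user4"}
top_betweenness = {"user3", "user4", "user5", "user6"}
print(jaccard(top_resourceful, top_betweenness))   # 2 / 6 = 0.33 for this toy example
```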
Group level

We partitioned the graph into communities. Xie, Kelley and Szymanski (2013) reviewed the state of the art in overlapping community detection algorithms. They reviewed a total of fourteen algorithms and concluded that, for networks with low overlapping density, SLPA (Xie et al., 2011), OSLOM (Lancichinetti et al., 2011), Game (Chen et al., 2010) and COPRA (Gregory, 2010) offer better performance than the other tested algorithms, while for networks with high overlapping density and high overlapping diversity, both SLPA and Game provide relatively stable performance. We evaluated two algorithms, namely COPRA and SLPA, and observed that SLPA performed better than COPRA on our graph, both in computation time and in modularity. The modularity of a partition is a scalar value between -1 and 1 that measures the density of links inside communities as compared to links between communities (Girvan and Newman, 2002). After this evaluation, we used the SLPA overlapping algorithm for community detection in the graph.

Table 1: Keywords used for downloading tweets. Keywords were determined by following trending topics in South Africa from 4th June 2016 to 19th June 2016.

Controversial topics: #OscarPistorius, Kim Martin, #FeesMustFall, #Sarafina, BCCSA, Esethu, #TaxiStrike, Durban protests, Mbuyisa
Developmental topics: #ProjectKhanya, Wastestopswithme, Cleanerjoburg, TeamUpToCleanUp, #JobSeekersWednesday, Durban protests
Entertainment topics: Pearl thusi, #ExpressoShow, #BangOut, #FreshAT5, #KentPhonikFridays, #GenNext2016, #iGazi, #DateMyFamily, #FridayStandIn, Jessica Nkosi, Ertugral, #AskAMan
Political topics: #NandosDMgathering, #SpyTapes, #ANCGPManifestoLaunch, #ANCFriday, #FillUpFNBStadium, #FillUpFNB, FNB Stadium, Luthando Mbinda, Mavuso
Road accident topics: Bellville, N1 North, #PTATraffic
National event topics: #YouthDay, Soweto
Other: #WomenMustKnowThat, Shoprite, #TNABizBrief

4 Results and Discussion

This section discusses our experimental set-up and results.

4.1 Experimental Settings

For this experiment, the trending topics in South Africa shown in Table 1 were used to download tweets from 4th June 2016 to 19th June 2016. Twitter implements a proprietary algorithm that identifies trending topics in Twitter data; trending topics can either be hash-tagged words or non-hash-tagged words. We manually observed trending topics in South Africa on the Twitter website for 16 days and used the Web API to download 131,790 tweets from 37,876 Twitter accounts. The topics were categorized into seven groups, namely: controversial topics, developmental topics, entertainment topics, political topics, road accident topics, national event topics and other.

4.2 Experimental Results

We first start with the results of the language detection. Our experiments show that 94.64% of all tweets were in English, 2.61% were in Afrikaans and 2.75% were detected as other, meaning the tweet was neither English nor Afrikaans. During the experiment, we noticed that tweets were repeated in the dataset; this is because users can retweet the same tweet, causing repetition. So, before detecting the language, we filtered out all repeated tweets. After filtering, the number of tweets in the dataset was reduced to 66,378. The results show that, despite having many languages, South Africa tweets in a common language. This pattern suggests that people tweet so that their message can be read across a larger spectrum of the population.

The next result describes the tweet-retweet behaviour. The downloaded dataset had 58.88% tweets and 41.12% retweets. This pattern suggests that there is more original contribution in public debates.

The last set of results shows the analysis of the social graph. Our results show that 79.5% of users in our dataset participate in conversation. To measure participation in conversation, we counted all users in our dataset who retweeted other users' tweets or whose tweets were retweeted by others. We used the Jaccard coefficient to measure the similarity between users with high betweenness centrality and users with high resourceful measure. Users with high betweenness centrality play a major role in the movement of tweets in the graph; users with high resourceful measure have a high number of tweets retweeted at least once by other users. We took the top 50 users with the highest resourceful measure and the top 50 users with the highest betweenness centrality and computed the Jaccard coefficient. The coefficient is a number between 0 and 1: a coefficient of 0 means the two groups have no elements in common, and a coefficient of 1 means the two groups are identical. Our calculation yielded a coefficient of 0.23. This result indicates that the top users who provide information in the graph are not the top users who propagate tweets through communities.

Finally, we compared topics in the communities to find overlaps. SLPA (Xie et al., 2011) was used to partition the graph into communities. SLPA is a non-deterministic algorithm, so we ran the algorithm 11 times and recorded the average performance. The algorithm produced 2,200 communities with an overlap of 7.3%, which shows that our graph had a low overlapping density. Table 2 shows that communities tweeted about Oscar Pistorius and that there is no clear-cut division among communities with regard to topics. Although communities focus on certain topics (groups 5 and 10 talk more about political topics, group 3 about entertainment topics), all communities talk about other issues too. These graph patterns suggest that citizens participate in public debates on a variety of topics.
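A minimal sketch of how the per-community topic counts summarized in Table 2 could be tabulated. The `communities` and `user_topics` structures are hypothetical stand-ins for the SLPA output and for keyword matches against Table 1 (which lists 43 distinct keywords); this is not the authors' actual pipeline.

```python
from collections import Counter

# Hypothetical overlapping communities (member sets) and per-user keyword matches.
communities = [{"user1", "user2"}, {"user2", "user3"}]
user_topics = {
    "user1": ["#OscarPistorius", "#SpyTapes", "#OscarPistorius"],
    "user2": ["#OscarPistorius"],
    "user3": ["#BangOut", "#AskAMan"],
}
TOTAL_KEYWORDS = 43  # distinct keywords listed in Table 1

for i, members in enumerate(communities, start=1):
    counts = Counter(t for u in members for t in user_topics.get(u, []))
    top4 = counts.most_common(4)                       # top mentioned topics
    coverage = 100 * len(counts) / TOTAL_KEYWORDS      # share of Table 1 topics mentioned
    print(f"Group {i}: {top4}, {coverage:.2f}% of Table 1 topics mentioned")
```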
Table 2: Relationship between topics and communities. The table shows communities with more than 100 members. The third column shows the top mentioned topics in each community, with the number of mentions in brackets. The fourth column shows the percentage of topics mentioned in each community out of all the topics used for downloading the data shown in Table 1.

Group | Members | Top 4 topics (mentions) | % of Table 1 topics mentioned
1 | 10564 | #OscarPistorius (4043), #NandosDMgathering (175), #Sarafina (89), #SpyTapes (71) | 72.09%
2 | 1704 | #BangOut (419), #OscarPistorius (18), #Sarafina (8), #AskAMan (8) | 37.21%
3 | 1330 | #AskAMan (567), #OscarPistorius (88), Shoprite (21), #BangOut (20) | 65.12%
4 | 333 | #BangOut (87), #Sarafina (14), #AskAMan (6), #OscarPistorius (9) | 30.23%
5 | 300 | #NandosDMgathering (94), #ANCGPManifestoLaunch (35), #OscarPistorius (16), #YouthDay (5) | 30.23%
6 | 288 | #FreshAT5 (96), #KentPhonikFridays (10), #BangOut (4), #OscarPistorius (4) | 13.95%
7 | 273 | #BangOut (81), #OscarPistorius (14), #GenNext2016 (11), #AskAMan (4) | 23.26%
8 | 147 | #NandosDMgathering (25), #JobSeekersWednesday (22), #ANCGPManifestoLaunch (13), #OscarPistorius (3) | 30.23%
9 | 126 | #BangOut (46), Shoprite (3), #OscarPistorius (2), Soweto (2) | 20.93%
10 | 124 | #OscarPistorius (130), #NandosDMgathering (21), #ANCGPManifestoLaunch (3), #FeesMustFall (2) | 13.95%
11 | 114 | #OscarPistorius (26), #ANCGPManifestoLaunch (23), #NandosDMgathering (17), #AskAMan (15) | 23.26%
12 | 102 | #OscarPistorius (46), Ertugral (6), Soweto (4), #YouthDay (1) | 13.95%

5 Conclusions and Future Work

We presented social attributes that help identify patterns that measure citizen participation in public debates in South Africa. South Africa is highly multilingual, hence we chose the use of language as one attribute that can indicate participation in online public discussions. We also considered user retweeting behaviour and how topics relate to online communities. This exploratory study provides a first step in Twitter analysis of South African online data. This paper considers only a snapshot of the South African Twitter data; in future, we aim to consider the temporal aspects of the graph.

References

Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how different social media sources? In Proceedings of the 6th International Joint Conference on Natural Language Processing.

Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco. 2013. Cascade-based community detection. In Sixth ACM International Conference on Web Search and Data Mining, pages 33–42.

Javier Borge-Holthoefer, Walid Magdy, Kareem Darwish, and Ingmar Weber. 2015. Content and network dynamics behind Egyptian political polarization on Twitter. In ACM Conference on Computer Supported Cooperative Work and Social Computing, (18):700–711.

Ralf Brown. 2013. Selecting and weighting n-grams to identify 1100 languages. In 16th International Conference on Text, Speech and Dialogue.

Duanbing Chen, Mingsheng Shang, Zehua Lv, and Yan Fu. 2010. Detecting overlapping communities of weighted networks via a local algorithm. Physica A: Statistical Mechanics and its Applications, pages 4177–4187.
David Ediger, Karl Jiang, Jason Riedy, David A. Bader, and Courtney Corley. 2010. Massive social network analysis: Mining Twitter for social good. In 39th International Conference on Parallel Processing, pages 583–593.

Linton C. Freeman. 1978. Centrality in social networks: Conceptual clarification. Social Networks, pages 215–239.

Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic, and Wolfgang Kellerer. 2010. Outtweeting the Twitterers - predicting information cascades in microblogs. In WOSN'10: Proceedings of the 3rd Workshop on Online Social Networks, pages 3–3.

Michelle Girvan and Mark E. J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99:7821–7826.

Moises Goldszmidt, Marc Najork, and Stelios Paparizos. 2013. Boot-strapping language identifiers for short colloquial postings. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.

Steve Gregory. 2010. Finding overlapping communities in networks by label propagation. arXiv:0910.5516 [physics.soc-ph].

Nosakhere Griffin. 2015. Fees must fall and the possibility of a new African university. www.Face2FaceAfrica.com.

Lichan Hong and Gregorio Convertino. 2011. Language matters in Twitter: A large scale study. In Fifth International AAAI Conference on Weblogs and Social Media.

Mindaugas Idzelis. 2005. Jazzy: The Java open source spell checker. http://jazzy.sourceforge.net/.

Paul Jaccard. 1902. Lois de distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles, pages 67–130.

Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. 2007. Why we Twitter: Understanding microblogging usage and communities. In 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pages 56–65.

Andrea Lancichinetti and Santo Fortunato. 2009. Community detection algorithms: A comparative analysis. arXiv:0908.1062.

Andrea Lancichinetti, Filippo Radicchi, José Javier Ramasco, and Santo Fortunato. 2011. Finding statistically significant communities in networks. PLoS One, 6(4).

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. NICTA VRL.

Marco Lui, Jey Han Lau, and Timothy Baldwin. 2004. Automatic detection and language identification of multilingual documents. In Proceedings of the 2004 ACM Symposium on Applied Computing.

Martin Majlis. 2012. Yet another language identifier. In Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 46–54.

Yuexin Mao, Wei Wei, and Bing Wang. 2012. Correlating S&P 500 stocks with Twitter data. In ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, (1):69–72.

Juan Martinez-Romo and Lourdes Araujo. 2013. Detecting malicious tweets in trending topics using a statistical analysis of language. Expert Systems with Applications: An International Journal, 40:2992–3000.

Michael McCandless. 2010. Accuracy and performance of Google's compact language detector. http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html.

Marcin Mikowski. 2010. Developing an open source, rule based proofreading tool. Software: Practice and Experience, 40:543–566.

Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 29–42.

Corrado Monti, Alessandro Rozza, Giovanni Zappella, Matteo Zignani, Adam Arvidsson, and Elanor Colleoni. 2013. Modelling political disaffection from Twitter data. In International Workshop on Issues of Sentiment Discovery and Opinion Mining, (2).

Shuyo Nakatani. 2010. Language detection library (slides). http://www.slideshare.net/shuyo/language-detection-library-for-java.

Shuyo Nakatani. 2012. Short text language detection with infinity-gram. http://shuyo.wordpress.com/2012/05/17/short-text-language-detection-with-infinity-gram/.

Jaimie Y. Park and Chin-Wan Chung. 2012. When daily deal services meet Twitter: Understanding Twitter as a daily deal marketing platform. In Annual ACM Web Science Conference, (4):233–242.

Frank Scheelen. 2003. libtextcat. http://software.wise-guys.nl/libtextcat/.

Sakaki Takeshi, Okazaki Makoto, and Matsuo Yutaka. 2010. Earthquake shakes Twitter users: Real-time event detection by social sensors. In International Conference on World Wide Web, (19):851–860.
Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. 2011. SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In IEEE ICDM Workshop on DMCCI.

Jierui Xie, Stephen Kelley, and Boleslaw K. Szymanski. 2013. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys (CSUR), 45(43).

Chao Yang, Robert Chandler Harkreader, and Guofei Gu. 2011. Die free or live hard? Empirical evaluation and new design for fighting evolving Twitter spammers. In 14th International Conference on Recent Advances in Intrusion Detection, 6961:318–337.

Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. 2014. Social Media Mining. Cambridge University Press, NY, USA.