=Paper=
{{Paper
|id=Vol-1395/paper_04
|storemode=property
|title=A Topical Crawler for Uncovering Hidden Communities of Extremist Micro-Bloggers on Tumblr
|pdfUrl=https://ceur-ws.org/Vol-1395/paper_04.pdf
|volume=Vol-1395
|dblpUrl=https://dblp.org/rec/conf/msm/0001S15
}}
==A Topical Crawler for Uncovering Hidden Communities of Extremist Micro-Bloggers on Tumblr==
<pdf width="1500px">https://ceur-ws.org/Vol-1395/paper_04.pdf</pdf>
<pre>
   A Topical Crawler for Uncovering Hidden Communities of
             Extremist Micro-Bloggers on Tumblr

                              Swati Agarwal                                         Ashish Sureka
           Indraprastha Institute of Information Technology,               Software Analytics Research Lab
                          Delhi (IIIT-D), India                                     (SARL), India
                           swatia@iiitd.ac.in                                     ashish@iiitd.ac.in


 ABSTRACT
 Research shows that microblogging websites such as Tumblr are being misused
 as a platform to disseminate hate and extremism. We formulate the problem of
 locating such extremist communities as a graph search problem. We propose a
 topical crawler based approach that performs several tasks: searching for a
 blogger, computing its similarity against exemplary documents, filtering hate
 promoting bloggers, navigating through links to other bloggers and managing a
 queue of such bloggers for social network analysis. We conduct experiments on
 a real-world dataset and examine the effectiveness of the 'like' and 'reblog'
 features as links between bloggers. Experimental results demonstrate that the
 proposed solution approach is effective, with an F-score of 0.80.

 Keywords
 Mining User Generated Content, Online Radicalization, Social Media Analytics

 1.   PROBLEM DEFINITION & SOLUTION
 Tumblr is a popular and widely-used micro-blogging website. Previous research
 shows that such websites are used as a platform for disseminating hate and
 extremism (due to the low barrier to publication and anonymity)
 [1][2][3][4][5]. Automatic identification of hate and extremism promoting
 posts and bloggers is an important problem (from the perspective of website
 moderators and law enforcement agencies) and a technically challenging one.
 The large volume of data on Tumblr, free-form text and noisy content make
 automated analysis technically challenging [1][2][3][4][5]. Our aim is to
 investigate the application of a topical crawling based algorithm for
 retrieving hate promoting bloggers on Tumblr. Our objective is to examine the
 effectiveness of a random-walk based approach in social network graph
 traversal. Furthermore, our goal is to examine the effectiveness of
 re-blogging and liking a post as links between two bloggers, and to conduct
 experiments on a large real-world dataset to demonstrate the effectiveness of
 our approach.

 In a graph traversal, a topical crawler returns the nodes relevant to a
 specific topic. To define the relevance of a node, it learns the
 characteristics and features of the given topic and computes the extent of
 similarity against a set of exemplary documents. To collect training
 examples, we perform an iterative search on Tumblr using keyword based
 flagging, where a keyword is a search tag; for example, jihad, anti-Islam and
 hate. We perform a case study on Jihad and, by manual search on Tumblr posts,
 collect several relevant tags that are commonly used by extremist bloggers.
 We use these tags to initiate our process and collect all textual posts
 (avoiding picture, audio, video and URLs), tags (associated with the
 resultant posts) and linked bloggers (those who reblogged or liked a post)
 with no redundancy. We perform a manual inspection of the resultant posts and
 the posts made by linked bloggers to separate relevant (hate promoting)
 results from unknown ones. We further extract more posts and linked bloggers
 from related tags and run this framework recursively to collect our exemplary
 documents (400 hate promoting posts). These training examples contain the
 body and caption of only positive class (hate and extremism promoting
 content) posts, which are used to train the model.

 Figure 1: Proposed Architecture for Extremist Community Detection

 Figure 1 illustrates the design and architecture of the topical crawler used
 to locate extremist communities. As shown in Figure 1, our proposed solution
 framework is an iterative multi-step process primarily consisting of five
 phases: feature (post) extraction, data pre-processing, classification,
 frontier extraction and graph traversal. In phase 1, we initiate our process
 using a positive class (hate promoting) blogger U_i, called the 'seed'. We
 use the Tumblr API¹ to fetch the URLs of n textual posts and, using the Jsoup
 Java library², extract the content and caption of these posts (used as
 contextual metadata). These posts can be either re-blogged from other users
 or originally posted by the user U_i. The posts are written in multiple
 languages. Therefore, in phase 2, we perform data pre-processing and separate
 English and non-English posts using a language detection library³. We then
 remove English stopwords from these posts. In phase 3, we build a statistical
 model from the exemplary documents, which are collected separately by a
 semi-automatic process. To compute the relevance of each blogger, we use a
 character level n-gram language modeling approach. We find the extent of
 similarity between the metadata and the exemplary documents using the
 LingPipe API⁴, applying joint probability-based classification of character
 sequences. We implement a one class classifier and filter extremism promoting
 bloggers from unknown bloggers. In phase 4, we extract the notes associated
 with the posts (collected in phase 1) of relevant bloggers. These notes
 contain the list of bloggers who liked and re-blogged a particular post. The
 number of notes represents the popularity of a post and indicates a similar
 interest between the original poster and the other bloggers in the list, who
 may or may not be direct followers of each other. We use notes to extract the
 frontier nodes of a blogger for two reasons: 1) due to its privacy policies,
 the Tumblr API does not allow developers to extract the followers and
 followed blogs of Tumblr users; 2) Tumblr lets bloggers track any number of
 tags so that whenever a new post containing any of these tags is published
 publicly on Tumblr, it automatically appears in a menu on the user's
 dashboard. They can spread that post among their followers by re-blogging it.
 Tracked tags allow bloggers to form a virtual community without following
 each other. For each frontier extracted in phase 4, we compute the relevance
 score against the exemplary documents and discard unknown bloggers. In phase
 5, we manage a queue of relevant bloggers and perform a directed graph
 traversal using a random walk algorithm. To expand our graph, we select the
 next blogger uniformly at random and extract its frontiers. We execute our
 focused crawler for each frontier without revisiting a blogger. This
 traversal results in a connected graph, where nodes represent (hate
 promoting) bloggers and edges represent the links (re-blog and like) between
 two bloggers. We perform social network analysis on the resultant graph and
 locate extreme right communities of hate promoting bloggers.

 2.   RESULTS & CONCLUSION
 We execute our topical crawler for a given seed blogger and traverse the
 Tumblr network using the random walk algorithm. For every new blogger, we
 compute its relevance and classify it as hate promoting or unknown using the
 one class classifier. To examine the effectiveness of our classifier, we
 compute its accuracy using standard information retrieval techniques. In one
 execution of our topical crawler, we were able to collect 600 bloggers. We
 recruited 30 graduate students from different departments as volunteers to
 label these bloggers as hate promoting or unknown according to their
 published posts and the given annotation guidelines. To avoid bias and to
 collect correctly annotated results, we perform a horizontal and vertical
 partition on the nodes and arrange these 600 bloggers into a 2D matrix whose
 rows correspond to the annotators, grouped into 10 sets of 3 members each,
 and whose columns correspond to the bloggers assigned to each member for
 annotation, i.e. 60. We use a majority voting approach for the final
 annotation: the class of a blogger is the one voted for by at least two
 annotators. Based upon the validation results, we evaluate the accuracy of
 our model. Table 1(a) shows the confusion matrix for one class
 classification. It reveals that our model predicts 382 (290+92) bloggers as
 hate promoting and 218 (173+45) bloggers as unknown, and that there is a
 misclassification of 13% and 34% in predicting hate promoting and unknown
 bloggers, respectively. Table 1(b) shows the accuracy results of our
 classifier. The results show that the precision, recall and F-score are
 reasonably high and that we are able to predict hate promoting bloggers with
 an accuracy of 77%. Our experimental analysis reveals that re-blogging is a
 good indicator of a connection between two bloggers. We locate users who are
 central and influential and play a major role in the discovered communities.
 We perform independent social network analysis on the like and re-blog links
 among bloggers and conclude that re-blogging is a discriminatory feature for
 identifying the communities of extremist bloggers sharing a common agenda.

 Table 1: Confusion Matrix and Accuracy Results for One Class Classifier

 (a) Confusion Matrix
                          Predicted
                      Positive   Unknown
   Actual  Positive      290        45
           Unknown        92       173

 (b) Accuracy Results
   Precision   Recall   F-Score   Accuracy
     0.75       0.86     0.80       0.77

 3.   REFERENCES
 [1] S. Agarwal and A. Sureka. Using KNN and SVM based one-class classifier
     for detecting online radicalization on Twitter. In Distributed Computing
     and Internet Technology (ICDCIT), pages 431–442, 2015.
 [2] E. A. Cano Basave, Y. He, K. Liu, and J. Zhao. A weakly supervised
     Bayesian model for violence detection in social media. In Sixth
     International Joint Conference on Natural Language Processing, pages
     109–117, 2013.
 [3] S. Kumar, F. Morstatter, R. Zafarani, and H. Liu. Whom should I follow?:
     Identifying relevant users during crises. In ACM Conference on Hypertext
     and Social Media (HT), pages 139–147, 2013.
 [4] A. Sureka and S. Agarwal. Learning to classify hate and extremism
     promoting tweets. In Joint Conference in Intelligence Security
     Informatics (JISIC), pages 320–320. IEEE, 2014.
 [5] J. Xu, T.-C. Lu, R. Compton, and D. Allen. Civil unrest prediction: A
     Tumblr-based exploration. In Social Computing, Behavioral-Cultural
     Modeling and Prediction, pages 403–411. Springer, 2014.

 ¹ https://www.tumblr.com/docs/en/api/v2
 ² http://jsoup.org/apidocs/
 ³ https://code.google.com/p/language-detection/
 ⁴ http://alias-i.com/lingpipe/index.html

 Copyright © 2015 held by author(s)/owner(s); copying permitted only for
 private and academic purposes.
 Published as part of the #Microposts2015 Workshop proceedings, available
 online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395)
 #Microposts2015, May 18th, 2015, Florence, Italy.
 · #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015
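
 Phase 3 in Section 1 scores a blogger's text against the exemplary documents
 with a character level n-gram language model and keeps only bloggers that
 look sufficiently similar. The sketch below illustrates that idea in Python
 under stated assumptions; it is not the LingPipe implementation used in the
 paper. The class CharNGramModel, the function is_relevant, the add-one
 smoothing, the trigram order and the threshold are all hypothetical choices
 made for illustration.

```python
import math
from collections import Counter


class CharNGramModel:
    """Character n-gram language model with add-one smoothing.

    Illustrative stand-in for a one class scorer; names and smoothing
    are assumptions, not the paper's LingPipe configuration.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-character contexts
        self.alphabet = set()            # characters seen during training

    def _ngrams(self, text):
        # Left-pad so that every character gets a full-length context.
        padded = " " * (self.n - 1) + text.lower()
        for i in range(len(padded) - self.n + 1):
            yield padded[i:i + self.n]

    def train(self, documents):
        """Accumulate n-gram statistics from positive class documents only."""
        for doc in documents:
            self.alphabet.update(doc.lower())
            for gram in self._ngrams(doc):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_log2_prob(self, text):
        """Average per-character log2 probability of text under the model."""
        grams = list(self._ngrams(text))
        if not grams:
            return float("-inf")
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for gram in grams:
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab
            total += math.log2(numerator / denominator)
        return total / len(grams)


def is_relevant(model, text, threshold):
    """One class decision: relevant if the score reaches the threshold."""
    return model.avg_log2_prob(text) >= threshold
```

 With a model trained on the exemplary documents, a call such as
 is_relevant(model, blogger_text, threshold) would label a blogger as
 relevant or unknown. The threshold would have to be tuned on held-out data;
 the paper does not report the value it used.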


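
 Phases 4 and 5 in Section 1 describe a queue of relevant bloggers that is
 expanded by picking the next blogger uniformly at random, extracting its
 frontier from post notes, classifying each new frontier node and never
 revisiting a blogger. A minimal Python sketch of that loop follows. The
 callables get_frontier and is_relevant are hypothetical placeholders for the
 notes extraction and the one class classifier, and max_nodes is an assumed
 stopping criterion; no real Tumblr API call is shown.

```python
import random


def random_walk_crawl(seed, get_frontier, is_relevant, max_nodes, rng=None):
    """Grow a graph of relevant bloggers from a seed.

    get_frontier(blogger) -> iterable of linked bloggers (hypothetical).
    is_relevant(blogger)  -> bool, the classifier decision (hypothetical).
    Expansion stops once at least max_nodes relevant bloggers are known
    or the queue is empty.
    """
    rng = rng or random.Random()
    relevant = {seed}      # nodes of the resulting graph
    classified = {seed}    # every blogger scored so far; never re-scored
    queue = [seed]         # relevant bloggers waiting to be expanded
    edges = set()          # directed (source, target) links

    while queue and len(relevant) < max_nodes:
        # Select the next blogger uniformly at random from the queue.
        current = queue.pop(rng.randrange(len(queue)))
        for neighbour in get_frontier(current):
            if neighbour not in classified:
                classified.add(neighbour)
                if is_relevant(neighbour):
                    relevant.add(neighbour)
                    queue.append(neighbour)
            # Keep only links between two relevant bloggers.
            if neighbour in relevant and neighbour != current:
                edges.add((current, neighbour))
    return relevant, edges
```

 Unknown bloggers are recorded as classified but never queued, so their own
 links are not followed; this mirrors the statement that unknown bloggers
 are discarded before traversal continues.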

</pre>
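
The figures in Table 1(b) and the misclassification rates quoted in Section 2 follow from the confusion matrix in Table 1(a). The short Python sketch below recomputes them from the four published counts; the function name one_class_metrics is an illustrative choice. The unrounded values are approximately 0.759, 0.866, 0.809 and 0.772, so the reported 0.75, 0.86, 0.80 and 0.77 match when truncated to two decimals.

```python
def one_class_metrics(tp, fn, fp, tn):
    """Return precision, recall, F-score and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f_score, accuracy


# Counts from Table 1(a): rows are actual classes, columns are predictions.
TP, FN = 290, 45   # actual positive: predicted positive / predicted unknown
FP, TN = 92, 173   # actual unknown:  predicted positive / predicted unknown

precision, recall, f_score, accuracy = one_class_metrics(TP, FN, FP, TN)

# Share of each actual class that was misclassified.
positive_miss_rate = FN / (TP + FN)   # hate promoting predicted as unknown
unknown_miss_rate = FP / (FP + TN)    # unknown predicted as hate promoting

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f_score={f_score:.3f} accuracy={accuracy:.3f}")
print(f"misclassified: positive={positive_miss_rate:.1%} "
      f"unknown={unknown_miss_rate:.1%}")
```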