=Paper=
{{Paper
|id=Vol-1395/paper_04
|storemode=property
|title=A Topical Crawler for Uncovering Hidden Communities of Extremist Micro-Bloggers on Tumblr
|pdfUrl=https://ceur-ws.org/Vol-1395/paper_04.pdf
|volume=Vol-1395
|dblpUrl=https://dblp.org/rec/conf/msm/0001S15
}}
==A Topical Crawler for Uncovering Hidden Communities of Extremist Micro-Bloggers on Tumblr==
Swati Agarwal, Indraprastha Institute of Information Technology, Delhi (IIIT-D), India, swatia@iiitd.ac.in
Ashish Sureka, Software Analytics Research Lab (SARL), India, ashish@iiitd.ac.in
ABSTRACT
Research shows that microblogging websites such as Tumblr are being misused as a platform to disseminate hate and extremism. We formulate the problem of locating such extremist communities as a graph search problem. We propose a topical crawler based approach that performs several tasks: searching for a blogger, computing its similarity against exemplary documents, filtering hate promoting bloggers, navigating through links to other bloggers and managing a queue of such bloggers for social network analysis. We conduct experiments on a real world dataset and examine the effectiveness of the 'like' and 'reblog' features as links between bloggers. Experimental results demonstrate that the proposed solution approach is effective, with an F-score of 0.80.
Figure 1: Proposed Architecture for Extremist
Community Detection
Keywords
Mining User Generated Content, Online Radicalization, Social Media Analytics
1. PROBLEM DEFINITION & SOLUTION
Tumblr is a popular and widely used micro-blogging website. Previous research shows that such websites are used as a platform for disseminating hate and extremism, owing to the low barrier to publication and anonymity [1][2][3][4][5]. Automatic identification of hate and extremism promoting posts and bloggers is important from the perspective of website moderators and law enforcement agencies, and it is technically challenging: the large volume of data on Tumblr, free-form text and noisy content make automated analysis difficult [1][2][3][4][5]. Our aim is to investigate the application of a topical crawling based algorithm for retrieving hate promoting bloggers on Tumblr. Our objective is to examine the effectiveness of a random-walk based approach in social network graph traversal. Furthermore, our goal is to examine the effectiveness of re-blogging and liking a post as links between two bloggers, and to conduct experiments on a large real world dataset to demonstrate the effectiveness of our approach.
In a graph traversal, a topical crawler returns the nodes that are relevant to a specific topic. To define the relevance of a node, it learns the characteristics and features of the given topic and computes the extent of similarity against a set of exemplary documents. To collect training examples, we perform an iterative search on Tumblr using keyword based flagging, where a keyword is a search tag; for example, jihad, anti-Islam and hate. We perform a case study on Jihad, and by manual search on Tumblr posts we collect several relevant tags that are commonly used by extremist bloggers. We use these tags to initiate our process and collect all textual posts (avoiding picture, audio, video and URLs), tags (associated with the resultant posts) and linked bloggers (those who reblogged or liked a post), with no redundancy. We manually inspect the resultant posts and the posts made by linked bloggers to separate relevant (hate promoting) results from unknown ones. We then extract more posts and linked bloggers from related tags and run this framework recursively to collect our exemplary documents (400 hate promoting posts). These training examples contain the body and caption of only positive class (hate and extremism promoting content) posts, which are used to train the model.
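The iterative, tag-driven collection of training examples described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory `POSTS` list stands in for Tumblr tag search results, `is_relevant` stands in for the manual inspection step, and `search_by_tag`, `collect_examples` and `max_rounds` are hypothetical names that do not come from the paper.

```python
from collections import deque

# Hypothetical in-memory stand-in for Tumblr search results: each post has
# text, tags, and the bloggers who liked or reblogged it.
POSTS = [
    {"id": 1, "text": "post a", "tags": {"tag1", "tag2"}, "linked": {"u1", "u2"}},
    {"id": 2, "text": "post b", "tags": {"tag2", "tag3"}, "linked": {"u2"}},
    {"id": 3, "text": "post c", "tags": {"tag4"}, "linked": {"u3"}},
]

def search_by_tag(tag):
    """Return posts carrying the tag (placeholder for a real tag search)."""
    return [p for p in POSTS if tag in p["tags"]]

def collect_examples(seed_tags, is_relevant, max_rounds=3):
    """Expand from seed tags to related tags, keeping each post, tag and
    linked blogger only once (no redundancy)."""
    seen_tags, posts, bloggers = set(seed_tags), {}, set()
    frontier = deque((t, 0) for t in seed_tags)
    while frontier:
        tag, depth = frontier.popleft()
        for post in search_by_tag(tag):
            # Skip posts already collected or judged not relevant.
            if post["id"] in posts or not is_relevant(post):
                continue
            posts[post["id"]] = post
            bloggers |= post["linked"]
            # Follow tags found on relevant posts in the next round.
            if depth + 1 < max_rounds:
                for t in post["tags"] - seen_tags:
                    seen_tags.add(t)
                    frontier.append((t, depth + 1))
    return posts, seen_tags, bloggers

# Assumption for the demo: every post passes the (manual) relevance check.
posts, tags, bloggers = collect_examples(["tag1"], is_relevant=lambda p: True)
```

Starting from `tag1`, the sketch reaches the second post through the shared tag `tag2` and never touches the unrelated third post.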
Figure 1 illustrates the design and architecture of the topical crawler used to locate extremist communities. As shown in Figure 1, our proposed solution framework is an iterative multi-step process consisting primarily of five phases: feature (post) extraction, data pre-processing, classification, frontier extraction and graph traversal.
In phase 1, we initiate our process using a positive class (hate promoting) blogger Ui, called the 'seed'. We use the Tumblr API (footnote 1) to fetch the URLs of n textual posts, and using the Jsoup Java library (footnote 2) we extract the content and caption of these posts (used as contextual metadata). These posts can be either re-blogged from other users or originally posted by the user Ui.
These posts are written in multiple languages. Therefore, in phase 2, we perform data pre-processing and separate English from non-English posts using a language detection library (footnote 3). We then remove English stopwords from these posts.
In phase 3, we build a statistical model from the exemplary documents collected separately by a semi-automatic process.
Copyright © 2015 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2015 Workshop proceedings, available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395). #Microposts2015, May 18th, 2015, Florence, Italy.
· #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015
Table 1: Confusion Matrix and Accuracy Results for One Class Classifier
(a) Confusion Matrix
                   Predicted Positive   Predicted Unknown
Actual Positive    290                  45
Actual Unknown     92                   173
(b) Accuracy Results
Precision   Recall   F-Score   Accuracy
0.75        0.86     0.80      0.77
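The figures in Table 1(b) follow from the counts in Table 1(a) under the standard definitions of precision, recall, F-score and accuracy. A minimal sketch of that arithmetic, using only the four counts from the table (the variable names are ours):

```python
# Counts taken from Table 1(a); variable names are illustrative.
tp, fn = 290, 45   # actual positive: predicted positive / predicted unknown
fp, tn = 92, 173   # actual unknown:  predicted positive / predicted unknown

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

# Share of each actual class that was misclassified.
miss_positive = fn / (tp + fn)
miss_unknown = fp / (fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f_score={f_score:.3f} accuracy={accuracy:.3f}")
print(f"misclassified positive={miss_positive:.1%} unknown={miss_unknown:.1%}")
```

The computed values are about 0.759, 0.866, 0.809 and 0.772, which match the reported 0.75, 0.86, 0.80 and 0.77 when truncated to two decimals (rounding would give 0.76, 0.87, 0.81 and 0.77). The two misclassification shares are about 13.4% and 34.7%, in line with the 13% and 34% quoted in the text.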
To compute the relevance of each blogger, we use a character-level n-gram language modeling approach. We measure the extent of similarity between the metadata and the exemplary documents using the LingPipe API (footnote 4), applying joint probability-based classification of character sequences. We implement a one class classifier and separate extremism promoting bloggers from unknown bloggers.
In phase 4, we extract the notes associated with the posts (collected in phase 1) of relevant bloggers. These notes contain the list of bloggers who liked and re-blogged a particular post. The number of notes represents the popularity of a post and indicates a similar interest between the original poster and the other bloggers in the list, who may or may not be direct followers of each other. We use notes to extract the frontier nodes of a blogger for two reasons: 1) due to its privacy policies, the Tumblr API does not allow developers to extract the followers and following blogs of Tumblr users; 2) Tumblr lets bloggers track any number of tags, so that whenever a new post containing any of these tags is published publicly on Tumblr, it automatically appears in a menu on the user's dashboard. They can spread that post among their followers by re-blogging it. Tracked tags allow bloggers to form a virtual community without following each other. For each frontier extracted in phase 4, we compute the relevance score against the exemplary documents and discard unknown bloggers.
In phase 5, we manage a queue of relevant bloggers and perform a directed graph traversal using a random walk algorithm. To expand our graph, we select the next blogger from a uniform distribution and extract its frontiers. We execute our focused crawler for each frontier without revisiting a blogger. This traversal results in a connected graph, where each node represents a (hate promoting) blogger and each edge represents a link (re-blog or like) between two bloggers. We perform social network analysis on the resultant graph and locate extreme right communities of hate promoting bloggers.

2. RESULTS & CONCLUSION
We execute our topical crawler for a given seed blogger and traverse the Tumblr network using the random walk algorithm. For every new blogger, we compute its relevance and classify it as hate promoting or unknown using the one class classifier. To examine the effectiveness of our classifier, we compute its accuracy using standard information retrieval techniques. In one execution of our topical crawler, we were able to collect 600 bloggers. We hired 30 graduate students as volunteers from different departments to label these bloggers as hate promoting or unknown according to their published posts and the given annotation guidelines. To avoid bias and to collect correctly annotated results, we perform a horizontal and vertical partition on the nodes and arrange these 600 bloggers into a 2D matrix whose rows are the annotators, grouped in 10 sets of 3 members each, and whose columns are the bloggers assigned to each member for annotation, i.e. 60. We use a majority voting approach for the final annotation: the class of a blogger is the one voted for by at least two annotators. Based upon the validation results, we evaluate the accuracy of our model. Table 1(a) shows the confusion matrix for one class classification. It reveals that our model predicts 382 (290+92) bloggers as hate promoting and 218 (173+45) bloggers as unknown, with a misclassification of 13% and 34% in predicting hate promoting and unknown bloggers, respectively. Table 1(b) shows the accuracy results of our classifier. The results show that the precision (0.75), recall (0.86) and F-score (0.80) are reasonably high, and we are able to predict hate promoting bloggers with an accuracy of 77%. Our experimental analysis reveals that re-blogging is a good indicator of connection between two bloggers. We locate users who are central and influential and play a major role in the discovered communities. We perform independent social network analysis on the like and re-blog links among bloggers and conclude that re-blogging is a discriminatory feature for identifying communities of extremist bloggers sharing a common agenda.

3. REFERENCES
[1] S. Agarwal and A. Sureka. Using KNN and SVM based one-class classifier for detecting online radicalization on Twitter. In Distributed Computing and Internet Technology (ICDCIT), pages 431–442, 2015.
[2] E. A. Cano Basave, Y. He, K. Liu, and J. Zhao. A weakly supervised Bayesian model for violence detection in social media. In Sixth International Joint Conference on Natural Language Processing, pages 109–117, 2013.
[3] S. Kumar, F. Morstatter, R. Zafarani, and H. Liu. Whom should I follow?: Identifying relevant users during crises. In ACM Conference on Hypertext and Social Media (HT), pages 139–147, 2013.
[4] A. Sureka and S. Agarwal. Learning to classify hate and extremism promoting tweets. In Joint Conference in Intelligence Security Informatics (JISIC), pages 320–320. IEEE, 2014.
[5] J. Xu, T.-C. Lu, R. Compton, and D. Allen. Civil unrest prediction: A Tumblr-based exploration. In Social Computing, Behavioral-Cultural Modeling and Prediction, pages 403–411. Springer, 2014.

1 https://www.tumblr.com/docs/en/api/v2
2 http://jsoup.org/apidocs/
3 https://code.google.com/p/language-detection/
4 http://alias-i.com/lingpipe/index.html
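
As a supplementary illustration of the phase 5 traversal described in Section 1, the queue-based random walk could look roughly like the sketch below. It is not the authors' implementation: the toy `NOTES` graph, the `is_relevant` predicate (standing in for the phase 3 one class classifier), the edge orientation and the function names are all assumptions made for the example.

```python
import random

# Hypothetical toy data: blogger -> bloggers who liked or re-blogged its posts.
NOTES = {
    "seed": ["a", "b", "x"],
    "a": ["b", "c"],
    "b": ["seed", "c"],
    "c": [],
    "x": ["y"],
    "y": [],
}

def is_relevant(blogger):
    """Placeholder for the one class classifier; here 'x' and 'y' are unknown."""
    return blogger not in {"x", "y"}

def crawl(seed, rng):
    """Random-walk style crawl over a queue of relevant, unvisited bloggers."""
    queue, visited, edges = [seed], {seed}, []
    while queue:
        # Pick the next blogger uniformly at random from the queue.
        current = queue.pop(rng.randrange(len(queue)))
        for frontier in NOTES.get(current, []):
            if not is_relevant(frontier):
                continue  # discard unknown bloggers
            # Assumed orientation: from the expanded blogger to its frontier.
            edges.append((current, frontier))
            if frontier not in visited:
                visited.add(frontier)   # never enqueue a blogger twice
                queue.append(frontier)
    return visited, edges

nodes, edges = crawl("seed", random.Random(0))
```

Because each blogger is enqueued at most once, the set of nodes and edges is the same for any random seed; only the order of expansion changes. The resulting edge list is what a social network analysis step would consume.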