Conversation Level Constraints on Pedophile Detection in Chat Rooms
Notebook for PAN at CLEF 2012

Claudia Peersman, Frederik Vaassen, Vincent Van Asch, Walter Daelemans
Computational Linguistics and Psycholinguistics Research Center, Antwerp University
{claudia.peersman, frederik.vaassen, vincent.vanasch, walter.daelemans}@ua.ac.be

Abstract. In this paper we present a new approach for detecting online pedophiles in chat rooms that combines predictions on the level of the individual post, the level of the user and the level of the entire conversation, and we describe the results of this three-stage system in the PAN 2012 competition. We also describe a resampling and a filtering strategy to circumvent issues caused by the unbalanced dataset. Finally, we describe the creation of a dictionary of words and expressions relating to predators' grooming stages, which we used to identify which posts in the predators' conversations were most distinctive of their grooming behavior.

1 Introduction

Between 2009 and 2011, the EU Kids Online project (http://www2.lse.ac.uk/media@lse/research/EUKidsOnline/Home.aspx, last accessed on August 9th, 2012) organized a survey of nationally representative samples of children between the ages of 9 and 16 regarding their Internet use, in 25 different EU member states. This study not only showed that going online is deeply embedded in children's lives (9 to 16 year olds spend 88 minutes a day online on average, and 49% of these adolescents go online in their bedroom), it also found that 34% of children had added people they had never met face-to-face to their friends list, that 15% had sent strangers personal information, and that 14% had sent a picture or a video of themselves to a stranger. Moreover, the study showed that younger children usually do not possess the digital skills needed to manage the privacy settings in their user profiles or to block unwanted messages [5]. Unfortunately, it is impossible for social network moderators or law enforcement agencies to manually check the vast amount of communication online in order to tackle the risk of children being groomed by online sexual predators. We therefore turn to new, automated techniques for identifying Internet pedophiles to help create a safer Internet for children.

In this context, the author identification competition of the PAN 2012 lab created a sub-task whose primary aim was to automatically identify Internet predators among other chatters. This paper describes a three-stage approach to tackling this problem. The approach plays on the strengths of two different text classifiers and models conversation-level constraints to improve the quality of the predictions. One of the major challenges of this classification task lies in the class imbalance inherent to the problem: as in real life, the non-predator users in the dataset vastly outnumber the predators. Section 2 briefly describes the data and the preprocessing steps we performed. In Section 3, we explain how we used resampling and filtering techniques to circumvent this class imbalance. We then describe in more detail how the three-stage approach combines the strengths of a high-recall post-level classifier with those of a high-precision user-based classifier, and how the combined predictions of these two classifiers are further improved by imposing conversation-level constraints, which boosts precision significantly.
The Sexual Predator Identification sub-task also had a secondary aim: to identify which posts in the predators' conversations were most distinctive of their grooming behavior. Because there were no guidelines as to which kinds of posts were to be considered grooming, we based our approach on Lanning's [4] analysis of the different stages of the grooming process, which include collecting information about the victim, lowering the victim's inhibitions, isolating the victim from adult supervision, initiating the abuse and (possibly) attempting to meet with the victim. In Section 4, we describe the creation of a dictionary of words and expressions that refer to these stages of grooming. In Section 5, finally, we present the results of our system in the PAN 2012 competition and discuss some issues regarding the competition's evaluation measures.

2 Preprocessing of the Data

The PAN 2012 sexual predator identification training dataset consisted of 66,914 conversations involving 97,671 unique users, of which 142 were labeled as a predator (0.15%). There were 2,015 conversations in which a predator was involved, and the posts in these conversations constituted 4.52% of the total of 900,631 posts. No conversation contained more than one predator. The majority of the conversations contained two users (68%), followed by single-user conversations (19%). The maximum number of users per conversation was 30. Most users (95%) were involved in only one conversation, while one user was involved in as many as 3,868 conversations. The most prolific predator produced posts in 182 conversations, while 20% of the predators were represented by only one conversation.

Our system was developed using two different splits of the PAN training data, with each split containing a training and a validation set. During the splitting, the conversations were clustered so that no user was present in two different clusters. Distributing clusters rather than conversations ensured that no user in training also appeared in validation, which prevented overfitting on user-specific features. In addition, no user in the validation set of Split 1 is included in the validation set of Split 2. For example, for Split 1, 13.2% of the conversations ended up in the validation set. There were 29 predators in this set, which constitutes 0.2% of its users. The complementary data ended up in the training set. The statistics for the two splits are given in Table 1.

Table 1. Statistics on the clustered splits of the PAN training set.

Partition    Training                       Validation
             Conversations   Predators      Conversations   Predators
Split 1      86.83%          113 (0.14%)    13.17%          29 (0.18%)
Split 2      86.84%          110 (0.13%)    13.16%          32 (0.20%)
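The following sketch (Python) illustrates one way to construct such clustered splits. It assumes that the clusters are the connected components of the graph linking conversations that share at least one user; the text above does not spell out the exact clustering procedure, and all function and variable names are illustrative.

    # Sketch: cluster conversations so that no user occurs in two clusters,
    # then distribute whole clusters over training and validation.
    # Assumption: a cluster is a connected component of the graph linking
    # conversations that share at least one user.
    import random
    from collections import defaultdict

    def _find(parent, x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def cluster_conversations(conversations):
        """conversations: dict mapping conversation id -> set of user ids."""
        parent = {}
        for users in conversations.values():
            users = list(users)
            for u in users:
                parent.setdefault(u, u)
            for u in users[1:]:  # union all users of one conversation
                parent[_find(parent, users[0])] = _find(parent, u)
        clusters = defaultdict(list)  # component root -> conversation ids
        for conv, users in conversations.items():
            clusters[_find(parent, next(iter(users)))].append(conv)
        return list(clusters.values())

    def split_clusters(clusters, validation_fraction=0.13, seed=0):
        """Assign whole clusters to validation until ~13% of conversations."""
        rng = random.Random(seed)
        clusters = list(clusters)
        rng.shuffle(clusters)
        total = sum(len(c) for c in clusters)
        train, validation = [], []
        for cluster in clusters:
            if len(validation) < validation_fraction * total:
                validation.extend(cluster)
            else:
                train.extend(cluster)
        return train, validation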
3 Task 1: Sexual Predator Identification

Due to both the criminal character of this topic and the privacy issues involved, so far only one dataset displaying chat conversations by sexual predators is publicly available: the Perverted Justice website (PJ, http://www.perverted-justice.com, last accessed on August 11th, 2012) contains over 500 English chat conversations collected by adult volunteers who pretended to be adolescents and, as such, were approached by alleged pedophiles. However, for machine learning algorithms to be effective in identifying online sexual predators, they would need to be trained on both illegal conversations between predators and their victims and sexually oriented conversations between consenting adults [8]. Since such data are rarely made public, Pendar [8] experimented with the PJ dataset only. His kNN classification experiments based on word token n-grams achieved up to 93.4% F-score in distinguishing the predators from the pseudo-victims.

To tackle the issue of limited dataset availability, Bogdanova et al. [1-2] as well as Rashid et al. [9] and Peersman et al. [7] investigated other setups. Bogdanova et al. [1-2] experimented with new feature types such as emotional markers, emoticons and imperative sentences, and computed sex-related lexical chains spanning the PJ conversations. Using a corpus of cybersex chat logs (www.oocities.org/urgrl21f/, last accessed on August 11th, 2012) and the Naval Postgraduate School (NPS) chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm, last accessed on August 11th, 2012) as negative information for the predator class in the PJ dataset, their Naïve Bayes classifier yielded an accuracy of 92% for PJ predators vs. NPS and 94% for PJ predators vs. cybersex based on their high-level features. However, because these high-level features were partly derived from the PJ dataset itself, the experiment may have overestimated accuracy by detecting predators from the same PJ dataset. To deal with the limited availability of data, both Rashid et al. [9] and Peersman et al. [7] proposed a system that in a first step classifies each user according to age group and gender, enabling the detection of predators using a fake adolescent user profile and distinguishing conversations between an adolescent and an adult from conversations between adults or adolescents only.

As the number of non-predator users in the PAN 2012 sexual predator dataset was far greater than the number of predator users and no meta-information about the users' age and/or gender was available, in this paper we present a new approach that combines the results of predictions on the level of the individual post, the level of the user and the level of the entire conversation. Moreover, to circumvent problems resulting from the data imbalance in favor of the non-predator class, we describe both a resampling and a filtering strategy. More specifically, we trained a post-level classifier on a balanced subset (described in Section 3.1), and a user-level classifier on a filtered subset of the training data (Section 3.2). We then combined the output of these two systems and imposed conversation-level constraints that significantly improved the quality of the output (Section 3.3). Figure 1 shows a simplified schematic representation of our three-stage approach to sexual predator identification.

Figure 1. Overview of the system architecture: resampled posts feed a high-recall post classifier and filtered users feed a high-precision user classifier; their combined predictions pass through conversation-level constraints to produce the final predator ID list.

In all our experiments, we used LIBSVM [3] with the c-parameter set to 32.0 and g set to 0.0078125. We also set b to 1 in order to obtain probability outputs for each class. These parameter values were experimentally determined to provide good generalization during parameter optimization.
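For reference, this configuration corresponds to the following setup, sketched here through scikit-learn's SVC class, which wraps LIBSVM. The RBF kernel is an assumption on our part (it is LIBSVM's default and the usual kernel for which the g parameter is meaningful); only the c, g and b values are given above.

    # Sketch: the LIBSVM settings used in all experiments, expressed via
    # scikit-learn's SVC wrapper around LIBSVM.
    from sklearn.svm import SVC

    clf = SVC(
        kernel="rbf",       # assumption: LIBSVM's default kernel
        C=32.0,             # LIBSVM -c 32.0 (= 2**5)
        gamma=0.0078125,    # LIBSVM -g 0.0078125 (= 2**-7)
        probability=True,   # LIBSVM -b 1: per-class probability estimates
    )

    # After clf.fit(X_train, y_train), clf.predict_proba(X) yields the
    # probability outputs that the aggregation steps below rely on.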
3.1 Classification on the Post Level

Our first classifier was designed to distinguish posts by predators from posts by other users. We first resampled the users to create an approximately equal distribution of predator and non-predator posts in the training sets of both our splits. Because it is very unlikely that predators can be identified based on no more than a few lines in a conversation, in our first resampling step we only included data from users of whom at least ten posts were present in the training set. This resulted in 40,722 posts for the predator class. We then attempted to match this number by randomly selecting data from other users, which led to a set of 44,237 other users' posts and thus an approximately balanced distribution of post instances over both classes. We tokenized each post and represented each token (words, punctuation and emoticons) using binary token unigram vectors. We chose this type of feature because, in preliminary experiments, it performed better than more complex feature types such as the approach words, communicative desensitization words and relationship words used by Bogdanova et al. [1-2].

Applying the clustered experimental setup described in Section 2 and using token unigram features only, the SVM classifier yielded 64.6% accuracy for Split 1, with 0.14 precision, 0.77 recall and a 0.23 F-score for the predator class. For Split 2, the system achieved 76.9% accuracy, with 0.23 precision, 0.67 recall and a 0.34 F-score for the predator class. This means that our post classifier identified 11,153 out of 15,702 predator posts correctly in both our validation sets combined, but also returned 53,563 false positive predator posts.

After predicting which individual posts were possibly written by a predator, we aggregated the post-level predictions in order to identify which users in the dataset were most likely to be predators. For this we used LIBSVM's probability outputs (b = 1) and took the average over the 10 most "suspicious" posts of each user (the 10 posts with the highest probability output for the predator class) as the final predator probability for that user. Then, to determine a good threshold for labeling a user as a predator based on this value, we performed a grid search on the validation set of the first split and evaluated the resulting threshold on the validation set of the second split. The best results were achieved with a threshold of 0.85, in which case the classifier identified 56 out of a total of 60 predators correctly (28 in each validation set). However, the classifier also returned 93 false positives (55 in the first and 38 in the second set). Table 2 (see Section 3.3) shows the precision, recall and F-scores for the predator class.

As is clear from these results, our post classifier showed high recall but low precision. Therefore, we decided to create a second system on the level of the user instead of the post, in order to produce a complementary system with higher precision. In the next section we turn to our user-based classification experiments and describe the filter we used to resample the initial dataset.
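Before moving on, a minimal sketch of this aggregation step; the top-10 averaging and the 0.85 threshold follow the description above, while the function and variable names are illustrative.

    # Sketch of the user-level aggregation of Section 3.1: a user's predator
    # probability is the mean probability of their 10 most "suspicious" posts.

    def user_predator_probability(post_probs, k=10):
        """post_probs: predator-class probabilities of one user's posts."""
        top_k = sorted(post_probs, reverse=True)[:k]
        return sum(top_k) / len(top_k)

    def flag_suspicious_users(posts_by_user, threshold=0.85):
        """posts_by_user: dict mapping user id -> list of post probabilities."""
        return {user for user, probs in posts_by_user.items()
                if probs and user_predator_probability(probs) >= threshold}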
3.2 Classification on the User Level

Because the PAN 2012 dataset also contains chat room conversations from a variety of domains other than the predator conversations (e.g. gaming, programming), we decided to compile a filter that would exclude all users who did not produce any "suspicious" posts. We therefore built a filter based on a list of words and expressions that could be linked to the typical stages in a predator's grooming process (e.g. [4], [6]), which are discussed in Section 4. Our hypothesis was that by only including data from users who had produced at least one post caught by this filter, we would be able to create a classifier that focuses on exactly those elements that distinguish pedophile chat from other chat containing similar vocabulary, such as sexually oriented conversations between consenting adults and/or people making arrangements to meet up. This resulted in a resampled training set of 111 predators (98.2%) and 39,453 others (48.4%) in Split 1, and 107 predators (97.3%) and 39,500 others (46.7%) in Split 2. In our validation sets, 24.6% of the users (i.e. 7,891) were automatically labeled as non-predator by this filter. Our error analysis showed that the filter lost only six predators (five in the training sets and one in the validation sets), each of whom produced no more than two posts in the entire PAN 2012 training dataset, so we decided not to adapt the filter. Next, all posts from the same user were gathered into a single instance vector. This way, our second system directly classified users as either a predator or a non-predator, and no further aggregation steps were necessary. In both splits, the users excluded by the filter were automatically labeled as non-predator in both the training and validation sets.

As we expected, this "harder" classification experiment (harder because the data points in the filtered dataset lie closer together in the vector space) yielded a higher precision score than our aggregated post classifier, with only 7 false positives. The recall score, however, was much lower than for the aggregated post classifier, with 49 out of 60 predators identified correctly in the validation sets. The results of the user classification experiments are displayed in Table 2 (see Section 3.3). Because the user classifier had a higher precision score than the post classifier, we decided to investigate which combination of the two outputs achieved the best F-score for the predator class. It is to these experiments that we turn next.

3.3 Combining the Results with Post-processing on the Conversation Level

Starting from a post-based classifier with high recall and low precision for the predator class and a user-based classifier with high precision and low recall, we experimented with different ways of combining their outputs to create a system that plays on the strengths of our high-precision user classifier while still retaining some of the recall of the post-level classifier. One way to combine the outputs of several classifiers is to use ensemble methods, but these require an intricate experimental procedure (embedded cross-validation) to avoid overfitting on the test data. As we distributed clusters rather than conversations over our partitions, we could not split these partitions into further sub-splits without users from the training sub-splits also appearing in the validation sub-splits, thus risking overfitting on user-specific features (see Section 2). Therefore, we decided to experiment with different weightings of the classifier outputs. Concretely, to determine the combined probability of a user being a predator, we used the following formula:

P(pred) = w_u * P_u(pred) + w_p * P_p(pred)

where w_u and w_p are the weights on the probability of the user being a predator according to the user classifier (P_u) and the post classifier (P_p), respectively.
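The sketch below renders this combination together with the conversation-level post-processing step; the weights (0.73/0.27) and the 0.75 threshold are the values found by the grid searches reported in the remainder of this section, the tie-breaking rule is our reading of the description that follows, and all names are illustrative.

    # Sketch of Section 3.3: weighted combination of the two classifiers'
    # predator probabilities, followed by a conversation-level constraint.

    def combined_probability(p_user, p_post, w_user=0.73, w_post=0.27):
        """P(pred) = w_u * P_u(pred) + w_p * P_p(pred)."""
        return w_user * p_user + w_post * p_post

    def conversation_constraint(flagged, conversations, p_user, threshold=0.75):
        """If several users of one conversation are flagged as predators,
        drop those whose user-classifier probability is below the threshold
        (no conversation in the data contains more than one predator)."""
        keep = set(flagged)
        for users in conversations:  # users: participants of one conversation
            flagged_here = [u for u in users if u in keep]
            if len(flagged_here) > 1:
                for u in flagged_here:
                    if p_user[u] < threshold:
                        keep.discard(u)
        return keep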
After performing a grid search on the validation set of Split 1 and testing the weighted combinations on the Split 2 validation set, the best results were achieved when the user classifier was assigned a high weight (w_u = 0.73) and the post classifier the complementary weight (w_p = 0.27). The resulting system found 2 more predators than the user classifier alone (51 vs. 49), while still being relatively precise (10 false positives).

When performing an error analysis on the predicted predators, we discovered that in some cases both users in a conversation were labeled as a predator. We suspect that the (pseudo-)victims, when replying to a predator during a conversation, mirrored some of the predator's vocabulary, which would explain why they were incorrectly labeled. To resolve this issue, we used the predator probabilities of the user classifier to determine which of the two users in the conversation was in fact the predator. We again performed a grid search on the validation set of the first split to determine a threshold that excluded as many false positives as possible without losing any of the true predators. By applying a threshold of 0.75, we were able to further reduce the number of false positives from 10 to 3: from 6 to 2 in the validation set of Split 1 and from 4 to 1 in the validation set of Split 2. Table 2 provides an overview of the best results of the single classifiers and their best performing weighted combinations.

Table 2. Overview of the combined results on the validation sets of the single and combined classifiers for the predator class.

Results                  True       False      False      Predator   Predator  Predator
                         Positives  Negatives  Positives  Precision  Recall    F-score
Post Classifier          56         4          93         0.38       0.93      0.54
User Classifier          49         11         7          0.88       0.82      0.84
Post + User              51         9          10         0.84       0.85      0.84
After Post-processing    51         9          3          0.94       0.85      0.90

Based on the results of these experiments, we retrained our models on the entire PAN 2012 training set and performed the final classification experiment on the previously unseen PAN 2012 test set. This resulted in a list of 188 identified predators. We then applied the same post-processing strategy, which led to our final list of 170 predators. Of these predator IDs, 152 were found to be correct, resulting in 0.89 precision, 0.60 recall and a 0.72 F-score for the predator class.

4 Task 2: Identifying the Grooming Posts

The second part of the Sexual Predator Identification sub-task consisted of detecting the specific posts in the predators' chat that were most distinctive of grooming behavior. Lanning [4] argued that pedophiles groom their victims following five predictable stages: identifying a possible victim, collecting information about the intended victim, filling a need, lowering inhibitions and initiating the abuse. With regard to automatically detecting online grooming, McGhee et al. [6] were the first to incorporate this stage division into their research. Based on an expanded dictionary of terms, they applied a rule-based approach that categorized each post as belonging to the stage of gaining personal information, grooming (which includes lowering inhibitions or reframing, and sexual references), approach, or none. Their rule-based approach outperformed the machine learning algorithms they tested and reached up to 75.1% accuracy in determining whether a PJ post was predatory or not.
Because no gold standard labels were available for this task that distinguished grooming from non-grooming posts in the predators' chat conversations, we adopted a similar approach and created a dictionary-based filter that only selects posts linked to one of the following stages in the predators' grooming processes, adopted from Lanning [4] and McGhee et al. [6]:

(1) the sexual topic, which includes discussing erogenous parts of the body, mentioning and performing sexual acts, and sexually oriented adjectives and multi-word expressions;
(2) reframing, defined by McGhee et al. [6:8] as "the redefinition of sexual behaviors into non-sexual terms, such as connecting sexual acts to messing around, practicing or teaching";
(3) approach, which contains words and expressions that refer to meeting in person.

After manually analyzing part of the predator conversations in the PAN training set, we decided to add three extra stages that seemed typical of online grooming:

(4) requests for data, i.e. pictures, videos or using the webcam;
(5) isolation from adult supervision (e.g. "home alone?");
(6) age, which includes references to old(er) vs. young(er) and child-related vocabulary (e.g. "tummy", "tiny").

Although McGhee et al. [6] also mention the stage of building up a trusting or friendly relationship with the victim, we did not find this especially distinctive of pedophile grooming behavior and did not include it in our filter. We started compiling our grooming filter by adding the words from the dictionary by McGhee et al. [6] that fit into one of our predefined stage categories. We then heavily expanded each category by manually selecting synonyms and related terms from the English Urban Dictionary website (http://www.urbandictionary.com/, last accessed on June 22nd, 2012) and the English Synonyms website (http://www.synonyms.net/, last accessed on June 22nd, 2012). The complete dictionary can be obtained for research purposes by contacting the authors of this paper.

As we mentioned in Section 3.2, our filter alone could reduce the number of users by at least 24.6%, but that was by no means enough to identify the highly limited number of predators. Therefore, we only applied this filter to perform a pre-selection for our user classifier and to select the grooming posts of the users that had already been identified as predators by our best scoring three-stage classification system (see Section 3.3). Using this strategy, our filter labeled 4,717 posts in the PAN 2012 test set as belonging to one of our six main grooming stages. Of these posts, 1,688 were found to be correct by the organizers, resulting in 35.8% precision, 26.1% recall and a 30.2% F-score.
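To make the filter concrete, the sketch below implements a dictionary-based selector over the six stages. The example terms are only an illustration drawn from the examples quoted above (the actual dictionary is far larger and available from the authors on request), and the word-boundary matching strategy is our assumption.

    # Sketch of the dictionary-based grooming filter of Section 4: a post is
    # selected if it contains a term from at least one of the six stages.
    import re

    GROOMING_STAGES = {
        "sexual_topic": ["sexy"],                               # stage (1), illustrative
        "reframing":    ["mess around", "practice", "teach"],   # stage (2)
        "approach":     ["meet up", "come over"],               # stage (3), illustrative
        "data_request": ["pic", "video", "webcam"],             # stage (4)
        "isolation":    ["home alone"],                         # stage (5)
        "age":          ["younger", "older", "tummy", "tiny"],  # stage (6)
    }

    def stages_for_post(post):
        """Return the set of grooming stages whose terms occur in the post."""
        text = post.lower()
        hits = set()
        for stage, terms in GROOMING_STAGES.items():
            if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms):
                hits.add(stage)
        return hits

    def select_grooming_posts(posts):
        """Keep only posts that match at least one grooming stage."""
        labeled = ((post, stages_for_post(post)) for post in posts)
        return [(post, stages) for post, stages in labeled if stages]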
5 Discussion

In this paper we proposed a new approach to detecting Internet predators grooming their victims in chat rooms. Our experiments showed that a weighted combination of a high-recall post classifier and a high-precision user classifier achieved better results for the predator class than either system separately. Moreover, imposing conversation-level constraints boosted precision significantly, resulting in a final F-score of 0.90 for the predator class in cross-validation on the PAN 2012 training set. Interestingly, the dictionary-based filter not only reduced the number of possibly suspicious users by 24.6% to 53.3%; it also enabled us to create a user classifier that focuses on exactly those elements that distinguish pedophile chat from other chat containing similar vocabulary. This filter also proved very effective in detecting which of the posts in the identified predators' chat conversations were most distinctive of grooming behavior.

To calculate the final F-score, the competition's organizers set the β-factor to 0.5 when detecting predators, emphasizing precision in order to minimize the time a police agent needs to check the output; when detecting predators' grooming posts they set the β-factor to 3.0, emphasizing recall in order to collect as much evidence as possible. (Since Fβ = (1 + β²) · P · R / (β² · P + R), β = 0.5 weights precision four times as heavily as recall, while β = 3.0 weights recall nine times as heavily as precision.) In our opinion, it would be more important to heavily reduce the number of possibly suspicious users that need to be checked manually while still retaining all of the actual predators. In that scenario, our post classifier produced the best results: with a recall score of 0.93, it reduced the number of possibly suspicious users in our validation sets from over 32,000 to 149 while losing only 4 predators. Likewise, when manually checking a suspicious user's communications, a police officer or a moderator needs swift access to the most striking posts in order to quickly discard remaining false positives, which makes precision highly important there. Therefore, in future research we will work on a system that combines a high-recall classifier with a grooming scoring system that ranks the remaining suspicious users according to the presence of the grooming stages in their conversations. By attributing higher weights to stages that signal imminent meetings and abuse, such a scoring system will also enable both law enforcement agencies and moderators to take action more quickly.

References

[1] Bogdanova, D., Rosso, P. and Solorio, T. Modelling Fixated Discourse in Chats with Cyberpedophiles. In: Proceedings of the EACL 2012 Workshop on Computational Approaches to Deception Detection. Avignon, France. p.86-90 (2012a)
[2] Bogdanova, D., Rosso, P. and Solorio, T. On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators. In: Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis. Jeju, Republic of Korea. p.110-118 (2012b)
[3] Chang, C. and Lin, C. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27 (2011)
[4] Lanning, K. Child Molesters: A Behavioral Analysis. For Professionals Investigating the Sexual Exploitation of Children. National Center for Missing & Exploited Children, USA. http://www.missingkids.com/en_US/publications/NC70.pdf (2010)
[5] Livingstone, S., Haddon, L., Görzig, A. and Ólafsson, K. EU Kids Online final report (2011). http://www2.lse.ac.uk/media@lse/research/EUKidsOnline/EU%20Kids%20II%20(2009-11)/EUKidsOnlineIIReports/Final%20report.pdf
[6] McGhee, I., Bayzick, J., Kontostathis, A., Edwards, L., McBride, A. and Jakubowski, E. Learning to Identify Internet Sexual Predation. International Journal of Electronic Commerce, 15:3, pp. 103 (2011)
[7] Peersman, C., Daelemans, W. and Van Vaerenbergh, L. Predicting age and gender in online social networks. In: Proceedings of the 3rd Workshop on Search and Mining User-Generated Contents. Glasgow, UK. http://www.cpl.ua.ac.be/sites/default/files/smuc1504-peersman.pdf (2011)
[8] Pendar, N. Toward Spotting the Pedophile: Telling Victim from Predator in Text Chats.
In: Proceedings of the First IEEE International Conference on Semantic Computing. California, USA. p.235-241 (2007)
[9] Rashid, A., Rayson, P., Greenwood, P., Walkerdine, J., Duquenoy, P., Watson, P., Brennan, M. and Jones, M. Isis: Protecting Children in Online Social Networks. In: The International Conference on Advances in the Analysis of Online Paedophile Activity. Paris, France. http://eprints.mdx.ac.uk/4738/1/Isis_overview_v3.pdf (2009)