A Two-step Approach for Effective Detection of Misbehaving Users in Chats⋆
Notebook for PAN at CLEF 2012

Esaú Villatoro-Tello2, Antonio Juárez-González1, Hugo Jair Escalante1, Manuel Montes-y-Gómez1, and Luis Villaseñor-Pineda1

1 Laboratorio de Tecnologías del Lenguaje, Coordinación de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico. {antjug, hugojair, mmontesg, villasen}@ccc.inaoep.mx
2 Information Technologies Department, Universidad Autónoma Metropolitana (UAM), Unidad Cuajimalpa, Mexico City, Mexico. evillatoro@correo.cua.uam.mx

⋆ This work was done under partial support of CONACYT (project grants 134186 and 106013). We also thank SNI-Mexico, INAOE and UAM for their assistance.

Abstract This paper describes the system jointly developed by the Language Technologies Lab from INAOE and the Language and Reasoning Group from UAM for the Sexual Predator Identification task at PAN 2012. The presented system focuses on the problem of identifying sexual predators in a set of suspicious chat conversations. It is mainly based on the following hypotheses: (i) terms used in the process of child exploitation are categorically and psychologically different from terms used in general chatting; and (ii) predators usually apply the same course-of-conduct pattern when they are approaching a child. Based on these hypotheses, our participation at PAN 2012 aimed to demonstrate that it is possible to train a classifier to learn those particular terms that turn a chat conversation into a case of online child exploitation, and that it is also possible to learn the behavioral patterns of predators during a chat conversation, allowing us to accurately distinguish victims from predators.

1 Introduction

It is well known that the World Wide Web (WWW) has vastly penetrated social living and allows connecting people from different geographic regions through new forms of communication. Examples of such communication forms are instant messaging services, chat rooms, social networks (e.g., Facebook, Twitter, etc.) and blogs. These services have become very popular tools for personal as well as for group communication, as they are cheap, easy to use, virtual and private in nature [6].

Such online services allow users to hide their personal information behind the monitor; on the one hand, this makes this type of communication a source of fun, but on the other hand, it also represents a threat. The privacy and virtual nature of these services augment the possibilities of some heinous acts which one may not commit in the real world. Examples of such acts are the online paedophiles who "groom" children, that is, who meet underage victims online, engage in sexually explicit text or video chat with them, and eventually convince the children to meet them in person. According to the National Center for Missing & Exploited Children [9] and the Office of Juvenile Justice and Delinquency Prevention3, one out of every seven children receives an unwanted sexual solicitation online.

3 http://www.ojjdp.gov/
Traditionally, the term used to describe such malicious actions, with a potential aim of sexual exploitation or emotional connection with a child, is "Child Grooming" or "Grooming Attack" [3], which has been defined by [1] as: a communication process by which a perpetrator applies affinity-seeking strategies, while simultaneously acquiring information about and sexually desensitizing targeted victims, in order to develop relationships that result in need fulfilment (e.g., physical sexual molestation).

Nowadays, the usual way to catch these sexual predators is for trained law enforcement officers or volunteers to pose as children in online chat rooms, so that predators fall into the trap and are identified. However, online sexual predators always outnumber the law enforcement officers and volunteers. An organization that employs this methodology to catch sexual predators is the Perverted Justice group, located in the United States. This organization has been able to convict more than 500 predators since 20044. Nevertheless, there is a great need for software applications that can flag suspicious online chats automatically, either as a tool to aid law enforcement officials or as parental control features offered by chat service providers.

4 http://www.perverted-justice.com/

In this paper we propose a novel methodology that faces the problem of sexual predator identification as a text classification task by means of a supervised approach. The identification process is divided into two main stages: the Suspicious Conversations Identification (SCI) stage and the Victim From Predator disclosure (VFP) stage. Performed experiments showed that it is possible to train a classifier to learn those particular terms that turn a chat conversation into a case of online child exploitation, and that it is also possible to learn the behavioral patterns of predators during a chat conversation, allowing us to accurately distinguish victims from predators.

The rest of the paper is organized as follows. Section 2 presents recent work on the task of sexual predator identification. Section 3 describes the proposed methodology for identifying both suspicious chat conversations and the sexual predator within a chat conversation. Section 4 presents the experimental settings as well as the results achieved in the context of the PAN 2012 competition. Section 5 describes our approach for identifying the lines that reveal a predator's bad behavior. Finally, Section 6 depicts our conclusions and formulates directions for future work.

2 Related Work

Traditionally, the problem of identifying grooming attacks has been tackled through text classification strategies. In [5] the problem is reduced to the task of distinguishing predators from victims within chat conversations. Such conversations are known to be cases of grooming attacks; in particular, the authors used a set of 701 conversations obtained from the perverted-justice web page. In order to solve the problem, Pendar et al. separated the interventions that belong to the set of victims from those that belong to the set of predators, i.e., a two-class problem. Next, the authors removed the stopwords and computed word unigrams, bigrams and trigrams. Subsequently, they processed them with the Support Vector Machine (SVM) and k-nearest neighbors (k-NN) classification algorithms. The authors performed several experiments varying the number of features from 5,000 to 10,000, and concluded that the k-NN algorithm with k equal to 30 and a feature vector of 10,000 elements provides the most effective classification, with an F-measure of 0.943.
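As an illustration of this kind of feature extraction (a minimal sketch, not the exact setup of [5]: the tokenizer, stopword list and example lines below are our own assumptions), word n-gram features can be obtained as follows:

    # Sketch of word uni/bi/trigram extraction after stopword removal, in the
    # spirit of [5]; the example interventions are invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer

    interventions = [
        "hey how was school today",           # hypothetical chat lines
        "do your parents check your phone",
    ]

    # Keep at most 10,000 n-grams, matching the best setting reported in [5].
    vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english",
                                 max_features=10000)
    X = vectorizer.fit_transform(interventions)
    print(X.shape)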
Similarly, in [3], Michalopoulos et al. proposed a decision-making method for recognizing potential grooming threats by extracting information from captured dynamic dialogues. Their method uses the following three classes: i) Gaining Access: indicates the predator's intention to gain access to the victim; ii) Deceptive Relationship: indicates the deceptive relationship that the predator tries to establish with the minor, which is preliminary to a sexual exploitation attack; and iii) Sexual Affair: clearly indicates the predator's intention for a sexual affair with the victim. The classification process computes the probability that a captured dialogue belongs to each one of the above classes. At the end, their system decides whether there is a threat in the conversation by means of a linear combination of the computed probabilities. For their experiments, the authors employed a set of 219 chat conversations (73 for each class); they removed all the stopwords and applied a spelling correction strategy. Michalopoulos et al. concluded that Naïve Bayes is the most appropriate technique, not only for reaching the highest average classification score of 96%, but also for being faster than all the other algorithms that were evaluated.

Finally, the work proposed by RahmanMiah et al. faces the problem of grooming attack identification from a more general point of view [6]. Contrary to the work proposed in [5,3], the authors of [6] define a method for identifying when a conversation is a case of child exploitation instead of directly detecting which user is the predator. The system proposed by RahmanMiah et al. defines three classes of conversations: i) Child Exploitation: cases of grooming attacks; ii) Sexual Fantasies: conversations between adults with a high degree of sexual content; and iii) General Chatting: general conversations with no sexual content. The proposed system applies traditional text categorization techniques in combination with psychometric and categorical information provided by LIWC (Linguistic Inquiry and Word Count [4]). For their experiments, the authors employed a set of 392 conversations; they applied neither pre-processing to the texts nor a spelling correction process. The authors conclude that psychometric and categorical information can be used by classifiers as a feature set to predict suspected child exploitation in chats; these psychometric features significantly improve the performance of Naïve Bayes classifiers at predicting child-exploitation chats.

Our proposal differs from previous works in that it attacks both problems at once, i.e., we are able to identify when a chat conversation is a case of child exploitation and subsequently to tell which user is the sexual predator. Thus our proposal can be used for both purposes (detecting suspicious conversations and identifying predators); besides, we show that the two-step approach outperforms a single-stage method.
3 Proposed Method

The proposed method for detecting misbehaving users in chats is based on two main hypotheses: (i) terms used in the process of child exploitation are categorically and psychologically different from terms used in general chatting; and (ii) predators usually apply the same course-of-conduct pattern when they are approaching a child. Accordingly, we propose a new methodology for solving the problem of sexual predator identification, which is divided into two main stages: the Suspicious Conversations Identification (SCI) stage and the Victim From Predator disclosure (VFP) stage. Figure 1 shows the general architecture of the proposed system.

Figure 1. General overview of the proposed sexual predator identification system.

Notice that the goal of the first stage is to act as a filter, i.e., it helps distinguish between general chatting and possible cases of online child exploitation; in this way, the set of conversations to be analyzed by the VFP module is reduced. Hence we can focus only on conversations that potentially include sexual predators for a more fine-grained analysis. Consequently, the goal of the second stage is to identify (to point at) the potential predator in a possible case of child exploitation. The associated classification problem is less complex than trying to discriminate between predators and typical users directly; see Section 4.4.

3.1 Pre-processing stage

Our proposed system does not include any module for preprocessing the texts, i.e., we did not remove any punctuation marks or stopwords, nor did we apply a stemming process. The main reason for not applying any preprocessing is that the text in chat conversations has unique characteristics that distinguish it from any other type of text [2,6,7]; for example, chat conversations do not follow grammar rules (i.e., they are grammatically informal and unstructured), and they contain frequent orthographical errors and common use of abbreviations and emoticons.

We believe that emoticons and intentionally misspelled words may carry valuable contextual information in a chat text. For example, in the grooming phase the perpetrator may amend the relationship with an emphasized "soryyyyyyyyy" when the child feels threatened by any obtrusive language. Other examples are the emoticons for "hug (>:d<)" and "kiss (:-*)" used as a soft introduction to the sexual stage. Preserving such information makes traditional language processing tools, such as stemmers and POS taggers, unsuitable for processing chat texts [2,6].

3.2 Filtering stage

Although we did not apply any pre-processing, it is important to mention that we did apply a pre-filtering stage to all the conversations that were given for the PAN 2012 sexual predator competition. The goals of the pre-filtering stage were: i) to help us focus only on the most important cases and, ii) to reduce the computational cost of automatically processing all the information.

This pre-filtering stage consisted of removing all the conversations that met at least one of the following conditions: 1) conversations that had only one participant, 2) conversations that had fewer than 6 interventions per user, and 3) conversations that had long sequences of unrecognized characters (apparently images).
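These three rules are straightforward to implement. The following is a minimal sketch, under the assumption that each conversation has already been parsed into (user, message) pairs; the Conversation type and the threshold used to detect runs of unrecognized characters are our own illustrative choices, not part of the PAN data format:

    # Sketch of the pre-filtering rules of Section 3.2 (illustrative only).
    import re
    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class Conversation:
        messages: list  # list of (user_id, text) pairs

    def keep(conv: Conversation, min_interventions: int = 6) -> bool:
        per_user = Counter(user for user, _ in conv.messages)
        if len(per_user) < 2:                    # rule 1: only one participant
            return False
        if any(n < min_interventions for n in per_user.values()):
            return False                         # rule 2: < 6 interventions per user
        text = " ".join(t for _, t in conv.messages)
        if re.search(r"[^\x20-\x7E\s]{20,}", text):
            return False                         # rule 3: long runs of unrecognized
        return True                              #         characters (images?)

    conversations = [Conversation([("a", "hi"), ("b", "hello")])]  # toy corpus
    filtered = [c for c in conversations if keep(c)]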
Table 1 shows information about the training data before and after applying this pre-filtering stage.

Number of...          Original data   Filtered data
Chat conversations    66,928          6,588
Users                 97,690          11,038
Sexual predators      148             136

Table 1. Number of chat conversations, users and sexual predators that remain after applying the pre-filtering stage on the training data.

As can be seen, by means of the pre-filtering stage we reach a substantial reduction (approximately 90%) of conversations/users. It is important to notice that by doing this pre-filtering we also removed a few sexual predators. Thus, even if our proposed system worked perfectly, we would not be able to identify 100% of the sexual predators. Nevertheless, we think the information in the interventions of the removed predators was not enough to effectively recognize them as predators anyway.

3.3 Suspicious Conversations Identification

As we have mentioned before, our system faces the problem of sexual predator identification as a text classification (TC) task. Accordingly, for training the SCI classifier (Figure 1) we employed traditional TC techniques to construct a model that distinguishes between general chatting and cases of child exploitation.

In order to properly train our SCI classifier, we labeled as suspicious all the chat conversations where at least one predator appears, resulting in 5,790 non-suspicious conversations and 798 suspicious ones. The SCI classifier represents the conversations by means of the bag-of-words (BOW) representation, employing either a boolean or a TF-IDF weighting scheme. Since we did not apply any preprocessing to the chat texts, we obtained feature vectors of 117,015 elements.

3.4 Victim From Predator disclosure

Similarly to the SCI classifier, our VFP stage was designed using traditional TC techniques. The goal of the VFP classifier is to recognize sexual predators in suspicious chat conversations, as detected by the SCI method. For training the VFP classifier we divided every text conversation where a predator was involved into interventions. This means that if a text chat involved two different users, we had two sets of interventions. Therefore, we used as examples of victims the interventions of the users that had a conversation with a predator, resulting in 194 examples of victims; and as examples of predators we used the interventions of the 136 users already labeled as predators. The VFP classifier employs a BOW representation using either a boolean or a TF-IDF weighting scheme. For the VFP classifier we obtained feature vectors of 16,709 elements.

4 Experimental Setup

4.1 Data set

For our experiments we used the data set provided in the context of the PAN 2012 Sexual Predator Identification competition. As mentioned in Section 3.2, we were given for training a total of 66,928 different chat conversations, in which 97,690 different users are involved and only 148 are tagged as sexual predators (see Table 1). Additionally, a test data set for evaluation was provided. This corpus contained 155,129 chat texts, in which 218,702 different users are involved and only 250 are tagged as sexual predators. For a more detailed explanation of how the training and test corpora were constructed please visit http://pan.webis.de/.

4.2 Classification methods

Two classifiers from the CLOP toolbox [8] are used in the text classification experiments: Neural Networks (NN) and Support Vector Machines (SVM). The NN classifier was set as a two-layer neural network with a single hidden layer of 10 units. For the SVM we tried linear and polynomial kernels.
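Since CLOP is a MATLAB toolbox, the following sketch approximates our two-stage setup with scikit-learn instead; the weighting schemes and classifier settings mirror the description above, but the toy data and all variable names are illustrative assumptions:

    # Sketch of the SCI and VFP models (Figure 1); scikit-learn stands in for
    # the CLOP toolbox [8] actually used in our experiments.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    # SCI: one document per conversation; label 1 = suspicious conversation.
    sci = make_pipeline(CountVectorizer(binary=True),        # boolean BOW
                        MLPClassifier(hidden_layer_sizes=(10,), max_iter=500))

    # VFP: one document per user (that user's concatenated interventions in a
    # suspicious conversation); label 1 = predator, label 0 = victim.
    vfp = make_pipeline(TfidfVectorizer(),                   # tf-idf BOW
                        MLPClassifier(hidden_layer_sizes=(10,), max_iter=500))

    conv_texts = ["hi how are you doing today",              # toy training data
                  "dont tell ur mom we talked ok"]
    sci.fit(conv_texts, [0, 1])

    user_docs = ["im 13 i have school tomorrow",             # toy interventions
                 "u r so mature for ur age send a pic"]
    vfp.fit(user_docs, [0, 1])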
During the development phase we adopted two-fold cross validation to estimate the performance of our methods using training data only. This validation technique was used for all of our experiments. For the final evaluation of our system we used the test data provided by the organizers of PAN 2012; see Section 4.1. An analysis of the results using both training and test data sets is given in the following section. The evaluation of training-set results was carried out mainly by means of the classification accuracy, which indicates the overall percentage of text chats correctly classified. Additionally, due to the class imbalance, we also report results in terms of the F1 measure. Regarding the final evaluation of the system on test data, we used the measures proposed by the organizers, namely: F-0.5 measure, precision and recall.

4.3 Baseline definition

As baseline we used the traditional paradigm for solving the problem of sexual predator identification. Figure 2 provides a general view of the baseline definition.

Figure 2. General overview of the baseline system for identifying sexual predators.

As can be noticed, in the baseline the identification of sexual predators is performed in one single step. For the baseline experiment we employed a BOW representation using either a boolean or a TF-IDF weighting scheme. By following the same procedure established for the SCI and the VFP classifiers, under this configuration we obtained feature vectors of 117,015 elements.

4.4 Experimental results

In this section we report experimental results obtained by the components of our two-step approach, as well as the results obtained with the baseline method, using training data. Next, in Section 4.5, we report the performance of the system on the test data set.

Baseline results. Table 2 shows the results obtained by the baseline configuration. In order to evaluate whether a dimensionality reduction strategy could be helpful for the classifier, we performed several experiments varying the size of the feature vectors used to train the classifier. For this purpose we employed the well-known information gain (IG) method to rank and preserve the most distinctive features. Recall that the size of the feature vector is 117,015 elements; hence using only 10% of the features means that the classifiers employed only 11,702 features to represent the 11,038 chat conversations. The baseline configuration faces the problem of learning a model that distinguishes between normal users and sexual predators from a highly unbalanced corpus, i.e., 10,902 normal users and only 136 sexual predators.

Algorithm   Weighting   Num. of features   Accuracy   F-measure
SVM         binary      100%               0.9935     0.6869
SVM         binary      70%                0.9935     0.6869
SVM         binary      40%                0.9922     0.6587
SVM         binary      10%                0.9716     0.2737
SVM         tf-idf      100%               0.9926     0.6611
SVM         tf-idf      70%                0.9926     0.6611
SVM         tf-idf      40%                0.9907     0.5253
SVM         tf-idf      10%                0.9846     0.1747
NN          binary      100%               0.9936     0.6697
NN          binary      70%                0.9928     0.6153
NN          binary      40%                0.9939     0.5955
NN          binary      10%                0.9933     0.6696

Table 2. Results obtained with the baseline configuration. Several experiments applying a dimensionality reduction strategy were performed.

Table 2 indicates that using a binary weighting scheme allows a better performance for the classifier. It is also possible to observe that reducing the dimensionality of the feature vector does not yield a significant improvement; on the contrary, it decreases the ability of the classifier to accurately distinguish between normal users and sexual predators. Because of these results we decided not to use any dimensionality reduction strategy for either the SCI or the VFP classifier.
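For reference, the IG ranking used in these baseline experiments can be sketched as follows; scikit-learn's mutual_info_classif, which coincides with information gain for discrete features, stands in here, and the toy matrix is an illustrative assumption:

    # Sketch of IG-based feature ranking; mutual information over discrete
    # (binary BOW) features is exactly the information gain criterion.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 50))   # toy binary BOW matrix
    y = rng.integers(0, 2, size=100)         # toy labels (normal vs predator)

    ig = lambda X, y: mutual_info_classif(X, y, discrete_features=True)
    selector = SelectKBest(ig, k=X.shape[1] // 10)   # keep the top 10%
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)                   # (100, 5)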
SCI results. Table 3 shows the results obtained by the SCI classifier. As mentioned in previous sections, the aim of this classifier is to distinguish between general chatting and possible cases of online child exploitation. It is worth mentioning that the training data employed by the SCI classifier represent an unbalanced corpus, since there are 5,790 conversations labeled as general chatting and only 798 text chats labeled as cases of online child exploitation; however, the obtained results show that it is possible to accurately distinguish suspicious conversations.

Algorithm   Weighting   Accuracy   F-measure
SVM         binary      0.9848     0.9361
SVM         tf-idf      0.9883     0.9516
NN          binary      0.9874     0.9464
NN          tf-idf      0.9825     0.9254

Table 3. Results obtained by the SCI module.

Experimental results showed that both classifiers (i.e., SVM and NN) are suitable for solving the problem of classifying suspicious conversations. Contrary to the results obtained with the baseline configuration, the SCI classifier obtained better results with SVM when chat conversations are represented by means of a BOW with a tf-idf weighting scheme. We concluded from these experiments that terms used in the process of child exploitation are categorically and psychologically different from terms used in general chatting, which allows a classifier to learn those particular terms and accurately detect cases of online child exploitation.

VFP results. Table 4 shows the results obtained by the VFP classifier. As mentioned in Section 3.4, the aim of this module is, once a conversation has been tagged as suspicious, to point at the sexual predator, i.e., to tell which user is the victim and which one is the predator.

Algorithm   Weighting   Accuracy   F-measure
SVM         binary      0.9148     0.9138
SVM         tf-idf      0.9259     0.9305
NN          binary      0.9407     0.9424
NN          tf-idf      0.9296     0.9337

Table 4. Results obtained by the VFP module.

The obtained results show that the proposed methods are adequate for solving the problem of classifying victims and predators. Similarly to the results obtained with the SCI classifier, SVM obtained better results when a tf-idf weighting scheme was employed. Nevertheless, for the VFP classifier the best results were obtained when using the NN algorithm and a binary weighting scheme. From the experiments performed in this section we obtained evidence suggesting that predators apply the same course-of-conduct pattern when they are approaching a child, and that our proposed method is able to learn these behavioral patterns during a chat conversation, allowing us to accurately distinguish victims from predators.

4.5 PAN 2012 competition results

The previous sections described the results obtained on the training data set, where all experiments were performed in a controlled scenario. However, the main goal of the PAN 2012 Lab was to evaluate the proposed methods for sexual predator identification in realistic scenarios. This section reports the official results obtained by our system on the test data of the PAN 2012 competition.

In order to apply our proposed methodology, the first step consisted of applying the filtering stage (Section 3.2) to the test data. Table 5 shows some statistics of the test data before and after applying the filtering stage. From this table we can observe reduction ratios similar to those on the training data.

Number of...          Original data   Filtered data
Chat conversations    155,129         15,330
Users                 218,702         25,120
Sexual predators      254             222

Table 5. Number of chat conversations, users and sexual predators that remain after applying the pre-filtering stage on the test data.

A total of 28 predators were removed by our filtering approach. We manually analyzed the interventions of the removed predators and found that most of them contained only a few characters; we consider that with such little information it is not possible to effectively identify the removed predators. After filtering the test data, we were able to apply our proposed method (Figure 1). The next step was to represent all the remaining conversations (i.e., 15,330) in the model generated for the SCI classifier. Then, the chat conversations that the SCI classified as suspicious were divided into interventions and represented according to the model generated for the VFP classifier.
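This test-time application of the two stages amounts to the following sketch, reusing the hypothetical Conversation type and the sci and vfp pipelines from the earlier sketches (again, illustrative names rather than our actual implementation):

    # Sketch of applying the trained two-stage system to filtered test data.
    def interventions_by_user(conv):
        """Concatenate each user's messages into one VFP document."""
        docs = {}
        for user, text in conv.messages:
            docs[user] = docs.get(user, "") + " " + text
        return docs

    def predict_predators(conversations, sci, vfp):
        predators = set()
        for conv in conversations:
            text = " ".join(t for _, t in conv.messages)
            if sci.predict([text])[0] == 1:          # stage 1: suspicious?
                for user, doc in interventions_by_user(conv).items():
                    if vfp.predict([doc])[0] == 1:   # stage 2: predator?
                        predators.add(user)
        return predators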
Our team submitted three different runs: i) Baseline: the configuration shown in Figure 2, employing a Neural Network as classification method and a binary weighting scheme with no dimensionality reduction; ii) SCI(NN-B) & VFP(NN-TF-IDF): the SCI module was configured to use a NN with a binary weighting scheme, whereas the VFP module used a NN with a tf-idf weighting scheme; and iii) SCI(NN-B) & VFP(NN-B): both the SCI and the VFP modules were configured to use a NN with a binary weighting scheme.

The official results of the submitted runs are shown in Table 6. The leading evaluation measure was the F-0.5 measure, which emphasizes the precision of the system; this is particularly important for sexual predator detection, as mentioned by the organizers of the PAN 2012 competition.

Run                          Recall   Precision   F-measure   F-measure (β = 0.5)
Baseline                     0.4055   0.9537      0.5691      0.7507
SCI(NN-B) & VFP(NN-TF-IDF)   0.7874   0.9479      0.8602      0.9107
SCI(NN-B) & VFP(NN-B)        0.7874   0.9804      0.8734      0.9346

Table 6. Official evaluation results of the submitted runs.

As can be observed, the best configuration was the third one, i.e., using a binary weighting scheme in both modules, which supports the hypotheses of our work. Indeed, the configuration SCI(NN-B) & VFP(NN-B) obtained the highest performance among the 16 teams that participated in the sexual predator identification track of PAN 2012. The closest entry (snider12-run-2012-06-16-0032) achieved an F-0.5 measure of 0.9168, whereas the average was 0.5105. The results obtained by our group are promising and motivate us to pursue several future work directions.
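For reference, the reported scores follow the standard F_β definition; plugging the precision and recall of our best run into it reproduces the official score (our own arithmetic, rounded to four decimals):

    F_β = (1 + β²) · P · R / (β² · P + R)

    F_0.5 = (1.25 × 0.9804 × 0.7874) / (0.25 × 0.9804 + 0.7874)
          = 0.9650 / 1.0325 ≈ 0.9346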
5 Identifying predators' bad behavior

An additional task proposed by the organizers of PAN 2012, beyond sexual predator detection, was to search for the lines (interventions) that reveal a predator's bad behavior. Traditionally, such lines are manually identified and used as evidence to convict paedophiles. PAN 2012 participants were encouraged to propose new ideas to solve this issue automatically.

We approached the line-detection task with a language-model-based approach. Our main idea was based on the following observation: it is well known that every predator follows three main stages when approaching a child [3]: i) gain access to the victim, ii) involve the victim in a deceptive relationship and, iii) launch and prolong a sexually abusive relationship. Based on these facts, we believe that if we can generate language models from each one of the stages mentioned above, we will be able to find the lines that represent bad behavior. We were particularly interested in the 2nd and the 3rd stages, since from our point of view these should be the most critical sections within a child exploitation chat conversation.

Our proposed solution works as follows: first we automatically divide all the conversations where a predator appears into three sections; this division is made without considering any type of contextual boundaries, i.e., we did not identify where a child-approaching stage begins or ends. Next we generated the language model (lm) of the 2nd and the 3rd parts5. Finally, for a user that is tagged as a predator, we compute the perplexity of each one of its interventions against the lm, and we deliver as the most distinctive lines of bad behavior those with the lowest perplexity values.

5 We used the SLM toolkit (http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html) to this end.
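A minimal self-contained sketch of this ranking follows; since the actual experiments used the SLM toolkit, the tiny add-one-smoothed bigram model and all example lines below are illustrative stand-ins:

    # Sketch of the perplexity-based ranking of Section 5 (illustrative only).
    import math
    from collections import Counter

    def train_bigram_lm(lines):
        unigrams, bigrams = Counter(), Counter()
        for line in lines:
            tokens = ["<s>"] + line.lower().split() + ["</s>"]
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        vocab = len(unigrams)
        def logprob(w1, w2):  # add-one smoothed bigram log-probability
            return math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
        return logprob

    def perplexity(line, logprob):
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        lp = sum(logprob(a, b) for a, b in zip(tokens, tokens[1:]))
        return math.exp(-lp / (len(tokens) - 1))

    # lm trained on the 2nd/3rd parts of predator conversations (toy data):
    lm = train_bigram_lm(["what do u want me to wear", "do u want me to do it"])
    lines = ["what do u want me to say", "my favourite subject is math"]
    # the lowest-perplexity lines are the most characteristic of those stages
    print(min(lines, key=lambda l: perplexity(l, lm)))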
For the competition, we proceeded as follows: from the set of users labeled as sexual predators by our system (SCI(NN-B) & VFP(NN-B)), we selected the 50 most distinctive lines and delivered them as examples of predators' bad behavior. Table 7 shows examples of the lines that were delivered. The official evaluation results indicated that by following this procedure we were able to identify just 1 revealing line. We believe that our proposed idea could be very effective, although more work is necessary in order to improve its performance. Although this result seems absolutely negative, we have to mention that the winning participant of this sub-task submitted 63,290 lines (grozea12-run-2012-06-14-1706b).

From the 2nd parts            From the 3rd parts
what do you want me to be?    what do u want me to say
what do u want me to say      what do u want me to wear
what do u want me to wear     what do you want me to be?
do u want to talk to me too   do u want me to do it

Table 7. Examples of lines with the lowest perplexity values.

6 Conclusions

We have proposed a new methodology for detecting sexual predators in text chats. Our proposal differs from traditional approaches in that it divides the problem into two stages: the Suspicious Conversations Identification (SCI) stage and the Victim From Predator disclosure (VFP) stage. The goal of the first stage is to work as a filter, i.e., it helps distinguish between general chatting and possible cases of online child exploitation; in this way, the set of conversations to be analyzed is reduced, hence we can focus only on conversations that potentially include sexual predators for a more fine-grained analysis. Consequently, the goal of the second stage is to identify (to point at) the potential predator in a possible case of child exploitation.

The performed experiments showed that it is possible to train a classifier to learn those particular terms that turn a chat conversation into a case of online child exploitation, and that it is also possible to learn the behavioral patterns of predators during a chat conversation, allowing us to accurately distinguish victims from predators. Our participation in the PAN 2012 forum showed that the proposed methodology is able to produce very good results in a realistic scenario, obtaining an F-measure (β = 0.5) of 0.9346, which was the best ranked result among all of the participants.

As future work we plan to include some linguistic features, such as those proposed by [6]. We believe that the inclusion of this type of features can be helpful for increasing the recall of our proposed system. In addition, we also believe that this type of information could be very helpful in the process of identifying the interventions that depict the predator's bad behavior.

References

1. Harms C. Grooming: An operational definition and coding scheme. In Sex Offender Law Report, Vol. 8, Num. 1, pp. 1-6, 2007.
2. Kucukyilmaz T., Cambazoglu B. B., Aykanat C. and Can F. Chat mining: predicting user and message attributes in computer-mediated communication. In Information Processing and Management, Vol. 44(4), pp. 1448-1466, 2008.
3. Michalopoulos D. and Mavridis I. Utilizing document classification for grooming attack recognition. In IEEE Symposium on Computers and Communications (ISCC 2011), pp. 864-869, 2011.
4. O'Connell R. A Typology of Child Cybersexploitation and Online Grooming Practices. Cyberspace Research Unit, University of Central Lancashire, 2003. Retrieved from http://image.guardian.co.uk/sys-files/Society/documents/2003/07/17/Groomingreport.pdf (accessed August 2012).
5. Pendar N. Toward Spotting the Pedophile: Telling victim from predator in text chats. In IEEE International Conference on Semantic Computing, Irvine, California, USA, pp. 235-241, 2007.
6. RahmanMiah M. W., Yearwood J., and Kulkarni S. Detection of child exploiting chats from a mixed chat dataset as a text classification task. In Proceedings of the Australasian Language Technology Association Workshop, pp. 157-165, 2011.
7. Rosa K. D. and Ellen J. Text classification methodologies applied to micro-text in military chat. In Proceedings of the Eighth IEEE International Conference on Machine Learning and Applications (ICMLA '09), pp. 710-714, 2009.
8. Saffari A. and Guyon I. Quick Start Guide for CLOP. Technical report, Graz-UT and CLOPINET, May 2006.
9. Wolak J., Mitchell K., and Finkelhor D. Online victimization of youth: Five years later. National Center for Missing & Exploited Children Bulletin 07-06-025, Alexandria, VA, 2006.