Searching Sexual Predators in Social Networks

              Yuridiana Alemán, Darnes Vilariño, and David Pinto

                   Facultad de Ciencias de la Computación
             Benemérita Universidad Autónoma de Puebla, México
 yuridiana.aleman@gmail.com, darnes@solarium.cs.buap.mx, dpinto@cs.buap.mx


       Abstract. In this paper we propose a two-step technique for detecting
       sexual predators in social network dialogues. The first step detects
       the dialogues in which a sexual predator participates, and the second
       step identifies, among all the users of those dialogues, the one who is
       the sexual predator. Of the three supervised classifiers employed,
       Random Forests obtained the best results in the first step, whereas
       Neural Networks performed best in the second step.


Keywords: Search, Supervised classification, Sexual predators.


1     Introduction
Sexual predators have found a new way of selecting victims through the use
of social networks. It is relatively easy for these predators to pretend to be a
child or teenager in order to gain the confidence of their victims. For this
reason, there have been various attempts to detect this kind of behavior by
analyzing conversations in chat rooms.
     Most research on this topic takes Pendar [1] as a reference point. That
work uses a dataset gathered from a website named “Perverted Justice”1
to study automatic text categorization techniques for identifying online
sexual predators. More recently, Villatoro-Tello et al. [2] performed conversation
filtering by removing the shortest conversations with unintelligible characters or
those conversations in which the chat participants have a very low number of
interventions. This pre-processing step makes it possible to reduce drastically
the amount of text contained in the training set.


2     Methodology
The methodology proposed for searching/identifying sexual predators is shown
in Figure 1. The proposal consists of two steps: 1) a classification process
that identifies the conversations in which a sexual predator participates;
and 2) a classification process that distinguishes the predator’s
dialogue from that of the other participants. In both steps, we use the following
classification algorithms: neural networks, random forests, and decision trees.
1 http://perverted-justice.com
           Fig. 1. Methodology for detecting sexual predator conversations
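
    The two steps of Figure 1 can be sketched as a small filter-then-label
pipeline. This is only an illustration under stated assumptions: the function
find_predators, the Conversation type, and the two classifier callables are
hypothetical names that do not appear in the original system, and any trained
model could be plugged in for either step.

```python
from typing import Callable, Dict, List

# Assumption: a conversation maps each user id to that user's messages.
Conversation = Dict[str, List[str]]


def find_predators(
    conversations: List[Conversation],
    is_suspicious: Callable[[Conversation], bool],
    is_predator: Callable[[List[str]], bool],
) -> List[str]:
    """Step 1 keeps suspicious conversations; step 2 labels their users."""
    # Step 1: keep only conversations flagged by the first classifier.
    flagged = [c for c in conversations if is_suspicious(c)]

    # Step 2: within the flagged conversations, classify each user.
    predators = []
    for conversation in flagged:
        for user, messages in conversation.items():
            if is_predator(messages):
                predators.append(user)
    return predators
```

The second classifier only sees users from conversations that passed the
first one, which mirrors the way the second corpus is built from the output
of step 1.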


    We built and used three lexical resources (dictionaries) for the pre-processing
step: emoticons (“:-)” is normalized as “happy”), contractions (“isn’t” is
normalized as “is not”), and SMS vocabulary (“10q” is normalized as “thank you”).
Afterwards, we extracted the features using a POS tagger [3], taking every
morphological feature as an attribute2. For the experiments, we used the
union of two different conversation sets: the Perverted Justice set used in [1] and the
PAN 2012 training set3, which contains conversations provided by the PAN 2012
conference committee and is structured in XML format.
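
    A minimal sketch of this pre-processing, assuming whitespace tokenization
and exact dictionary lookup, is given below. The three dictionary entries are
the examples quoted above; everything else (the names normalize and
tag_counts, and the use of tag frequencies as attributes) is an assumption
made for illustration, since the exact attribute encoding is not detailed
here. The tagger itself is not called: tag_counts expects tokens that have
already been tagged.

```python
from collections import Counter

# Illustrative dictionaries; only these three entries come from the text.
EMOTICONS = {":-)": "happy"}
CONTRACTIONS = {"isn't": "is not"}
SMS_TERMS = {"10q": "thank you"}


def normalize(message: str) -> str:
    """Replace emoticons, contractions and SMS terms token by token."""
    out = []
    for token in message.split():
        # Lower-case and unify the apostrophe before the lookup.
        key = token.lower().replace("\u2019", "'")
        for lexicon in (EMOTICONS, CONTRACTIONS, SMS_TERMS):
            if key in lexicon:
                token = lexicon[key]
                break
        out.append(token)
    return " ".join(out)


def tag_counts(tagged_tokens):
    """Count how often each PoS tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)
```

For example, normalize("10q it isn't bad :-)") yields "thank you it is not
bad happy". Tokens with attached punctuation would not match in this sketch.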


3     Experimental results

We carried out experiments using several classification algorithms
implemented in the Weka tool [4]. We selected the algorithms
that obtained the best results: decision trees [5], random forests [6], and neural
networks (backpropagation algorithm).
    For the evaluation of results, we used the Weka option “Use training set” for
step 1 and “Cross-validation” with 10 folds for step 2. We then computed
the Precision (P), Recall (R) and F-score (F), and the step 1 model with the
best F-score was used to select the conversations for the second step.
    Table 1 shows the results obtained for every classifier in both steps. TRC
is the total number of retrieved instances, and PCR is the number of retrieved
instances that are actually positive, so that precision equals PCR/TRC.
Of the 2,353 positive conversations in the dataset, the best model identified
2,076. This performance was obtained by Random Forests,
with 0.983 precision, 0.882 recall and 0.930 F-score, thus identifying
very well those dialogues in which a sexual predator participates.
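
    The reported values can be reproduced from the two count columns of
Table 1 and the number of positive instances. The short sketch below uses the
standard definitions of precision, recall and F-score; the function name
scores and its parameter names are hypothetical.

```python
def scores(retrieved: int, correct: int, positives: int):
    """Return (precision, recall, F-score) from raw counts."""
    precision = correct / retrieved   # correct retrievals / all retrievals
    recall = correct / positives      # correct retrievals / all positives
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Random Forests, step 1: 2,111 retrieved, 2,076 correct, 2,353 positives.
p, r, f = scores(2111, 2076, 2353)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.983 0.882 0.93
```

The same function applied to the step 2 counts for Neural Networks (516
retrieved, 399 correct, 480 predators) gives 0.773, 0.831 and 0.801.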
    Using the conversations detected by the Random Forest classifier, we
reconstructed the dialogues in which a sexual predator participates. The
second corpus thus contains 480 conversations of sexual predators and 442
conversations of non-predators (922 conversations in total). In this step, Random
2 http://bit.ly/WHsvBN
3 http://pan.webis.de/
     Table 1. Results obtained for the classification of conversations and users

               Classifier         TRC    PCR   Precision  Recall  F-Score
                              Step 1: Conversations
               Decision Trees   1,146  1,110     0.968     0.471   0.634
               Neural Networks    905    766     0.846     0.325   0.470
               Random Forests   2,111  2,076     0.983     0.882   0.930
                                  Step 2: Users
               Decision Trees     472    347     0.735     0.723   0.729
               Neural Networks    516    399     0.773     0.831   0.801
               Random Forests     538    402     0.747     0.838   0.790


Forests retrieved more predators than Neural Networks did (402 versus 399), but
it also retrieved more false positives (136 versus 117). Considering the F-score,
the best result is obtained using Neural Networks (399 of 480 predators retrieved).


4    Conclusions and future work
We presented a two-step system for detecting sexual predators online. The
representation of conversations using PoS tags made it possible to identify
terminology employed by sexual predators, as shown by the values obtained in
the experiments.
   The normalization of texts has had a high impact on the results obtained
and needs to be further investigated. Additionally, we are interested in analyzing
new features that allow us to detect “all” the conversations in which a sexual
predator participates.


References
1. Pendar, N.: Toward spotting the pedophile: telling victim from predator in text
   chats. In: Proceedings of the International Conference on Semantic Computing.
   ICSC ’07, Washington, DC, USA, IEEE Computer Society (2007) 235–241
2. Villatoro-Tello, E., Juárez-González, A., Escalante, H.J., Montes-y-Gómez, M.,
   Villaseñor-Pineda, L.: A two-step approach for effective detection of misbehaving
   users in chats. In: CLEF (Online Working Notes/Labs/Workshop). (2012)
3. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech
   tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference
   of the North American Chapter of the Association for Computational Linguistics
   on Human Language Technology - Volume 1. NAACL ’03, Stroudsburg, PA, USA,
   Association for Computational Linguistics (2003) 173–180
4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
   WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1) (November
   2009) 10–18
5. Quinlan, J.R.: C4.5: Programs for Machine Learning (Morgan Kaufmann Series in
   Machine Learning). 1 edn. Morgan Kaufmann (October 1992)
6. Breiman, L.: Random forests. Mach. Learn. 45(1) (October 2001) 5–32