<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Searching Sexual Predators in Social Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuridiana Alemán</string-name>
          <email>yuridiana.aleman@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Darnes Vilariño</string-name>
          <email>darnes@solarium.cs.buap.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Pinto</string-name>
          <email>dpinto@cs.buap.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla</institution>
          ,
          <addr-line>México</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we propose a two-step technique for detecting sexual predators in social network dialogues. The first step detects the dialogues in which a sexual predator participates; the second step identifies, among all the users of those dialogues, the one who is the sexual predator. Of the three supervised classifiers employed, Random Forests obtained the best results in the first step, whereas Neural Networks performed best in the second step.</p>
      </abstract>
      <kwd-group>
        <kwd>Search</kwd>
        <kwd>Supervised classification</kwd>
        <kwd>Sexual predators</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sexual predators have found a new way of selecting victims through
social networks. It is relatively easy for these predators to pretend to be a
child or teenager in order to gain the confidence of their victim. For
this reason, there have been various attempts to detect this kind of behavior by
analyzing conversations in chat rooms.</p>
      <p>
        Most related work takes Pendar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as its reference point. In that work,
the author uses a dataset gathered from a website named “Perverted Justice”<sup>1</sup>
to study automatic text categorization techniques for
identifying online sexual predators. More recently, Villatoro-Tello et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] filtered conversations
by removing the shortest conversations with unintelligible characters and
those conversations in which the chat participants make very few
interventions. This pre-processing step drastically reduces
the amount of text contained in the training set.
      </p>
      <p>
The methodology proposed for searching for and identifying sexual predators is shown
in Figure 1. The proposal consists of two steps: 1) a classification process
that discriminates the conversations in which a sexual predator
participates; and 2) a classification process that discriminates the predator’s
dialogues from those of the other participants. In both steps, we use the following
classification algorithms: neural networks, random forests and decision trees.
      </p>
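      <p>As a rough illustration of the two-step design described above, the following Python sketch chains a conversation-level decision with a user-level decision. It is not the implementation used in the experiments, which relied on Weka; the function names, the data layout and the toy keyword rule that stands in for the trained classifiers are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, List

# A conversation maps each (hypothetical) user id to that user's messages.
Conversation = Dict[str, List[str]]


def find_predators(
    conversations: List[Conversation],
    is_suspicious: Callable[[Conversation], bool],
    is_predator: Callable[[List[str]], bool],
) -> List[str]:
    """Run step 1 on whole conversations, then step 2 on each participant."""
    flagged_users = []
    for conversation in conversations:
        # Step 1: keep only conversations judged to involve a predator.
        if not is_suspicious(conversation):
            continue
        # Step 2: inside a flagged conversation, decide which user it is.
        for user, messages in conversation.items():
            if is_predator(messages):
                flagged_users.append(user)
    return flagged_users


# Toy stand-ins for the trained classifiers (illustration only).
def toy_step1(conversation):
    return any("secret" in m for msgs in conversation.values() for m in msgs)


def toy_step2(messages):
    return any("secret" in m for m in messages)


chats = [
    {"u1": ["hi", "keep it secret"], "u2": ["ok"]},
    {"u3": ["hello"], "u4": ["bye"]},
]
print(find_predators(chats, toy_step1, toy_step2))  # ['u1']
```
]]></preformat>
      <p>Passing the two classifiers as plain callables keeps the control flow independent of the learning algorithm, so any of the three algorithms could in principle be plugged into either step.</p>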
      <sec id="sec-1-1">
        <title>1 http://perverted-justice.com</title>
        <p>
          We built and used three lexical resources (dictionaries) for the pre-processing
step: emoticons (“:-)” is normalized as “happy”), contractions (“isn’t” is
normalized as “is not”), and SMS vocabulary (“10q” is normalized as “thank you”).
Afterwards, we extracted the features using a POS tagger [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and we used every
morphological feature as an attribute<sup>2</sup>. For the experiments, we used the
union of two different conversation sets: the Perverted Justice set used in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and the
PAN 2012 training set<sup>3</sup>, which contains conversations provided by the PAN 2012
conference committee and structured in XML format.
        </p>
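        <p>The pre-processing and feature extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions: each dictionary holds only the single example given in the text, tokens are split on whitespace, and the POS features are assumed to be simple per-tag counts over tagger output in the Penn Treebank style. The names <monospace>normalize</monospace> and <monospace>pos_tag_counts</monospace> are hypothetical, and the tagger itself is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# One example entry per resource, taken from the text; the real
# dictionaries are assumed to be far larger.
EMOTICONS = {":-)": "happy"}
CONTRACTIONS = {"isn't": "is not"}
SMS_VOCABULARY = {"10q": "thank you"}
LEXICON = {**EMOTICONS, **CONTRACTIONS, **SMS_VOCABULARY}


def normalize(message):
    """Replace known emoticons, contractions and SMS terms, token by token."""
    # Treat the typographic apostrophe like the plain one before the lookup.
    tokens = message.replace("\u2019", "'").split()
    # Try the token as written first, then its lower-cased form.
    return " ".join(LEXICON.get(t, LEXICON.get(t.lower(), t)) for t in tokens)


def pos_tag_counts(tagged_tokens, tagset):
    """Map (word, tag) pairs to one count per tag, in the order of tagset."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    return [counts[tag] for tag in tagset]


print(normalize("10q :-) it isn't late"))  # thank you happy it is not late

# Hypothetical tagger output; the tags follow the Penn Treebank style.
tagged = [("i", "PRP"), ("am", "VBP"), ("happy", "JJ"), ("you", "PRP")]
print(pos_tag_counts(tagged, ["JJ", "PRP", "VBP"]))  # [1, 2, 1]
```
]]></preformat>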
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental results</title>
      <p>
        We carried out experiments using several classification algorithms
implemented in the Weka [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] tool. We selected the three algorithms
that obtained the best results: decision trees [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], random forests [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and neural
networks (backpropagation algorithm).
      </p>
      <p>To evaluate the results, we used the Weka option “Use training set” for
step 1 and “Cross-validation” with 10 folds for step 2. We then computed
the Precision (P), Recall (R) and F-score (F), and used the step 1 model with the best
F-score to provide the input for the second step.</p>
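      <p>For readers unfamiliar with the second evaluation mode, the sketch below shows the idea behind 10-fold cross-validation: the instances are split into ten folds, and each fold is held out once for testing while the rest are used for training. It is a simplified illustration, not Weka's procedure, which also shuffles and stratifies the data; the function name <monospace>k_fold_indices</monospace> and the instance count of 25 are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    for fold in range(k):
        # Every k-th instance, starting at position `fold`, is held out.
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(25, k=10))
for train, test in folds[:2]:
    print(len(train), test)
```
]]></preformat>
      <p>With the “Use training set” option, by contrast, the same instances serve for both training and testing, so the step 1 figures are likely to be optimistic compared with cross-validated ones.</p>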
      <p>Table 1 shows the results obtained by every classifier in both steps. TRC
is the total number of retrieved instances, and TCP (the PCR column of Table 1) is the
number of those retrieved instances that are actually positive.
Of the 2,353 positive conversations in the dataset, the best model identified
2,076. The best performance was obtained by Random
Forests, with a precision of 0.983, a recall of 0.882 and an F-measure of 0.930, thus identifying
very well the dialogues in which a sexual predator participates.</p>
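      <p>The figures in Table 1 can be rechecked from the raw counts. The sketch below assumes the standard definitions (precision is the number of correctly retrieved instances divided by the number retrieved, recall is the same number divided by the number of positives, and the F-score is their harmonic mean), which the text does not spell out. The counts are copied from Table 1, and the function names are hypothetical. Under these assumptions the recomputed values agree with the reported ones to within 0.001.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_recall_f(retrieved, correct, positives):
    """Return (precision, recall, F-score) from raw counts."""
    precision = correct / retrieved
    recall = correct / positives
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# (retrieved, correctly retrieved) pairs copied from Table 1.
STEP1_ROWS = {
    "Decision Trees": (1146, 1110),
    "Neural Networks": (905, 766),
    "Random Forests": (2111, 2076),
}
STEP2_ROWS = {
    "Decision Trees": (472, 347),
    "Neural Networks": (516, 399),
    "Random Forests": (538, 402),
}
STEP1_POSITIVES = 2353  # positive conversations in the dataset
STEP2_POSITIVES = 480   # predators to be retrieved in step 2


def best_by_f_score(rows, positives):
    """Name of the classifier whose counts give the highest F-score."""
    return max(rows, key=lambda name: precision_recall_f(*rows[name], positives)[2])


p, r, f = precision_recall_f(*STEP1_ROWS["Random Forests"], STEP1_POSITIVES)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.983 0.882 0.930"
print(best_by_f_score(STEP1_ROWS, STEP1_POSITIVES))  # Random Forests
print(best_by_f_score(STEP2_ROWS, STEP2_POSITIVES))  # Neural Networks
```
]]></preformat>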
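      <p>Using the conversations detected by the Random Forest classifier, we
reconstructed the dialogues in which a sexual predator participates. Thus,
the second corpus contains 480 conversations of sexual predators and 442
conversations of non-sexual predators (922 conversations in total). In this step, Random
Forests retrieved more predators than Neural Networks did (402 versus 399), but they also retrieved
more false positives (136 versus 117). Considering the F-score, the best result is obtained
using Neural Networks (399 of 480 predators retrieved).</p>
      <sec id="sec-2-1">
        <title>2 http://bit.ly/WHsvBN 3 http://pan.webis.de/</title>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Classifier</th><th>TRC</th><th>PCR</th><th>Precision</th><th>Recall</th><th>F-Score</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="6">Step 1: Conversations</td></tr>
              <tr><td>Decision Trees</td><td>1,146</td><td>1,110</td><td>0.968</td><td>0.471</td><td>0.634</td></tr>
              <tr><td>Neural Networks</td><td>905</td><td>766</td><td>0.846</td><td>0.325</td><td>0.470</td></tr>
              <tr><td>Random Forests</td><td>2,111</td><td>2,076</td><td>0.983</td><td>0.882</td><td>0.930</td></tr>
              <tr><td colspan="6">Step 2: Users</td></tr>
              <tr><td>Decision Trees</td><td>472</td><td>347</td><td>0.735</td><td>0.723</td><td>0.729</td></tr>
              <tr><td>Neural Networks</td><td>516</td><td>399</td><td>0.773</td><td>0.831</td><td>0.801</td></tr>
              <tr><td>Random Forests</td><td>538</td><td>402</td><td>0.747</td><td>0.838</td><td>0.790</td></tr>
            </tbody>
          </table>
        </table-wrap>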
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and future work</title>
      <p>We presented a two-step system for detecting sexual predators online.
Representing conversations with PoS tags made it possible to identify terminology
employed by sexual predators, as shown by the values obtained in the experiments.</p>
      <p>The normalization of texts had a high impact on the results obtained
and needs to be investigated further. Additionally, we are interested in analyzing
new features that allow us to detect “all” the conversations in which a sexual
predator participates.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Pendar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Toward spotting the pedophile: telling victim from predator in text chats</article-title>
          .
          <source>In: Proceedings of the International Conference on Semantic Computing. ICSC '07</source>
          , Washington, DC, USA, IEEE Computer Society (
          <year>2007</year>
          )
          <fpage>235</fpage>
          -
          <lpage>241</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Villatoro-Tello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juárez-González</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A two-step approach for effective detection of misbehaving users in chats</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop). (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. NAACL '03</source>
          ,
          Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2003</year>
          )
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>SIGKDD Explor. Newsl</source>
          .
          <volume>11</volume>
          (
          <issue>1</issue>
          ) (
          <year>November 2009</year>
          )
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>C4.5: Programs for Machine Learning</source>
          (Morgan Kaufmann Series in Machine Learning).
          <edition>1</edition>
          edn. Morgan Kaufmann (
          <year>October 1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>45</volume>
          (
          <issue>1</issue>
          ) (
          <year>October 2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>