<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Vuong.nm@ou.edu.vn (V. M. Ngo); Susan.Mckeever@tudublin.ie convolutional neural network models or deep perceptual
(S. Mckeever); Christina.Thorpe@tudublin.ie (C. Thorpe) hashing algorithms with the goal of removing CSAM</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Identifying Online Child Sexual Texts in Dark Web through Machine Learning and Deep Learning Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vuong M. Ngo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Susan Mckeever</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Thorpe</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information System Management Center, Ho Chi Minh City Open University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, Technological University Dublin</institution>
          ,
          <addr-line>Grangegorman, Dublin 7</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Informatics and Cybersecurity, Technological University Dublin</institution>
          ,
          <addr-line>Blanchardstown, Dublin 15</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Predators often use the dark web to discuss and share Child Sexual Abuse Material (CSAM) because the dark web provides a degree of anonymity, making it more difficult for law enforcement to track the criminals involved. In most countries, CSAM is considered forensic evidence of a crime in progress. Processing, identifying and investigating CSAM is often done manually, which is a time-consuming and emotionally challenging task. In this paper, we propose a novel model based on artificial intelligence algorithms to automatically detect CSA text messages in dark web forums. Our algorithms achieved a recall of 89%, a precision of 92.3% and an accuracy of 87.6% in detecting CSAM on the dark web. Moreover, the algorithms can classify a post in about 1 microsecond (NB) and 0.3 milliseconds (SVM) on a standard laptop. This makes it possible to integrate our model into social network sites or edge devices for real-time CSAM detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Child sexual exploitation material</kwd>
        <kwd>CSEM</kwd>
        <kwd>CSAM</kwd>
        <kwd>text content</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>forums</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In general, Child Sexual Abuse Material (CSAM) includes any visual, written or audio material that depicts or describes sexual abuse of children. This can include photographs, videos, stories, chats, comments, drawings or any other media. The production and distribution of CSAM has negative impacts on victims and society. Victims can live with long-term psychological, emotional, and physical harm [1]. A high volume of CSAM is created and shared daily on both surface web platforms, such as social network sites, and dark web forums. It is not viable for human experts to investigate, detect and prevent CSAM manually [2]. However, automatically detecting and analysing online CSA text can be extremely challenging and time-consuming, due to language complexity, contextual ambiguity, the dynamic nature of language and the large volume of data. This is particularly the case for CSAM shared on the dark web, where privacy and anonymity are prioritized. Moreover, perpetrators often use code words, slang, or other forms of obfuscation to avoid detection and hide their activities.</p>
      <p>In this context, we propose a CSAM detection intelligence model based on both classical Machine Learning (ML) and Deep Learning (DL) techniques. Our CSAM detection model can be used to monitor and remove CSA texts on online platforms in real-time and with high accuracy, providing better protection for children. We have also created a manually labelled dataset of CSAM and non-CSAM content that can be used to train and test CSAM detection algorithms. In the future, our model will be able to detect perpetrator behaviours, collect forensic evidence, and extract valuable knowledge for child agencies, hotlines, education programs and policy makers.</p>
      <p>The remainder of the paper is organised as follows. In Section 2 we review the related work. Section 3 presents our system architecture and the machine learning and deep learning algorithms. The evaluation methodology and experimental results for the system are shown in Section 4. Finally, we conclude and give some future directions in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>[...] and DL models to process CSA text. In [8], Naive Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) were applied to detect online abusive and bullying comments on Facebook and Twitter. In [9], histogram gradient boosted decision trees were exploited for predatory chat conversation detection. In [<xref ref-type="bibr" rid="ref4">10</xref>], Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) were applied for YouTube comments. In [11], the DeepWalk model and graph embedding representations were used to detect abusive chat logs in French on the SpaceOrigin game. In [12], Logistic Regression (LR), XGBoost and Multi Layer Perceptron (MLP) were exploited to detect sexual predatory chats in social networks. To train and test ML/DL models, the papers [8] and [11] created their own datasets and the papers [9], [<xref ref-type="bibr" rid="ref4">10</xref>] and [12] used third-party datasets. However, these papers considered the clear web, not the dark web.</p>
      <p>Dark web data was also processed in [13], [14], [15], [16] and [17]. However, these approaches did not automatically detect CSA text on the dark web by using post contents and artificial intelligence. In [13], 450 authorised hidden service sites were manually classified. In [14], the authors analysed seven popular dark web sites to monitor the sites by using their metadata, e.g. the number of users, site names and common users in sites. In [15], the authors statistically analysed some simple metadata, e.g. victim ages and the number of CSAM reports per year. In [16], the K-Means algorithm was applied to cluster the forum comments into seven selected labels, i.e. breach, financial, drug, vendor, account, product and other. In [17], the authors manually analysed transcripts of 53 anonymous suspects in the United Kingdom to understand the suspects' interaction behaviors and sexual interests.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Architecture and Algorithms</title>
      <sec>
        <title>3.1. System Architecture</title>
        <p>Supervised learning in classical ML and DL is a popular method for text classification based on learning patterns from labelled training samples [18, 19]. Every supervised learning algorithm has its strengths and weaknesses. Therefore, to find a suitable algorithm to classify CSAM post contents, we apply the two most popular classical ML algorithms, NB and SVM, and the two most popular DL algorithms, LSTM and BERT (Bidirectional Encoder Representations from Transformers). More details can be found in Section 3.2.</p>
        <p>Figure 1 shows our system architecture used to design and implement our novel algorithm for CSAM text classification. In it, the Supervised Method component implements the NB, SVM, LSTM and BERT algorithms. The algorithms tokenize the post texts and transform them into vector representations using TF-IDF (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) in NB and SVM, or embedding layers (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) in LSTM and BERT.</p>
        <p>The Evaluation component is used to determine the execution times (i.e., training time and prediction time) and the classification performance metrics (i.e., precision, recall and accuracy) of each combination of algorithms. To avoid overfitting, we apply 5-fold cross-validation to our dataset: the algorithm uses 4 folds for the training set and the remaining fold for the testing set. This process is repeated until every fold has served as the testing set. The average of the classification measures recorded over the five rounds is the classification performance measure for the algorithm. Finally, we analyse the experimental results on the dataset to recommend the best algorithm for CSAM text classification on the dark web.</p>
      </sec>
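      <p>The 5-fold cross-validation procedure described in Section 3.1 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the paper: the function name stratified_k_fold is hypothetical, and the class-balanced (stratified) split is an assumption, chosen because it reproduces fold sizes of 3,680 training and 920 testing posts for a dataset of 2,500 CSAM and 2,100 non-CSAM posts.</p>
      <preformat>
```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class ratios equal in every fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the k folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train_idx, test_idx


# Label layout mirrors the dataset sizes: 2,500 CSAM (1) and 2,100 non-CSAM (0).
labels = [1] * 2500 + [0] * 2100
splits = list(stratified_k_fold(labels, k=5))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))  # 3680 920
```
      </preformat>
      <p>In each round a classifier would be trained on train_idx and scored on test_idx, and the five scores averaged to give the reported measure.</p>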
      <sec id="sec-3-1">
        <title>3.2. Machine Learning and Deep Learning</title>
        <p>NB is a probabilistic classifier that relies on applying Bayes' theorem with certain simplifying assumptions. NB is widely used in natural language processing, spam filtering, and other applications where it is necessary to classify items into different categories based on probabilistic features. It assumes that the features are strongly independent to simplify computation. We used the Naive Bayes algorithm implemented in [20] (scikit-learn's MultinomialNB), with parameters alpha = 1 and fit_prior = True, where alpha is the additive smoothing parameter and fit_prior determines whether or not to learn class prior probabilities.</p>
        <p>Figure 2: sexual abuse single words in dark web forums</p>
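        <p>To illustrate the roles of the additive smoothing parameter and the learned class priors, the following is a minimal pure-Python sketch of a multinomial Naive Bayes classifier. It is not the scikit-learn implementation cited in [20]: the function names, the whitespace tokenisation, the use of raw word counts instead of TF-IDF weights, and the placeholder documents and labels are all assumptions made for illustration.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0, fit_prior=True):
    """Fit word-count Naive Bayes; alpha is the additive (Laplace) smoothing term."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(doc.split())

    log_prior = {}
    log_likelihood = defaultdict(dict)
    for c in classes:
        if fit_prior:
            # Learn the prior from class frequencies in the training labels.
            log_prior[c] = math.log(labels.count(c) / len(labels))
        else:
            # Otherwise fall back to a uniform prior over the classes.
            log_prior[c] = math.log(1.0 / len(classes))
        total = sum(counts[c].values()) + alpha * len(vocab)
        for tok in vocab:
            log_likelihood[c][tok] = math.log((counts[c][tok] + alpha) / total)
    return classes, log_prior, log_likelihood


def predict(model, doc):
    """Return the class with the highest posterior log-score for one document."""
    classes, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc.split() if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Neutral placeholder tokens and labels, for illustration only.
docs = ["alpha beta beta", "beta gamma", "delta epsilon", "epsilon epsilon zeta"]
labels = ["flagged", "flagged", "benign", "benign"]
model = train_multinomial_nb(docs, labels, alpha=1.0, fit_prior=True)
print(predict(model, "beta beta"))     # flagged
print(predict(model, "epsilon zeta"))  # benign
```
        </preformat>
        <p>With alpha = 1, a word never seen in a class still receives a small non-zero probability, so a single unseen word cannot force a class score to zero.</p>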
        <p>SVM represents patterns as points in space and divides the data points by a clear gap. It constructs a maximum margin separator and can perform non-linear classification by using the so-called kernel trick. We used the C-support vector classification algorithm implemented in [21], with parameters C = 1.0, kernel = 'linear', degree = 3 and gamma = 'auto', where C is the regularization parameter, kernel is the kernel type used, degree is the degree of the polynomial kernel function and gamma is the kernel coefficient.</p>
        <p>Figure 3: sexual abuse two-word-phrases in dark web forums</p>
        <p>LSTM is a special kind of Recurrent Neural Network (RNN). RNN is a type of neural network commonly used to develop natural language processing models. RNN remembers the sequence of the data and exploits data patterns and feedback loops for prediction. LSTM was applied to avoid the long-term dependency problem in regular RNN. We used the Bidirectional-LSTM algorithm implemented in [22], with the layers Embedding(1000, 128, input_length=200), Bidirectional(LSTM(64)), Dropout(0.5) and Dense(1, activation='sigmoid').</p>
        <p>BERT is a language model using the transformer encoder architecture to process tokens in text. BERT applies pre-training and fine-tuning. Pre-training is unsupervised training on a large general corpus of text to create the BERT model. Fine-tuning is supervised training of the BERT model on a specific downstream task with relatively few labels, because the general linguistic patterns have already been learnt during pre-training. We used the BERT algorithm implemented in [23], with the models bert_en_uncased_preprocess_3 and bert_en_uncased_L-12_H-768_A-12_4, an output layer Dense(1, activation='sigmoid') and optimizer = 'adam'.</p>
      </sec>
      <sec id="sec-3-1-1">
        <title>3.3. Training and Testing Datasets</title>
        <p>Our first step is to create a labelled dataset that can be used for training or fine-tuning our classifier. The dataset used for our study was collected and supplied by the company Web-IQ, which provided us with over 352,000 posts from 8 dark web forums in 2022, of which approximately 221,000 were in English.</p>
        <p>Using a dictionary of 12,628 Sexual Abuse Phrases (SAPs) extracted from the THORN project (https://www.thorn.org/) and Web-IQ dark web forums (https://web-iq.com/solutions/osint-on-premises), we were able to detect approximately 177,000 English posts with no SAP and approximately 44,000 English posts with at least one SAP. This provides us with a high-level grouping of posts, but refinement is required to allow for CSAM posts that do not contain any SAPs, and vice versa. Figures 2 and 3 show the word clouds of single words and two-word phrases related to sexual abuse, extracted from post contents in dark web forums. The size of each word in the clouds represents its frequency in the forums.</p>
        <p>From the group of 177,000 posts with no SAP, experts randomly selected 2,000 non-CSAM posts and 500 CSAM posts. From the group of 44,000 posts with at least one SAP, experts randomly selected 2,000 CSAM posts and 100 non-CSAM posts. Ultimately, our manually labelled dataset contains 4,600 posts from the dark web, including 2,500 CSAM posts and 2,100 non-CSAM posts.</p>
      </sec>
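      <p>The dictionary-based grouping of posts and the composition of the labelled sample described in Section 3.3 can be sketched as follows. This is an illustrative sketch only: the helper names contains_phrase and group_posts, the whole-word matching rule and the placeholder phrases and posts are assumptions, and the paper does not state how phrase matching was implemented. Punctuation handling is deliberately left out.</p>
      <preformat>
```python
def contains_phrase(post, phrases):
    """Return True if any dictionary phrase occurs in the post as a whole-word sequence."""
    padded = " " + " ".join(post.lower().split()) + " "
    return any(" " + phrase.lower() + " " in padded for phrase in phrases)


def group_posts(posts, phrases):
    """Split posts into those with at least one dictionary phrase and those with none."""
    with_phrase, without_phrase = [], []
    for post in posts:
        target = with_phrase if contains_phrase(post, phrases) else without_phrase
        target.append(post)
    return with_phrase, without_phrase


# Placeholder phrases and posts; the dictionary in the paper holds 12,628 phrases.
phrases = ["term_a", "term_b term_c"]
posts = ["hello term_a there", "term_b then term_c", "nothing here", "say term_b term_c now"]
with_phrase, without_phrase = group_posts(posts, phrases)
print(len(with_phrase), len(without_phrase))  # 2 2

# Composition of the manually labelled sample reported in Section 3.3.
sample = {
    "no_phrase": {"csam": 500, "non_csam": 2000},
    "with_phrase": {"csam": 2000, "non_csam": 100},
}
csam_total = sum(group["csam"] for group in sample.values())
non_csam_total = sum(group["non_csam"] for group in sample.values())
print(csam_total, non_csam_total, csam_total + non_csam_total)  # 2500 2100 4600
```
      </preformat>
      <p>The 500 CSAM posts without any dictionary phrase and the 100 non-CSAM posts with one show why the dictionary match alone cannot serve as the label.</p>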
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experiment Setup and Quality Measures</title>
      </sec>
      <sec id="sec-4-2">
        <sec id="sec-4-2-1">
          <p>The algorithms were implemented using Python 3.10, the scikit-learn library 1.2.2 (for NB and SVM), and the keras library 1.1.2 running on top of the tensorflow library 2.10.0 (for LSTM and BERT). All experiments were run under Windows 10 (64-bit) on a Dell laptop with an Intel Core i7 CPU (3.00 GHz) and 16 GB of memory.</p>
          <p>To measure the quality of the predicted classes of posts compared to the correct classes, we apply the most commonly used metrics, namely accuracy, precision and recall [24, 25]. The metrics are derived from the four categories in the confusion matrix: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), defined as follows:</p>
          <p>• TP: Posts for which the algorithm predicted CSAM and the correct class was also CSAM.</p>
          <p>• FP: Posts for which the algorithm predicted CSAM, but the correct class was non-CSAM.</p>
          <p>• TN: Posts for which the algorithm predicted non-CSAM and the correct class was non-CSAM.</p>
          <p>• FN: Posts for which the algorithm predicted non-CSAM, but the correct class was CSAM.</p>
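          <p>The four confusion-matrix categories above can be counted from paired lists of correct and predicted classes. The sketch below is illustrative; the function name confusion_counts and the five example labels are hypothetical and are not taken from the dataset.</p>
          <preformat>
```python
def confusion_counts(y_true, y_pred, positive="CSAM"):
    """Count TP, FP, TN and FN for a binary task with the given positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # predicted positive, but the correct class was negative
        elif truth == positive:
            fn += 1  # predicted negative, but the correct class was positive
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels for five posts.
y_true = ["CSAM", "CSAM", "non-CSAM", "non-CSAM", "CSAM"]
y_pred = ["CSAM", "non-CSAM", "non-CSAM", "CSAM", "CSAM"]
print(confusion_counts(y_true, y_pred))
# {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```
          </preformat>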
        </sec>
        <sec id="sec-4-2-2">
          <p>Accuracy (ACC) in binary classification is defined as the ratio of the correctly classified samples to the total number of samples: <inline-formula><tex-math>ACC = \frac{TP + TN}{TP + TN + FP + FN}</tex-math></inline-formula>. Accuracy puts the same emphasis on all of these factors. However, when categorising pairs, there is usually a bias: it is much easier to identify true negatives correctly, due to their large number. So, we also look at the precision <inline-formula><tex-math>P = \frac{TP}{TP + FP}</tex-math></inline-formula> and the recall <inline-formula><tex-math>R = \frac{TP}{TP + FN}</tex-math></inline-formula>.</p>
        </sec>
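        <p>The accuracy, precision and recall definitions in this section translate directly into code. The sketch below is a minimal illustration; the counts are hypothetical values for a single 920-post test fold (500 CSAM, 420 non-CSAM) and are not results from the paper. Returning 0.0 when a denominator is zero is an assumed convention.</p>
        <preformat>
```python
def accuracy(tp, fp, tn, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive (0.0 if none predicted)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of true positives that were found (0.0 if there are no positives)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for one test fold; not figures reported in the paper.
tp, fn = 445, 55
fp, tn = 40, 380
print(round(accuracy(tp, fp, tn, fn), 3),
      round(precision(tp, fp), 3),
      round(recall(tp, fn), 3))  # 0.897 0.918 0.89
```
        </preformat>
        <p>A classifier tuned for removal of content would favour recall, while one feeding investigations would favour precision, which is why both are reported alongside accuracy.</p>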
      </sec>
      <sec id="sec-4-3">
        <title>4.2. Results</title>
        <p>Using the 5-fold cross-validation methodology, each experimental round includes a training set of 3,680 posts (2,000 CSAM and 1,680 non-CSAM) and a testing set of 920 posts (500 CSAM and 420 non-CSAM). Table 1 presents the average training time, average prediction time, average precision, average recall and average accuracy of the four algorithm combinations as follows:</p>
        <p>• NB: The training time and prediction time were 0.5 and 0.001 seconds, respectively. The precision was 76.1%, recall was 89% and accuracy was 78.8%.</p>
        <p>The NB combination has the fastest execution time, taking only about 1 microsecond to classify a post on our laptop (0.001 seconds for the 920 test posts). The second-fastest algorithm is SVM, which takes about 0.3 milliseconds per post. These fast prediction times make our models well-suited for processing CSA text in real-time on social networks. Additionally, our models can run on edge devices with limited computational resources and power supply.</p>
        <p>In terms of classification precision, the SVM combination performs the best with 92.3%, followed by LSTM and BERT as the second and third-best performers, respectively. Meanwhile, the NB combination has the highest recall rate of 89%, followed by LSTM as the second-best performer. When it comes to accuracy, SVM is the best with 87.6%, which is slightly higher than LSTM with 87.1%.</p>
        <p>The BERT algorithm has long training and prediction times, and it is not suitable for binary classification of CSAM posts on the dark web.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>We proposed and implemented a novel algorithm based on machine learning and natural language processing to automatically detect and classify CSAM text post content on the dark web. In the experimental evaluation on the dataset of 4,600 CSAM and non-CSAM posts with 5-fold cross-validation, the NB combination performed the best in terms of classification recall and execution time. On the other hand, the SVM combination performed the best in terms of classification precision and accuracy, and was the second-best in execution time. The choice between NB and SVM depends on the specific goals and requirements of the CSAM classification task. NB maximizes the number of true positives, which could be useful for identifying and removing CSAM posts from online platforms to protect potential victims. On the other hand, SVM minimizes false positives, which could be useful for identifying CSAM posts to extract information about potential predators and victims for investigative purposes.</p>
      <p>As part of our future work, functional APIs will be implemented to create a user-friendly web application. Furthermore, we aim to leverage the metadata associated with CSAM posts to identify the characteristics, conversations and behaviours of perpetrators. This information can be valuable in developing more effective models for preventing and addressing CSA text on social media platforms. We will also recognise named entities in CSA text to supply important concepts for ML models [26].</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The paper is an extension of the long abstract [27], being part of the N-Light project, which is funded by the Safe Online Initiative of End Violence and the Tech Coalition through the Tech Coalition Safe Online Research Fund (Grant number: 21-EVAC-0008-Technological University Dublin). Dr. Vuong Ngo conducted the research while serving as a data scientist at TU Dublin.</p>
      <sec id="sec-6-1">
        <title>References</title>
        <p>
[...] M. Liechti, V. Lenders, Blackwidow: Monitoring the dark web for cyber security information, in: the 11th International Conference on Cyber Conflict (CyCon), volume 900, 2019, pp. 1–21. doi:10.23919/CYCON.2019.8756845.</p>
        <p>[15] E. Kokolaki, E. Daskalaki, K. Psaroudaki, M. Christodoulaki, P. Fragopoulou, Investigating the dynamics of illegal online activity: The power of reporting, dark web, and related legislation, Computer Law &amp; Security Review 38 (2020) 105440. doi:10.1016/j.clsr.2020.105440.</p>
        <p>[16] S. Nazah, S. Huda, J. H. Abawajy, M. M. Hassan, An unsupervised model for identifying and characterizing dark web forums, IEEE Access 9 (2021) 112871–112892. doi:10.1109/ACCESS.2021.3103319.</p>
        <p>[17] J. Woodhams, J. A. Kloess, B. Jose, C. E. Hamilton-Giachritsis, Characteristics and behaviors of anonymous users of dark web platforms suspected of child sexual offenses, Frontiers in Psychology 12 (2021) 623668. doi:10.3389/fpsyg.2021.623668.</p>
        <p>[18] T. Tran, L. Nguyen, V. Ngo, Machine learning based english sentiment analysis, Journal of Science and Technology 52(4D) (2014) 142–155. doi:https://doi.org/10.48550/arXiv.1905.06643.</p>
        <p>[19] S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (4th Edition), Pearson, 2022.</p>
        <p>[20] Scikit-learn, Multinomial naive bayes, https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB, 2023. Version 1.2.2, accessed April 01, 2023.</p>
        <p>[21] Scikit-learn, C-support vector classification, https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html, 2023. Version 1.2.2, accessed April 01, 2023.</p>
        <p>[22] Keras, Long short term memory, https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM, 2023. Version 1.1.2, accessed April 01, 2023.</p>
        <p>[23] Keras, Text classification with bert, https://www.tensorflow.org/text/tutorials/classify_text_with_bert, 2023. Version 1.1.2, accessed April 01, 2023.</p>
        <p>[24] V. M. Ngo, T. V. T. Duong, T. B. T. Nguyen, P. T. Nguyen, O. Conlan, An efficient classification algorithm for traditional textile patterns from different cultures based on structures, Journal on Computing and Cultural Heritage (JOCCH) 14(4) (2021) 1–22. doi:https://doi.org/10.1145/3465381.</p>
        <p>[25] A. Tharwat, Classification assessment methods, Applied Computing and Informatics 17(1) (2021) 168–192. doi:https://doi.org/10.1016/j.aci.2018.08.003.</p>
        <p>[26] V. M. Ngo, G. Munnelly, F. Orlandi, P. Crooks, D. O'Sullivan, O. Conlan, A semantic search engine for historical handwritten document images, in: G. Berget, M. M. Hall, D. Brenn, S. Kumpulainen</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          (Eds.),
          <source>Linking Theory and Practice of Digital Libraries, LNCS</source>
          , vol.
          <volume>12866</volume>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          doi:https://doi.org/10.1007/978-3-030-86324-1_7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mckeever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thorpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Determining child sexual abuse posts based on artificial intelligence</article-title>
          , in: the 2023 International Society for the Prevention of Child Abuse &amp; Neglect Congress (ISPCAN-2023), Edinburgh, Scotland, UK, September 24-27,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          doi:https://doi.org/10.21427/S3GQ-3536.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>