<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Clemens Heistracher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franck Mignet</string-name>
          <email>franck.mignet@nl.thalesgroup.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Schlarb</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Austrian Institute of Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Over the past decade, the darknet has created unprecedented opportunities for trafficking in illicit goods, such as weapons and drugs, and it has provided new ways to offer crime as a service. Natural language processing techniques can be applied to find the types of goods that are traded in these markets. In this paper we present the results of evaluating state-of-the-art machine learning methods for the classification of darknet market offers. Several embeddings, such as GloVe embeddings [20], Fasttext [15], TensorFlow Universal Sentence Encoder [7], Flair's contextual string embedding [2] and term frequency-inverse document frequency (TF-IDF), as well as our domain-specific darknet embedding, have been evaluated with a series of machine learning models, such as Random Forest, SVM, Naïve Bayes and Multilayer Perceptron. To find the best combination of feature set and machine learning model for this task, the performance was evaluated on a publicly available collection covering 13 darknet markets with more than 10 million product offers [6]. After extracting unique advertisements from the corpus, the classifier was trained on a subset with those advertisements that contain strings related to weapons. The purpose was to determine how well the classifier can distinguish between different types of advertisements which all seem to be related to weapons according to the keywords they contain. The best performance for this classification task was achieved using the Linear Support Vector Machine model with the TensorFlow Universal Sentence Encoder for feature extraction, resulting in a micro-f1-score of 96%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Darknet markets (DNMs) provide a largely anonymous platform for the trade in
illegal goods and services, and drugs represent a large part of the product range
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Illicit drug trafficking in DNMs is a very dynamic area, as marketplaces
are constantly emerging and sometimes disappearing after a short time. For this
reason, police authorities and organizations active in the field of preventing and
combating organized crime need techniques which allow them to quickly analyse
collected DNM web data and extract information in a cost-effective and efficient
manner.
      </p>
      <p>Classification is one of the common tasks of text mining and natural language
processing (NLP). It is used to divide a set of texts into groups according to
predefined labels, and this technique can also be applied to automatically classify
product listings of DNMs. Since it is a supervised machine learning technique,
ground truth training data is needed to build the classifier. Many darknet market
websites provide menus which allow their customers to navigate between product
categories. In principle, these categories could serve as product labels. However,
first, this categorization was created by the vendor and is therefore not
necessarily trustworthy. Second, most markets do not indicate a product category and
would therefore be excluded from the analysis. Third, some markets and vendors
use false categories to obscure what is being offered. Finally, the categories vary
between different DNMs, which makes it difficult to create a consistent cross-DNM
overview of the product offers available. To overcome these difficulties, we have
created a manually annotated dataset that was obtained from thirteen DNMs and
which therefore contains consistent labels across a wide range of examples.</p>
      <p>
        In principle, classification by keywords is a simple and efficient way to
characterize texts. However, the selection of the right keywords requires a profound
understanding of the domain, which is labor intensive and hard to obtain in a
hidden society. Furthermore, words are often used differently in the darknet and
homonyms are frequent. For example, fruits, weapons, popular brands or celebrity
names are sometimes used as brands for drugs and as code words. A string search
for “ak47” results in many listings related to drugs, some literature on the weapon
and a few offers for a weapon in the Gwern dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This illustrates the
difficulty of a categorization by keyword. Additionally, a good classifier is robust
against the modification of single words, because the whole document is used for the
classification. Our dataset was filtered using a keyword-based search for strings
related to firearms and therefore only contains product offers that appear to be
related to firearms. Therefore, our task can be seen as supervised multi-label text
classification on the results of a keyword-based search.
      </p>
      <p>In the following, we present state-of-the-art word and document embeddings.
Then, in section 3, we briefly discuss the literature on the classification of product offers
on illicit online markets, before describing the dataset we used for our experiments
in section 4. Finally, we cover the experimental setup in section 5 and our results
in section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <p>
        Recent advances in NLP have shown the benefit of pre-trained word embeddings
for various tasks such as named entity recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], sentiment analysis [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and
acronym disambiguation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Word embeddings are vector representations for
single words in a continuous lower-dimensional space (lower than the number of
unique tokens); they can carry semantic and syntactic relationships between words
and therefore boost classification results [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. They are usually trained on very large
corpora of unlabeled data and can assist learning and generalisation. Document
embeddings are the extension of word embeddings for sentences and short text
documents [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>For our experiments, which are described in section 5, we calculate vector
representations for documents using the following techniques:</p>
      <p>
        Term frequency-inverse document frequency (TF-IDF) serves as a simple
benchmark against advanced document embeddings. TF-IDF relies on the assumption
that the occurrence of a frequent word adds little information about a document
compared to the occurrence of an infrequent word. The frequency of the word “is”
carries little information about a document, whereas the presence of a word like
‘perceptron’ usually indicates that the topic of the document is related to machine
learning, as ‘perceptron’ is a technical term only used in machine learning. TF-IDF
is the normalized number of occurrences of a word in a document, weighted by the
number of its occurrences in the whole corpus. The TF-IDF representation of a
document consists of the TF-IDF values for all words in the document [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        The GloVe [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] word embedding is based on the co-occurrence of words in the
training corpus. We used the pre-trained model for the word embedding
‘glove-wiki-gigaword-100’, which was trained on the English Wikipedia and a news dataset.
The sum of all word embeddings is used as one document embedding.
      </p>
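      <p>The summation step can be sketched as follows (a minimal illustration; the invented three-dimensional embedding table stands in for the 100-dimensional pre-trained GloVe vectors):</p>
      <preformat>
```python
# Sketch of the document-embedding step: the document vector is the sum of
# the word vectors of its tokens. The embedding table here is a toy stand-in.
import numpy as np

embedding = {
    "glock": np.array([0.2, 0.1, 0.7]),
    "9mm":   np.array([0.3, 0.0, 0.6]),
    "ammo":  np.array([0.1, 0.4, 0.5]),
}

def document_vector(text, emb, dim=3):
    # Sum the vectors of all known tokens; out-of-vocabulary tokens are skipped.
    vec = np.zeros(dim)
    for token in text.lower().split():
        if token in emb:
            vec += emb[token]
    return vec

doc = document_vector("Glock 9mm ammo", embedding)
```
      </preformat>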
      <p>
        The contextual string embedding Flair [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tackles the problem of polysemy and
homonymy in word vectors. Flair vectors depend on their context, and words with
multiple meanings can have different representations for each of them. For our
experiments we used a combination of the pre-trained embeddings ‘news-forward’
and ‘news-backward’.
      </p>
      <p>
        Fasttext [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is an extension to Word2Vec that learns embeddings for character
n-grams and therefore shows better performance for rare and unknown words, as
parts of a word still might be known to the model. We used the pre-trained
“wiki-news-300d-1M” embedding containing 300-dimensional word vectors for one million
words.
      </p>
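      <p>The character n-gram idea can be illustrated as follows (a toy sketch; the "^" and "$" symbols stand in for the word-boundary markers fastText adds around each word):</p>
      <preformat>
```python
# Sketch of fastText-style character n-grams: a word is represented by its
# character n-grams, so fragments of an unseen word can still be known to
# the model. "^" and "$" stand in for fastText's word-boundary markers.
def char_ngrams(word, n=3):
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("glock")  # ['^gl', 'glo', 'loc', 'ock', 'ck$']
```
      </preformat>
      <p>A rare variant such as ‘glocks’ still shares most of these n-grams with ‘glock’, which is why subword models handle rare and misspelled words better than whole-word embeddings.</p>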
      <p>
        Pre-trained embeddings are usually trained on texts taken from a non-criminal
context, e.g., Wikipedia or news articles. The language used in the darknet can
vary significantly from those sources. This is due to differences in
the content itself, genre (article versus product offer), style (formal versus colloquial)
and the use of domain-specific expressions. Therefore, the transferability of those
embeddings to the darknet domain can be limited. The benefit in generalisation
of a pre-trained model might be counteracted by the different usage of words in the
darknet. To answer the question whether a domain-specific embedding trained
on the darknet outperforms a general embedding that was trained on clear net
data, we trained a fasttext darknet embedding [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] on the full Gwern dataset.
In theory, this embedding contains darknet-specific information and might boost
performance for product offers where very specific language is used.
      </p>
      <p>
        The TensorFlow Universal Sentence Encoder uses a deep averaging network
to calculate feature vectors of dimension 512. It was trained on NLP tasks in eight
languages [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>We selected our document embeddings to represent three categories:
state-of-the-art methods that use a bag-of-words approach and therefore do not take the
order of words into account (GloVe and FastText), embeddings that depend on
the order of words (Flair, Universal Sentence Encoder), and a simple statistical
benchmark (TF-IDF).</p>
    </sec>
    <sec id="sec-4">
      <title>3. Related work</title>
      <p>
        Since Christin [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed in 2012 that Silk Road mostly sold drugs, several
attempts to classify products on DNMs have been made. Most publications use
Bag of Words (BOW) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or TF-IDF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to vectorize texts in combination with
Support Vector Machines, Logistic Regression and Naive Bayes as machine learning
models. Feature reduction is often performed using principal component analysis
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and latent Dirichlet allocation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Automated keyword extraction for product categories was discussed by Ghosh
et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], who proposed a method using differences in term-frequency per category
to identify keywords. Further, differences in word embeddings between legitimate and
illicit sources are used by Yuan et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to detect keywords and decipher code
words.
      </p>
      <p>
        More recently, LSTMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and word embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have been used for the
task of text classification on DNMs.
      </p>
      <p>Our contribution is the comparison of multiple state-of-the-art document
embeddings using a range of established classifiers on a darknet dataset. We evaluate
whether pre-trained embeddings improve the classification performance over
simple vector space models, such as TF-IDF, when transferred to the darknet
domain. Further, we test our models on a dataset which contains similar product
descriptions aggregated by a keyword-based search.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Dataset and Exploratory Analysis</title>
      <p>
        Our dataset is based on a subset of the Darknet Market Archives [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] called Grams,
which contains crawls of thirteen darknet markets in the period from 09.06.2014
to 12.07.2015, containing approximately 10 million product ofers. Filtering for
unique product descriptions resulted in 226 661 datapoints.
      </p>
      <p>
        Since we are interested in a dataset that is related to firearms, the final dataset
for our experiment contains only product descriptions with at least one occurrence
of a keyword related to firearm names. The list of firearm keywords was extracted
from a publicly available dataset [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and the Grams dataset. The number of offers
related to firearms is 1590. The subset was then manually annotated. Table 1
shows the categories and the number of documents assigned to each. Categories with
fewer than 100 datapoints were excluded from our experiments. For details on the
composition of the dataset please contact the authors.
      </p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Categories and the number of offers assigned to each.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Category</th>
              <th>Number of offers assigned</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Drug</td><td>977</td></tr>
            <tr><td>Weapon</td><td>284</td></tr>
            <tr><td>Book</td><td>153</td></tr>
            <tr><td>Crime as a Service (CaaS)</td><td>116</td></tr>
            <tr><td>Excluded</td><td>60</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Setup</title>
      <p>
        In our experiments we examine combinations of machine learning classifiers and
document embeddings for a text classification task in the darknet domain. We have
selected a range of commonly used models for classification tasks. We use the
scikit-learn implementations for RandomForestClassifier, LinearSVC, GaussianNB,
LogisticRegression (LR), DecisionTreeClassifier, AdaBoostClassifier,
KNeighborsClassifier and MLPClassifier, using default parameters [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
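      <p>This model line-up can be sketched as follows (assuming current scikit-learn import paths; not the authors' code):</p>
      <preformat>
```python
# Sketch of the classifier line-up described above, all instantiated with
# scikit-learn default parameters, as stated in the paper.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "RandomForest": RandomForestClassifier(),
    "LinearSVC": LinearSVC(),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "KNeighbors": KNeighborsClassifier(),
    "MLP": MLPClassifier(),
}
```
      </preformat>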
      <p>We use our dataset to train and evaluate classifiers on the task of assigning a
label to new product offers on DNMs. To find the best combination of feature set
and machine learning model for this task, several embeddings, GloVe, Fasttext,
TensorFlow Universal Sentence Encoder, Flair’s contextual string embedding and
our darknet embedding, have been used.</p>
      <p>After extracting unique advertisements from the corpus, the classifier was
trained using the subset with those advertisements that contain strings related to
weapons. The purpose was to determine how well the classifier can distinguish
between different types of advertisements which all seem to be related to weapons
according to the keywords they contain. For our experiments we train all models
with all embeddings. We use a fourfold cross-validation that preserves the
percentage of samples for each class, consistently across all experiments. The model’s
score is the average of the scores for each fold in the cross-validation.</p>
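      <p>The evaluation protocol can be sketched as follows, with synthetic stand-in features and an imbalanced label distribution similar in spirit to the dataset (scikit-learn's StratifiedKFold preserves the class percentages in each fold):</p>
      <preformat>
```python
# Sketch of stratified fourfold cross-validation: fold splits preserve the
# class proportions, and the model score is the average over the folds.
# Features and labels here are synthetic stand-ins, not the paper's data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # stand-in feature vectors
y = np.repeat([0, 1, 2, 3], [120, 40, 25, 15])  # imbalanced class counts

cv = StratifiedKFold(n_splits=4)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1_micro")
mean_score = scores.mean()                      # the reported model score
```
      </preformat>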
      <p>
        In binary classification, precision is the number of true positives over all
predicted positives and recall is the number of true positives over all actual
positives. The f1-score is the harmonic mean of precision and recall. For
multiclass classification, the micro f1-score is calculated globally by counting the total
true positives, false negatives and false positives, while for the macro f1-score, the
standard f1-score is calculated for each class and the scores for all classes are
averaged without weighting them. To evaluate the performance of our models, we
report the micro f1-score as well as the macro f1-score, to measure the overall
performance, but also to take the performance for classes with fewer samples into account. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
      </p>
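      <p>The difference between the two averages can be seen on a small imbalanced toy example:</p>
      <preformat>
```python
# Toy illustration of micro- vs. macro-averaged f1: micro counts all
# predictions globally, while macro averages the per-class f1-scores without
# weighting, so the rare class pulls the macro score down.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced: six majority, two minority
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]  # one minority sample misclassified

micro = f1_score(y_true, y_pred, average="micro")  # 0.875
macro = f1_score(y_true, y_pred, average="macro")  # lower, about 0.795
```
      </preformat>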
    </sec>
    <sec id="sec-7">
      <title>6. Results</title>
      <p>We present the macro f1-score and the micro f1-score for each combination of
embedding and classifier in Figure 1. The overall best performance was achieved
with the TensorFlow Universal Sentence Encoder and a linear SVM, resulting in
a macro f1-score of 0.93 and a micro f1-score of 0.96. Analysis of the best results
per classifier shows that the TensorFlow Universal Sentence Encoder performs best
for five out of eight classifiers. The simple TF-IDF performs better than the others
for Decision Tree and Gaussian Naive Bayes. Only the AdaBoostClassifier works best
with our darknet embedding trained with fasttext. Further, the comparison of
classifiers shows that Linear SVC performs best for five out of six embeddings. The
good performance of SVMs is expected, as they are the prevalent model in previous
works. Overall, the TensorFlow Universal Sentence Encoder appears to generate the
best features for our task. However, TF-IDF ranks second with a simple
and lightweight implementation that does not require pre-training on huge datasets.
The comparison of micro and macro scores indicates that the performance across all
classes is balanced for the TensorFlow Universal Sentence Encoder.</p>
      <p>The comparison of the pre-trained fasttext embedding and our darknet
embedding shows similar performance, as each of the embeddings outperforms the other
in four cases. The best score for the darknet embedding is 0.03 lower than
the best score for the pre-trained embedding, and therefore no benefit over the
pre-trained embedding could be shown for the darknet embedding.</p>
      <p>Further, we present a detailed analysis of the best combination, which is linear
SVM with Tensorflow Universal Sentence Encoder. To achieve more significant
results, we reduce the proportion of training data to 25%. The training dataset
contains a total of 382 samples, with 242, 69, 36, 35 texts for Drug, Weapon, Book
and CaaS respectively. The now larger test set contains a total of 1148 samples,
with 735, 215, 117, 81 texts for Drug, Weapon, Book and CaaS respectively. The
predictions for this experiment are shown in a confusion matrix (Figure 2).</p>
      <p>We show the metrics for each class in Table 2. It can be seen that precision as
well as recall achieve almost perfect scores for all four classes. The class "Drug"
performed best with an f1-score of 0.99, whereas "Book" achieved the lowest score
with 0.88. Further, the micro and macro averages and the counts per group (Support)
are listed.</p>
      <p>(Table 2: Precision, Recall, F1-score and Support per class.)</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>In this paper, we evaluated state-of-the-art text embeddings for classification tasks
in the darknet domain using multiple classifiers. To show the benefits of text
classification compared to a keyword-based search, we have trained the classifier
on the results of a keyword-based search. We used a subset of the Grams crawl in
Gwern’s archive that contains strings related to weapons, and we showed that a text
classifier is able to correctly determine labels with an overall accuracy of 97%. The best
results are achieved with features generated by the TensorFlow Universal Sentence
Encoder using SVMs. However, other state-of-the-art embeddings do not beat the
established TF-IDF vectorization in this task.</p>
      <p>Acknowledgements. The research described in this paper was carried out as
part of the COPKIT project, which has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant agreement No. 786687.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Adamsson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , MA thesis, Uppsala University, Department of Information Technology,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
          </string-name>
          , R.:
          <article-title>Contextual String Embeddings for Sequence Labeling</article-title>
          ,
          <source>in: COLING</source>
          <year>2018</year>
          , 27th International Conference on Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Al Nabki</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidalgo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alegre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paz</surname>
            ,
            <given-names>I. de</given-names>
          </string-name>
          :
          <article-title>Classifying illegal activities on TOR network based on web textual contents</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Armona</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stackman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : Learning Darknet Markets, Federal Reserve Bank of New York mimeo (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics 5</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Branwen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Décary-Hétu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <source>Dark Net Market archives, 2011-2015</source>
          , dataset, url: https://www.gwern.net/DNM-archives, accessed: 2019-01-23,
          <year>July 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Chidambaram</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model</article-title>
          ,
          <source>in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>250</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Choshen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eldad</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hershcovich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulem</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abend</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The Language of Legal and Illegal Activity on the Darknet</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4271</fpage>
          -
          <lpage>4279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Christin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace</article-title>
          ,
          <source>in: Proceedings of the 22nd international conference on World Wide Web</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          :
          <article-title>Document embedding with paragraph vectors</article-title>
          ,
          <source>arXiv preprint arXiv:1507.07998</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Europol/EMCDDA: Drugs and the darknet</article-title>
          .
          <article-title>Perspectives for enforcement, research and policy</article-title>
          ,
          <source>tech. rep., Europol, European Monitoring Centre for Drugs and Drug Addiction (EMCDDA)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porras</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yegneswaran</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nitz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ATOL: A framework for automated analysis and categorization of the Darkweb Ecosystem</article-title>
          ,
          <source>in: Workshops at the Thirty-First AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Graczyk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinningham</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automatic product categorization for anonymous marketplaces</article-title>
          ,
          <source>tech. rep.</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K. S.</given-names>
          </string-name>
          :
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          ,
          <source>Journal of documentation</source>
          (
          <year>1972</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>É.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of Tricks for Efficient Text Classification</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Acronym disambiguation using word embedding</article-title>
          ,
          <source>in: Twenty-Ninth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>T. K.</given-names>
          </string-name>
          :
          <article-title>A machine learning approach for the detection and characterization of illicit drug dealers on instagram: model evaluation study</article-title>
          ,
          <source>Journal of Medical Internet Research</source>
          <volume>21</volume>
          .
          <issue>6</issue>
          (
          <year>2019</year>
          ), e13803
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Lilleberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Support vector machines and word2vec for text classification with semantic features</article-title>
          ,
          <source>in: 2015 IEEE 14th International Conference on Cognitive Informatics &amp; Cognitive Computing (ICCI*CC)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ), pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          ,
          <source>in: Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , url: http://www.aclweb.org/anthology/D14-1162.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Improving twitter sentiment classification using topic-enriched multi-prototype word embeddings</article-title>
          ,
          <source>in: Thirtieth AAAI conference on artificial intelligence</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Canuto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neto</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          :
          <article-title>A comparative analysis of classification methods to multi-label tasks in different application domains</article-title>
          ,
          <source>Int. J. Comput. Inform. Syst. Indust. Manag. Appl</source>
          <volume>3</volume>
          (
          <year>2011</year>
          ), pp.
          <fpage>218</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Tasneem</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Semi-Automatic Weapons Without A Background Check Can Be Just A Click Away</article-title>
          , https://www.npr.org/sections/alltechconsidered/2016/06/17/482483537/semi-automatic-weapons-without-a-background-check-can-be-just-a-clickaway, Accessed: 2019-02-20,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Reading Thieves' Cant: Automatically Identifying and Understanding Dark Jargons from Cybercrime Marketplaces</article-title>
          ,
          <source>in: 27th USENIX Security Symposium (USENIX Security 18)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1027</fpage>
          -
          <lpage>1041</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>