Proceedings of the 11th International Conference on Applied Informatics, Eger, Hungary, January 29–31, 2020, published at http://ceur-ws.org

Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces

Clemens Heistracher (a), Franck Mignet (b), Sven Schlarb (a)
(a) Austrian Institute of Technology, name.surname@ait.ac.at
(b) Thales Research & Technology Netherlands, franck.mignet@nl.thalesgroup.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Over the past decade, the darknet has created unprecedented opportunities for trafficking in illicit goods, such as weapons and drugs, and it has provided new ways to offer crime as a service. Natural language processing techniques can be applied to find the types of goods that are traded in these markets. In this paper we present the results of evaluating state-of-the-art machine learning methods for the classification of darknet market offers. Several embeddings, such as GloVe embeddings [20], fastText [15], the TensorFlow Universal Sentence Encoder [7], Flair's contextual string embedding [2] and term frequency-inverse document frequency (TF-IDF), as well as our domain-specific darknet embedding, have been evaluated with a series of machine learning models, such as Random Forest, SVM, Naïve Bayes and Multilayer Perceptron. To find the best combination of feature set and machine learning model for this task, the performance was evaluated on a publicly available collection covering 13 darknet markets with more than 10 million product offers [6]. After extracting unique advertisements from the corpus, the classifier was trained on a subset of those advertisements that contain strings related to weapons. The purpose was to determine how well the classifier can distinguish between different types of advertisements which all appear to be related to weapons according to the keywords they contain. The best performance for this classification task was achieved using the Linear Support Vector Machine model with the TensorFlow Universal Sentence Encoder for feature extraction, resulting in a micro F1-score of 96%.

Keywords: Natural language processing, machine learning, text classification, document embedding, darknet markets

1. Introduction

Darknet markets (DNMs) provide a largely anonymous platform for the trade in illegal goods and services, and drugs represent a large part of the product range [11]. Illicit drug trafficking in DNMs is a very dynamic area, as marketplaces are constantly emerging and sometimes disappearing after a short time. For this reason, police authorities and organizations active in the field of preventing and combating organized crime need techniques which allow them to quickly analyse collected DNM web data and extract information in a cost-effective and efficient manner.

Classification is one of the common tasks of text mining and natural language processing (NLP). It is used to divide a set of texts into groups according to predefined labels, and this technique can also be applied to automatically classify product listings of DNMs. Since it is a supervised machine learning technique, ground truth training data is needed to build the classifier. Many darknet market websites provide menus which allow their customers to navigate between product categories. In principle, these categories could serve as product labels.
However, first, this categorization was created by the vendor and is therefore not necessarily trustworthy. Second, most markets do not indicate a product category and would therefore be excluded from the analysis. Third, some markets and vendors use false categories to obscure what is being offered. Finally, the categories vary between different DNMs, which makes it difficult to create a consistent cross-DNM overview of the product offers available. To overcome these difficulties, we have created a manually annotated dataset that was obtained from thirteen DNMs and therefore contains consistent labels across a wide range of examples.

In principle, classification by keywords is a simple and efficient way to characterize texts. However, the selection of the right keywords requires a profound understanding of the domain, which is labor intensive and hard to obtain in a hidden society. Furthermore, words are often used differently in the darknet and homonyms are frequent. For example, fruits, weapons, popular brands or celebrity names are sometimes used as brands for drugs and as code words. A string search for "ak47" results in many listings related to drugs, some literature on the weapon and a few offers for a weapon in the Gwern dataset [6]. This illustrates the difficulty of a categorization by keyword. Additionally, a good classifier is robust against the modification of single words because the whole document is used for the classification. Our dataset was filtered using a keyword-based search for strings related to firearms and therefore only contains product offers that appear to be related to firearms. Our task can thus be seen as supervised multi-class text classification on the results of a keyword-based search.

In the following, we present state-of-the-art word and document embeddings. Then, in section 3, we briefly discuss the literature on the classification of product offers on illicit online markets, before describing the dataset we used for our experiments in section 4. Finally, we cover the experimental setup in section 5 and our results in section 6.

2. Background

Recent advances in NLP have shown the benefit of pre-trained word embeddings for various tasks such as named entity recognition [2], sentiment analysis [21] and acronym disambiguation [16]. Word embeddings are vector representations for single words in a continuous lower-dimensional space (lower than the number of unique tokens); they can carry semantic and syntactic relationships between words and therefore boost classification results [18]. They are usually trained on very large corpora of unlabeled data and can assist learning and generalisation. Document embeddings are the extension of word embeddings to sentences and short text documents [10]. For our experiments, which are described in section 5, we calculate vector representations for documents using the following techniques:

Term frequency-inverse document frequency (TF-IDF) serves as a simple benchmark against the more advanced document embeddings. TF-IDF relies on the assumption that the occurrence of a frequent word adds little information about a document compared to the occurrence of an infrequent word. The frequency of the word "is" carries little information about a document, whereas the presence of a word like "perceptron" usually indicates that the topic of the document is related to machine learning, as "perceptron" is a technical term only used in that field. TF-IDF is the normalized number of occurrences of a word in a document, weighted by the inverse of the number of documents in the corpus in which the word occurs. The TF-IDF representation of a document consists of the TF-IDF values for all words in the document [14].
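To make this concrete, here is a minimal sketch (our illustration, not part of the paper's experiments) of how TF-IDF document vectors can be computed with scikit-learn, the library used for the classifiers in section 5; the toy documents are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy product descriptions; not taken from the Grams dataset.
docs = [
    "ak47 pistol for sale worldwide shipping",
    "ak47 strain top quality cannabis stealth shipping",
    "ebook on firearm maintenance instant download",
]

# Each document becomes a sparse vector of TF-IDF weights: words that are
# frequent in a document but rare across the corpus receive the highest weights.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, vocabulary size)
print(vectorizer.get_feature_names_out())
```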
The GloVe [20] word embedding is based on the co-occurrence of words in the training corpus. We used the pre-trained word embedding model 'glove-wiki-gigaword-100', which was trained on the English Wikipedia and a news dataset. The sum of all word embeddings of a document is used as its document embedding.

The contextual string embedding Flair [2] tackles the problem of polysemy and homonyms in word vectors. Flair vectors depend on their context, and words with multiple meanings can have a different representation for each of them. For our experiments we used a combination of the pre-trained embeddings 'news-forward' and 'news-backwards'.

fastText [15] is an extension of Word2Vec that learns embeddings for character n-grams and therefore shows better performance for rare and unknown words, as parts of a word might still be known to the model. We used the pre-trained "wiki-news-300d-1M" embedding containing 300-dimensional word vectors for one million words.

Pre-trained embeddings are usually trained on texts taken from a non-criminal context, e.g., Wikipedia or news articles. The language used in the darknet can differ significantly from those sources. This is due to differences in the content itself, genre (article versus product offer), style (formal versus colloquial) and the use of domain-specific expressions. Therefore, the transferability of those embeddings to the darknet domain can be limited, and the benefit in generalisation of a pre-trained model might be counteracted by the different usage of words in the darknet. To answer the question whether a domain-specific embedding trained on the darknet outperforms a general embedding trained on clearnet data, we trained a fastText darknet embedding [5] on the full Gwern dataset. In theory, this embedding contains darknet-specific information and might boost performance for product offers in which very specific language is used.

The TensorFlow Universal Sentence Encoder uses a deep averaging network to calculate feature vectors of dimension 512. It was trained on NLP tasks in eight languages [7].

We selected our document embeddings to represent three categories: state-of-the-art methods that use a bag-of-words approach and therefore do not take the order of words into account (GloVe and fastText), embeddings that depend on the order of words (Flair, Universal Sentence Encoder) and a simple statistical benchmark (TF-IDF).
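As an illustration of the summing scheme described above for GloVe, the following sketch (our own, assuming the pre-trained 'glove-wiki-gigaword-100' model is loaded through gensim's downloader) turns a product description into a single document vector:

```python
import numpy as np
import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (the model named above).
glove = api.load("glove-wiki-gigaword-100")

def document_embedding(text: str) -> np.ndarray:
    """Sum the word vectors of all tokens known to the model."""
    tokens = text.lower().split()
    vectors = [glove[token] for token in tokens if token in glove]
    if not vectors:
        return np.zeros(glove.vector_size)
    return np.sum(vectors, axis=0)

# Invented product description; not taken from the dataset.
vec = document_embedding("Glock 19 9mm two magazines discreet worldwide shipping")
print(vec.shape)  # (100,)
```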
3. Related work

Since Christin [9] showed in 2012 that Silk Road mostly catered to the drug trade, several attempts to classify products on DNMs have been made. Most publications use Bag of Words (BOW) [4] or TF-IDF [3] to vectorize texts, in combination with Support Vector Machines, Logistic Regression and Naive Bayes as machine learning models. Feature reduction is often performed using principal component analysis [13] and latent Dirichlet allocation [1]. Automated keyword extraction for product categories was discussed by Ghosh et al. [12], who proposed a method using differences in term frequency per category to identify keywords. Further, differences in word embeddings for legitimate and illicit sources are used by Yuan et al. [24] to detect keywords and decipher code words. More recently, LSTMs [17] and word embeddings [8] have been used for the task of text classification on DNMs.

Our contribution is the comparison of multiple state-of-the-art document embeddings using a range of established classifiers on a darknet dataset. We evaluate whether pre-trained embeddings improve the classification performance over simple vector space models, such as TF-IDF, when transferred to the darknet domain. Further, we test our models on a dataset which contains similar product descriptions aggregated by a keyword-based search.

4. Dataset and Exploratory Analysis

Our dataset is based on a subset of the Darknet Market Archives [6] called Grams, which contains crawls of thirteen darknet markets in the period from 09.06.2014 to 12.07.2015, with approximately 10 million product offers. Filtering for unique product descriptions resulted in 226 661 datapoints. Since we are interested in a dataset that is related to firearms, the final dataset for our experiment contains only product descriptions with at least one occurrence of a keyword related to firearm names. The list of firearm keywords was extracted from a publicly available dataset [23] and the Grams dataset. The number of offers related to firearms is 1590. The subset was then manually annotated. Table 1 shows the categories and the number of documents assigned to each of them. Categories with fewer than 100 datapoints were excluded from our experiments. For details on the composition of the dataset please contact the authors.

    Category                     Number of offers assigned
    Drug                         977
    Weapon                       284
    Book                         153
    Crime as a Service (CaaS)    116
    Excluded                     60

    Table 1: Number of offers per category

5. Experimental Setup

In our experiments we examine combinations of machine learning classifiers and document embeddings for a text classification task in the darknet domain. We have selected a range of commonly used models for classification tasks and use the scikit-learn implementations of RandomForestClassifier, LinearSVC, GaussianNB, LogisticRegression (LR), DecisionTreeClassifier, AdaBoostClassifier, KNeighborsClassifier and MLPClassifier with default parameters [19].

We use our dataset to train and evaluate classifiers on the task of assigning a label to new product offers on DNMs. To find the best combination of feature set and machine learning model for this task, several embeddings have been used: GloVe, fastText, the TensorFlow Universal Sentence Encoder, Flair's contextual string embedding and our darknet embedding. After extracting unique advertisements from the corpus, the classifier was trained using the subset of those advertisements that contain strings related to weapons. The purpose was to determine how well the classifier can distinguish between different types of advertisements which all appear to be related to weapons according to the keywords they contain. For our experiments we train all models with all embeddings. We use a fourfold cross-validation that preserves the percentage of samples for each class and is kept consistent for all experiments. The model's score is the average of the scores for each fold in the cross-validation.

In binary classification, precision is the number of true positives over all predicted positives, and recall is the number of true positives over all actual positives. The F1-score is the harmonic mean of precision and recall.
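Written out (standard definitions, with TP, FP and FN denoting the numbers of true positives, false positives and false negatives):

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```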
For multi-class classification, the micro F1-score is calculated globally by counting the total numbers of true positives, false negatives and false positives, while for the macro F1-score, the standard F1-score is calculated for each class and the scores of all classes are averaged without weighting them. To evaluate the performance of our models, we report the micro F1-score as well as the macro F1-score, in order to measure the overall performance but also to take the performance of classes with fewer samples into account [22].

6. Results

We present the macro F1-score and the micro F1-score for each combination of embedding and classifier in Figure 1. The overall best performance was achieved with the TensorFlow Universal Sentence Encoder and a linear SVM, resulting in a macro F1-score of 0.93 and a micro F1-score of 0.96.

    Figure 1: Text classification heatmap of the macro F1-score

Analysis of the best results per classifier shows that the TensorFlow Universal Sentence Encoder performs best for five out of eight classifiers. The simple TF-IDF performs better than the other feature sets for the Decision Tree and Gaussian Naive Bayes classifiers. Only the AdaBoostClassifier works best with our darknet embedding trained with fastText. Further, the comparison of classifiers shows that LinearSVC performs best for five out of six embeddings. The good performance of SVMs is expected, as they are the prevalent model in previous works. Overall, the TensorFlow Universal Sentence Encoder appears to generate the best features for our task. However, TF-IDF ranks second with a simple and lightweight implementation that does not require pre-training on huge datasets. The comparison of micro and macro scores indicates that the performance across all classes is balanced for the TensorFlow Universal Sentence Encoder.

The comparison of the pre-trained fastText embedding and our darknet embedding shows similar performance, as each of the embeddings outperforms the other in four cases. The best score for the darknet embedding is 0.03 lower than the best score for the pre-trained embedding; therefore, no benefit of the darknet embedding over the pre-trained embedding could be shown.

Further, we present a detailed analysis of the best combination, which is the linear SVM with the TensorFlow Universal Sentence Encoder. To achieve more significant results, we reduce the proportion of training data to 25%. The training dataset contains a total of 382 samples, with 242, 69, 36 and 35 texts for Drug, Weapon, Book and CaaS respectively. The now larger test set contains a total of 1148 samples, with 735, 215, 117 and 81 texts for Drug, Weapon, Book and CaaS respectively. The predictions for this experiment are shown in a confusion matrix (Figure 2).

    Figure 2: Confusion matrix of the best embedding/classifier combination

We show the metrics for each class in Table 2. It can be seen that precision as well as recall achieve almost perfect scores over all four classes. The class "Drug" performed best with an F1-score of 0.99, whereas "Book" achieved the lowest score with 0.88. Further, the micro and macro averages and the counts per group (Support) are listed.

                 Book    CaaS    Drug    Weapon    micro avg    macro avg
    F1-score     0.88    0.94    0.99    0.95      0.97         0.94
    Precision    0.92    0.95    0.98    0.95      0.97         0.95
    Recall       0.84    0.94    1.00    0.95      0.97         0.93
    Support      117     81      735     215       1148         1148

    Table 2: Metrics per class and averaged metrics for the best embedding/classifier combination
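For orientation, a minimal sketch of this best-performing pipeline as we read it from sections 5 and 6 (our illustration: the TensorFlow Hub module URL, the toy texts and the labels are assumptions, not taken from the paper or the dataset):

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Universal Sentence Encoder from TensorFlow Hub (module URL is an assumption;
# the paper only states that the encoder of [7] was used for feature extraction).
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Invented product descriptions and labels; not taken from the Grams dataset.
texts = ["ak47 strain top shelf cannabis", "Glock 19 with two magazines",
         "ebook gun cleaning guide", "hacking service fast delivery"] * 30
labels = ["Drug", "Weapon", "Book", "CaaS"] * 30

# 512-dimensional document embeddings.
X = np.array(encoder(texts))
y = np.array(labels)

# Stratified four-fold cross-validation with a linear SVM, scored with micro F1,
# mirroring the setup described in section 5.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1_micro")
print(scores.mean())
```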
7. Conclusion

In this paper, we evaluated state-of-the-art text embeddings for classification tasks in the darknet domain using multiple classifiers. To show the benefits of text classification compared to a keyword-based search, we trained the classifier on the results of a keyword-based search. We used a subset of the Grams crawl in Gwern's archive that contains strings related to weapons, and we showed that a text classifier is able to correctly determine labels with an overall accuracy of 97%. The best results are achieved with features generated by the TensorFlow Universal Sentence Encoder using SVMs. However, the other state-of-the-art embeddings do not beat the established TF-IDF vectorization in this task.

Acknowledgements. The research described in this paper was carried out as part of the COPKIT project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 786687.

References

[1] Adamsson, H.: MA thesis, Uppsala University, Department of Information Technology, 2017.
[2] Akbik, A., Blythe, D., Vollgraf, R.: Contextual String Embeddings for Sequence Labeling, in: COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
[3] Al Nabki, M. W., Fidalgo, E., Alegre, E., Paz, I. de: Classifying illegal activities on TOR network based on web textual contents, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 35–43.
[4] Armona, L., Stackman, D.: Learning Darknet Markets, Federal Reserve Bank of New York mimeo (2014).
[5] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146.
[6] Branwen, G., Christin, N., Décary-Hétu, D., et al.: Dark Net Market archives, 2011–2015, dataset, July 2015, url: https://www.gwern.net/DNM-archives, Accessed: 2019-01-23.
[7] Chidambaram, M., Yang, Y., Cer, D., et al.: Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model, in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), 2019, pp. 250–259.
[8] Choshen, L., Eldad, D., Hershcovich, D., Sulem, E., Abend, O.: The Language of Legal and Illegal Activity on the Darknet, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4271–4279.
[9] Christin, N.: Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 213–224.
[10] Dai, A. M., Olah, C., Le, Q. V.: Document embedding with paragraph vectors, arXiv preprint arXiv:1507.07998 (2015).
[11] Europol/EMCDDA: Drugs and the darknet. Perspectives for enforcement, research and policy, tech. rep., Europol, European Monitoring Centre for Drugs and Drug Addiction (EMCDDA), 2017.
[12] Ghosh, S., Porras, P., Yegneswaran, V., Nitz, K., Das, A.: ATOL: A framework for automated analysis and categorization of the Darkweb Ecosystem, in: Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[13] Graczyk, M., Kinningham, K.: Automatic product categorization for anonymous marketplaces, tech. rep., 2015.
[14] Jones, K. S.: A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation (1972).
[15] Joulin, A., Grave, É., Bojanowski, P., Mikolov, T.: Bag of Tricks for Efficient Text Classification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 427–431.
[16] Li, C., Ji, L., Yan, J.: Acronym disambiguation using word embedding, in: Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[17] Li, J., Xu, Q., Shah, N., Mackey, T. K.: A machine learning approach for the detection and characterization of illicit drug dealers on Instagram: model evaluation study, Journal of Medical Internet Research 21.6 (2019), e13803.
[18] Lilleberg, J., Zhu, Y., Zhang, Y.: Support vector machines and word2vec for text classification with semantic features, in: 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), IEEE, 2015, pp. 136–140.
[19] Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[20] Pennington, J., Socher, R., Manning, C. D.: GloVe: Global Vectors for Word Representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543, url: http://www.aclweb.org/anthology/D14-1162.
[21] Ren, Y., Zhang, Y., Zhang, M., Ji, D.: Improving twitter sentiment classification using topic-enriched multi-prototype word embeddings, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[22] Santos, A., Canuto, A., Neto, A. F.: A comparative analysis of classification methods to multi-label tasks in different application domains, Int. J. Comput. Inform. Syst. Indust. Manag. Appl. 3 (2011), pp. 218–227.
[23] Tasneem, R.: Semi-Automatic Weapons Without A Background Check Can Be Just A Click Away, dataset, June 2016, url: https://www.npr.org/sections/alltechconsidered/2016/06/17/482483537/semi-automatic-weapons-without-a-background-check-can-be-just-a-click-away, Accessed: 2019-02-20.
[24] Yuan, K., Lu, H., Liao, X., Wang, X.: Reading Thieves' cant: automatically identifying and understanding dark jargons from cybercrime marketplaces, in: 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1027–1041.