<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Cybersecurity, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>C-OSINT: COVID-19 Open Source artificial INTelligence framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aria Nourbakhsh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fallucchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Innovation and Information Engineering, Guglielmo Marconi University</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>With the worldwide emergence of COVID-19, a market for products related to the disease formed across the Internet. While these goods were in short supply, many uncontrolled Dark Web Marketplaces (DWMs) were actively selling them. At the same time, Dark Web Forums (DWFs) became proxies for spreading false ideas and fake news about COVID-19 and for advertising products sold in DWMs. This study investigates the activities carried out in DWMs and DWFs to propose a learning-based model that distinguishes them from their counterparts on the surface web. To this end, we propose a COVID-19 Open Source artificial INTelligence framework (C-OSINT) to automatically collect and classify the activities carried out in DWMs and DWFs. Moreover, we incorporate linguistic and stylistic solutions to improve classification performance between the content found in DWMs and DWFs and two surface web sources. Our results show that syntactic and stylistic representations outperform Transformer-based models over these domains.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>COVID-19</kwd>
        <kwd>Dark Web</kwd>
        <kwd>Cyberspace</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        At the end of 2019, COVID-19, a respiratory disease, emerged and caused financial and health
crises around the world. Consequently, many countries and health organizations started to
respond to the pandemic. To reduce the mortality rate of the disease, many vaccines
were proposed, and the first batch was officially approved in late 2020. Vaccines from
Pfizer/BioNTech [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Moderna [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Sputnik [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] were among the most famous and utilized
brands. The unbalanced distribution of vaccine doses and the race to access the first dose soon
generated concerns about illegal trades of the vaccine. Europol and other national security
agencies reported the sale of fake COVID-19 vaccines on Dark Web Marketplaces (DWMs) in
December 2020 [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4, 5, 6, 7, 8</xref>
        ]. Monitoring DWMs is therefore critical to enable police and public
health agencies to be prepared and effectively counter these threats.
      </p>
      <p>
        Interpol and Europol said that DWMs had become proxies for online trafficking of masks,
COVID-19 tests, and alleged drugs constantly advertised on these platforms. A similar issue
happened with the use of vaccines and the start of vaccination campaigns [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. The matter
was exacerbated by the introduction of the green pass, a document that enables people to take part in
public activities such as using public transportation and visiting public spaces [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. At the same
time, Dark Web Forums (DWFs) have hosted a proliferation of arguments and
the spread of false information related to COVID-19. Linking these activities is not an easy
task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        DWFs are an entry point into illicit online activities, and DWMs can be easily accessed
through specialized browsers, such as Tor [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], I2P [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and FreeNet [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These browsers
guarantee users’ anonymity, and in turn, trades of many illegal goods such as drugs, firearms,
credit cards, and fake documents are being conducted in them [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The growing popularity
of Dark Web activities has attracted the interest of the scientific community and security
researchers, who have provided comparative analyses of the different DWMs [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21">17, 18, 19, 20, 21</xref>
        ] and
DWFs [
        <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
        ]. Among the most credited are studies that propose automatic recognition
and classification of activities and analysis of lexicon used in DWMs and DWFs [
        <xref ref-type="bibr" rid="ref25 ref26 ref27">25, 26, 27</xref>
        ].
According to numerous reports, law enforcement has successfully closed several illegal DWMs
[
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ]. Still, DWMs are inherently resilient to these interventions, and in 2020 COVID-19
disease provided another reason to analyze and classify the content produced in this particular
domain.
      </p>
      <p>In this study, we investigate the activities entertained in the DWMs and DWFs to propose a
learning-based model to recognize their contents compared to the data from the surface web.
To this end, we propose a COVID-19 Open Source artificial INTelligence framework (C-OSINT)
to automatically collect and classify the textual content created in DWMs and DWFs compared
to Reddit and Amazon. Our C-OSINT model consists of three parts: (1) the corpus extraction
system C-OSINT-e, which is used to extract data from DWMs and DWFs; (2) the cleaning
system C-OSINT-d, which cleans and pre-processes the extracted data to build the final corpus; (3)
the classification system C-OSINT-c, which classifies the ’.onion’ service to determine whether
the service is from a marketplace or forum (from Dark Web and surface web) by using the
HTML text of the pages and applying Natural Language Processing algorithms. The rest of
the paper is organized as follows. Section 2 describes state-of-the-art studies on Dark Web
activities and how to identify them with automatically generated heuristics. Section 3 describes
C-OSINT-e and C-OSINT-d, and Section 4 describes C-OSINT-c. Finally, in Section 5 we
present the results of the C-OSINT-c classifications and discuss them.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Since the 2000s, many have researched methods of classifying surface web content [
        <xref ref-type="bibr" rid="ref30 ref31 ref32">30, 31, 32</xref>
        ].
More recently, attempts have been published to classify the non-indexed part of the web, called the Deep Web
[
        <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
        ], and then the ancestor of today’s Dark Web (DW) [
        <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
        ].
      </p>
      <p>
        With the growth in popularity, the DW has become a research subject in many studies. Barratt
et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Aldridge et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] have done extensive investigations of customers of DWMs
taking the ’Silk Road’ phenomenon as a use case. Yang et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and Pete et al. [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], on the
other hand, addressed the social relationships formed by users of DWFs. These two lines of
analysis, while fundamental to understanding the dynamics of the social networks that are
created around the DW, have remained highly contextualized. One of the first works that shifted
the focus to automatic content classification was done by Biryukov et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. They classified
the content of the DW, restricting the study to Tor’s hidden services, which resulted in 18 topical
categories. Restricting the topic to drug trades, Graczyk et al. [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] combined unsupervised
feature selection and an SVM classifier to classify drug selling services.
      </p>
      <p>
        The first real distinction between activities, as selling services rather than forums, was
proposed in [
        <xref ref-type="bibr" rid="ref25 ref39">25, 39</xref>
        ]. They presented DUTA (Darknet Usage Text Addresses), the first publicly
available Darknet dataset, with a classification into topical categories and subcategories.
Avarikioti et al. [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], on the other hand, were the first to focus only on the classification of illegal and
legal activities: they built a new dataset and used an SVM classifier in an active learning
setting with a bag-of-words feature representation, obtaining very good results. Recently, Choshen
et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], following [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] and using the updated version of the publicly available DUTA [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ],
studied the style and structure of hidden illegal and legal services. Choshen et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] proposed
some excellent classifiers that are based on shallow heuristics and convert the input text
into part-of-speech (POS) tags. Their results were satisfactory, but the approach
discards much important information, such as sentence structure and basic semantics, and
converts several different symbols into a single symbol, ignoring many symbols peculiar
to the DW domain.
      </p>
      <p>
        In this paper, we propose an Open Source artificial INTelligence (OSINT) framework to
automatically collect and classify activities carried out in DWMs and DWFs on a newly emerging
topic, COVID-19, and to distinguish them from the same type of services on the surface web, namely Amazon and Reddit.
Since the start of vaccine production and the introduction of the mandatory green pass certificate,
some stores began to sell them [
        <xref ref-type="bibr" rid="ref10 ref11 ref4 ref9">4, 9, 10, 11</xref>
        ], which may pose a significant risk to public health.
Consequently, we propose a comparative analysis using two surface web platforms to show
that our framework can differentiate the domain where the activity is taking place.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>In this article, we aim to analyze the COVID-19 topic in the most popular Dark Web Marketplaces
(DWMs) and Dark Web Forums (DWFs) between 2020 and 2021 (see Appendix A.2, A.3), to create a
framework capable of: a) collecting information from ’.onion’ services, b) recognizing activities
in DWMs and DWFs for monitoring and warning of abuse. To address this need, we analyzed
the current methods to extract and classify the activities in subsection 3.1. To obtain data, we
propose our framework, which consists of: a crawler and scraper to collect the data (C-OSINT-e),
described in subsection 3.1.1; a pre-processor of the extracted text and a labeling step
(C-OSINT-d), described in subsection 3.1.2; and a set of classifiers based on machine learning
models (C-OSINT-c).</p>
      <sec id="sec-3-1">
        <title>3.1. DarkNet Dataset</title>
        <p>Obtaining and investigating data from the Dark Web is very complex due to the nature of the
service and obstacles such as text and image-based CAPTCHAs or the absence of public DNS.</p>
        <p>
          Current monitoring pipelines have the first objective of isolating suspicious domains from
normal ones and classifying them into categories. These components are based on keyword
heuristics, which are difficult to keep up to date and prone to false positives given the high
rate of polysemy. There are other heuristics based on machine learning, but they are highly
dependent on datasets [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. One of the first public datasets obtained from the Dark Web was the
first version of “Darknet Usage Text Addresses” (DUTA) [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Although an updated version,
DUTA-10K [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], has been released, the dataset is obsolete because many of its links are currently
down.
        </p>
        <p>
          In this research, our first contribution is the system C-OSINT-e, which extracts text from
’.onion’ services from DWMs and DWFs. Similar to the strategy proposed in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], C-OSINT-e is
based on an extraction step, cleaning phase, and finally, labeling of the extracted samples. From
the [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] report, it is possible to identify several DWMs that have COVID-19 related products
available. Similarly, it is possible to analyze some DWFs, as proposed in [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ]. Furthermore, to
perform a comparative analysis and have corpora from DWMs and DWFs at our disposal, we applied
the same process to two well-known surface web services: Reddit (https://www.reddit.com/) and Amazon (https://www.amazon.com/). These two
surface web services were chosen because Reddit is very similar in structure to DWFs, and
Amazon is the largest online store. A screenshot of DWFs is shown in Figure 2 in the appendix,
and some examples of the cleaned corpus can be seen in Table 1.
        </p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Examples from the cleaned corpus.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Sentence</th><th>Source</th></tr>
            </thead>
            <tbody>
              <tr><td>We provide COVID-19 vaccine, Green Pass, Fake Tests We ship Green Pass and QR code valid throughout Europe payment in BTC and immediate delivery.</td><td>Dark Web Marketplace</td></tr>
              <tr><td>Fake pandemic and vaccine speculation</td><td>Dark Web Forum</td></tr>
              <tr><td>The fake pandemic is caused by the Jews who are ready to speculate on human as in Israel all lined up to vaccinate</td><td>Dark Web Forum</td></tr>
              <tr><td>Polonord Adeste 5 Nasal Rapid Test Kit for SARS-CoV-2 Antigen (Nasal Swab) for Self-Diagnosis, 5 Units (1 pack of 5 rapid tests)</td><td>Amazon</td></tr>
              <tr><td>CLINTEST Rapid Covid-19 Antigen Self-Test</td><td>Amazon</td></tr>
              <tr><td>To be extra cautious, rotate such masks every three days</td><td>Reddit</td></tr>
              <tr><td>Safety was fine, not able to show efficacy. Since they didn’t release the data we don’t know how ineffective but that is what was reported.</td><td>Reddit</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-1-1">
          <title>3.1.1. Extraction</title>
          <p>
Extracting domains using Tor is a complex task because there is no public DNS server where all
hidden service (HS) addresses are registered. Tor relies on Hidden Service Directories (HSDirs),
which act as intermediate points: an HS publishes its descriptors to an HSDir, and clients
communicate with the HSDir to learn the addresses of the HS introduction
points [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. However, a Tor relay needs a specific flag assigned by the Tor directory authorities to function
as an HSDir.
          </p>
          <p>
            C-OSINT-e works similarly to the method proposed by Al Nabki et al. [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ]. Instead of
querying the flag, we use a custom crawler that retrieves onion web pages
and new addresses through a Tor socket on port 9050, starting from three sources: online notepad services on the surface web,
Tor network search engines, and hyperlinks from the DUTA dataset. Each service is
visited, and its ’.onion’ links are recursively extracted; the links are then cleaned, and duplicate and
inactive links are removed. Finally, the HTML code is downloaded using the functions
implemented in the Selenium library.
          </p>
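          <p>The recursive link discovery described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' crawler: the names extract_onion_links and crawl are hypothetical, the fetch argument stands in for whatever retrieves a page through Tor (for example, an HTTP client configured to use the local Tor socket on port 9050), and the regular expression assumes that onion hostnames are 16 or 56 base32 characters long.</p>
          <preformat>
```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional

# Assumption: onion hostnames are 56 or 16 base32 characters long.
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def extract_onion_links(html: str) -> List[str]:
    """Return the distinct '.onion' hostnames found in a page, in order."""
    seen: Dict[str, None] = {}
    for match in ONION_RE.finditer(html.lower()):
        seen.setdefault(match.group(0), None)
    return list(seen)


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Breadth-first discovery of onion services.

    fetch(host) is a caller-supplied function (hypothetical here) that
    returns the page HTML, or None when the service is inactive.
    """
    queue = deque(seeds)
    visited = set()
    pages: Dict[str, str] = {}
    while queue and max_pages > len(pages):
        host = queue.popleft()
        if host in visited:
            continue  # duplicate link, already handled
        visited.add(host)
        html = fetch(host)
        if html is None:
            continue  # inactive link, dropped
        pages[host] = html
        for link in extract_onion_links(html):
            if link not in visited:
                queue.append(link)
    return pages
```
          </preformat>
          <p>Passing fetch as a parameter keeps the discovery logic independent of the network layer, so the deduplication and inactive-link handling can be checked offline with a stub.</p>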
          <p>
            The code used to scrape both the Dark Web and surface web corpora is available
in a GitHub repository (https://github.com/ART-Group-it/C-OSINT). The data for corpus construction
were collected from November 2020 through March 2021. Appendix A.2 and Appendix A.3 list the
’.onion’ services analyzed.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Pre-processing &amp; Labeling</title>
          <p>The division into paragraphs and the cleaning of the dataset are done by the C-OSINT-d module,
following the methodology proposed by Choshen et al. [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]. In all experiments, we
clean the text of the corpora web pages. HTML markup is removed from the original
dataset; the same is done for non-linguistic content such as buttons, encryption keys, metadata
and URLs. Despite applying these pre-processing steps, the remaining textual elements are
unclear, and in some cases, unintelligible as domain-specific slang and abbreviations are widely
used on the Dark Web.</p>
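          <p>A cleaning step of this kind could look like the sketch below. It is an illustration only and uses the Python standard library; the function name clean_page, the set of skipped tags, and the assumption that encryption keys appear as armored PGP blocks are ours, not details taken from C-OSINT-d.</p>
          <preformat>
```python
import re
from html.parser import HTMLParser

# Any whitespace-delimited token that looks like a URL or onion address.
URL_RE = re.compile(r"\S*(?:https?://|www\.|\.onion)\S*")
# Assumption: encryption keys appear as armored PGP blocks.
PGP_RE = re.compile(r"-----BEGIN PGP.*?-----END PGP[^-]*-----", re.DOTALL)
# Assumption: these elements carry no linguistic content.
SKIP_TAGS = {"script", "style", "button", "head"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping markup and non-linguistic elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html: str) -> str:
    """Strip markup, buttons, metadata, key blocks, and URLs from a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    text = PGP_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```
          </preformat>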
          <p>
            The labeling process of new samples is carried out in two steps: (1) applying the text classifier proposed
previously; (2) sharing manually assigned tags. The main rules defined in [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ] consist of
labeling a domain based only on the textual content visible to the user; a domain should receive
only one tag based on its activity. In case of uncertainties, an open discussion is established
with the rest of the authors.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Surface Web</title>
        <p>For the other two additional datasets from legal sources, we compiled a corpus of Amazon and
Reddit pages of similar sizes and characteristics. Amazon is the largest hosting site for sellers
of various goods. The corpus from Amazon contains 630 item descriptions, each consisting
of more than one sentence. The item descriptions vary by price, item sold, and seller. The
descriptions were selected by searching Amazon for terms related to COVID-19 and selecting
search patterns to avoid excessive repetition. The search queries also included filtering by price
so that each query would result in different items. Due to sellers’ advertising strategies or
geographic dispersion, the Amazon corpus contains both formal and informal language, and
some item descriptions contain abbreviations and domain-specific words. Reddit is a social news,
entertainment, and forum website where registered users can publish content in textual posts
or hyperlinks. The corpus from Reddit contains 630 discussions on topics related to COVID-19.
The source codes to reconstruct the two datasets can be found in the GitHub repository.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>In this section, we experiment with our C-OSINT framework to investigate which Natural
Language Processing (NLP) algorithm achieves the best result in classifying the activities
entertained in DWMs and DWFs.</p>
      <p>The C-OSINT-c module is where our text classification experiments are done to find the
essential linguistic features that distinguish the activities carried out in different services.
Another goal of the classification task is to observe whether these tasks can be solved by holistic
Transformers, lexical models, syntactic models, stylistic models, and models derived from the
union of the previous ones.</p>
      <sec id="sec-4-1">
        <title>4.1. Methods: Classification Models</title>
        <p>The models proposed in this section aim to cover all linguistic needs in the study of style,
lexicon, and semantics.</p>
        <p>
          Holistic Transformers These classifiers are based on Transformer models [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] and have
achieved state-of-the-art results in many text classification tasks.
        </p>
        <p>
          We tested the following Transformer models to cover most pre-training
sizes (see Table 2) and model types:
          <list list-type="bullet">
            <list-item>
              <p>BERT [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ], that is, Bidirectional Encoder Representations from Transformers, is
trained on the BooksCorpus [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] and English Wikipedia.</p>
            </list-item>
            <list-item>
              <p>Multilingual BERT, that is, the multi-language version of BERT [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ], is trained on a Wikipedia
dump of 100 languages.</p>
            </list-item>
            <list-item>
              <p>XLNet [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ] is based on a generalized autoregressive pre-training technique that allows
the learning of bidirectional contexts by maximizing the expected likelihood over all
permutations of the factorization order. This architecture is trained on datasets gathered
from the surface web such as Wikipedia, BooksCorpus, Giga5, ClueWeb, and Common
Crawl.</p>
            </list-item>
            <list-item>
              <p>ERNIE [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ], to address some of BERT’s problems, introduces a language model
representation that uses an external knowledge graph for named entities. ERNIE is pre-trained on
the Wikipedia corpus and the Wikidata knowledge base.</p>
            </list-item>
            <list-item>
              <p>ELECTRA [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ] proposes a mechanism of “corrupting” the input: some tokens are replaced
with tokens that plausibly fit the same position. The training procedure classifies
each token as either a corrupted input or not. This model is trained on the same
dataset as BERT.</p>
            </list-item>
            <list-item>
              <p>DistilBERT [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ] proposes a method for pre-training a smaller, general-purpose language
representation model, much like BERT, that can then be fine-tuned with good performance on
a wide range of tasks like its larger counterparts.</p>
            </list-item>
            <list-item>
              <p>RoBERTa [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] is a replication study of BERT pre-training, with the major
difference being the focus on the impact of many key hyperparameters and the size of
the training data. Indeed, it appears that BERT is under-trained in some respects, and
changing the choice of hyperparameters may make a difference on some tasks.</p>
            </list-item>
          </list>
        </p>
        <p>
          All models described above were implemented using the official implementations
from the Hugging Face Transformers library [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ].
        </p>
        <p>Stylistic Classifier This classifier is used to determine whether the proposed tasks are sensitive to
syntactic and lexical information, and thus whether there is a stylistic difference between the source
domains. We would expect texts associated with selling merchandise to be written more formally,
with pre-defined structures. In contrast, forum users adopt different styles to express their ideas,
since forum texts follow no strict rules. Among other differences, we can point to the use
of capital letters, possible emoticons, and interjections in forum texts. For this purpose, we
apply two models, one purely based on word-level features and one based on shallow syntactic
structure.</p>
        <p>
          Bleaching text [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] is a model proposed to capture the style of writing at the word level.
A linear SVM classifier is applied over the final representation, which concatenates all the
‘bleached’ strings and is treated as a binary bag-of-words model.
        </p>
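        <p>To make the idea of bleaching concrete, the sketch below maps each word to a few abstract strings and collects them into a binary bag-of-words. It is a simplified illustration, loosely following the word-level abstractions usually associated with bleaching (character shape, vowel/consonant pattern, and length); the exact feature set used in our experiments is not reproduced here, and the function names are hypothetical.</p>
        <preformat>
```python
import re


def _condense(symbols: str) -> str:
    """Limit runs of the same symbol to at most two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", symbols)


def shape(word: str) -> str:
    """Upper-case letters become U, lower-case L, digits D, the rest X."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("U")
        elif ch.islower():
            out.append("L")
        elif ch.isdigit():
            out.append("D")
        else:
            out.append("X")
    return _condense("".join(out))


def vowels(word: str) -> str:
    """Vowels become V, other letters C, everything else O."""
    out = []
    for ch in word.lower():
        if ch in "aeiou":
            out.append("V")
        elif ch.isalpha():
            out.append("C")
        else:
            out.append("O")
    return "".join(out)


def bleach(word: str) -> list:
    """Return the bleached strings for one word."""
    return [shape(word), vowels(word), f"{len(word):02d}"]


def bleached_bag(text: str) -> set:
    """Binary bag-of-words: the set of bleached strings present in a text."""
    return {feature for word in text.split() for feature in bleach(word)}
```
        </preformat>
        <p>For instance, under these rules the token COVID-19 is abstracted to the shape UUXDD, so a classifier sees that a word is written in capitals and contains digits without seeing the word itself.</p>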
        <p>
          Part-of-speech tags (POS) [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] are labels assigned to each token (word) to indicate
its grammatical category and other information such as tense and number (plural/singular).
A vanilla feed-forward neural network (FFNN) classifier is applied to the final
representation, the concatenation of all converted strings treated as a binary bag-of-words model.
This model is trained with 300 dimensions for five epochs. The FFNN consists of an input layer
of dimension 300 and 2 hidden layers of 150 and 50 dimensions with the  activation
function.
        </p>
        <p>
          Lexical-based Neural Networks We used a classifier based on a vanilla feed-forward neural
network (FFNN) over a bag-of-word-embeddings (BoE) representation of sentences.
This classifier is used to determine whether the proposed tasks can be described
and classified through pre-trained word embeddings. In BoE, sentence representations are
computed as the sum of the embeddings of the constituent words of each sample in our
dataset. For this classification method, we used GloVe word embeddings [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ] trained on 2014
Wikipedia dumps and Gigaword 5. The FFNN used with the GloVe representation consists of an input
layer of 300 dimensions and two hidden layers of 150 and 50 dimensions with the 
activation function, and it was trained for five epochs as well.
        </p>
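        <p>The BoE sentence representation amounts to a vector sum, as the following sketch shows. The three-dimensional toy vectors stand in for the 300-dimensional pre-trained embeddings; lower-casing tokens and skipping out-of-vocabulary words are assumptions made for this illustration.</p>
        <preformat>
```python
from typing import Dict, List, Sequence


def sentence_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the vectors of the tokens found in the embedding table."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token.lower())
        if vector is None:
            continue  # assumption: unknown words contribute nothing
        total = [a + b for a, b in zip(total, vector)]
    return total


# Toy table with 3 dimensions instead of 300 (illustrative values only).
toy = {"green": [1.0, 0.0, 2.0], "pass": [0.5, 1.0, 0.0]}
vector = sentence_embedding(["Green", "Pass", "btc"], toy, dim=3)
```
        </preformat>
        <p>Because the sum ignores word order, this representation captures only lexical content, which is what makes it a useful contrast to the syntactic and stylistic models.</p>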
        <p>
          Syntactic-based Neural Networks Finally, to evaluate the role of “pre-trained” universal
syntactic models, we used the Kernel-inspired Encoder with Recursive Mechanism for
Interpretable Trees (KERMIT) [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ]. This model exploits parse trees in neural networks
and increases pre-trained Transformers’ performance when used in combined models. The
version used in the experiments encodes parse trees in vectors of 4,000 dimensions. The rest of
the FFNN comprises two hidden layers of 4,000 and 2,000 dimensions. Finally, the output layer
consists of 2 dimensions for classification. Between each layer, the  activation function
and a dropout of 0.1 was used to avoid overfitting on the train data.
        </p>
        <p>
          The KERMIT model exploits the parse trees produced by a traditional parser. As advised by
Zanzotto et al. [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ], we used the English constituency-based parser from the CoreNLP library [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental set-up</title>
        <p>Using C-OSINT-e and C-OSINT-d, we created four datasets: two from Dark Web Marketplaces
and Forums and two from Surface Web. Each corpus contains 630 examples labeled either
‘forum’ or ‘market’. In the experiments, the datasets were merged to build four balanced
comparisons, which were split into training and test sets with a 70/30 ratio. The evaluation was
done by extracting the accuracy of classification outputs.</p>
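        <p>The split and the metric can be sketched as below. The helper names, the fixed random seed, and the unstratified shuffle are assumptions made for illustration. If a comparison merges two corpora of 630 examples each, a 70/30 split gives 882 training and 378 test examples.</p>
        <preformat>
```python
import random
from typing import List, Sequence, Tuple


def train_test_split(
    examples: Sequence, test_ratio: float = 0.3, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle the examples and split them into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Two balanced corpora of 630 examples each (placeholder texts).
market = [("market text %d" % i, "market") for i in range(630)]
forum = [("forum text %d" % i, "forum") for i in range(630)]
train, test = train_test_split(market + forum)
```
        </preformat>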
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Looking at the performance of different approaches in the same dataset setting helps us compare
their ability to classify markets and forums from the dark and surface web.
Results of the experiments are reported in Tab. 3 with the configurations described in Sec. 4.</p>
      <p>These results show unexpected behavior of the applied models: Transformers
perform poorly on these domains, which their pre-training data do not cover. Although the BoE(GloVe) lexical model scores
better than the Transformers, it still lags behind the syntactic and stylistic approaches.
This poor performance can be attributed to the data on which these models were trained: all these
representations were trained on surface web datasets and do not generalize to data from
the Dark Web.</p>
      <p>The other results for the proposed tasks are mixed, but the trend is that all other models work better than
the Transformers. Stylistic models perform on par with syntactic models. The tasks where
stylistic models perform better are those that classify surface web services against their Dark
Web counterparts. These results suggest two things: (1) data from the Dark Web have a writing
style that can be captured through distinctive features such as all-capital words, abbreviations,
and punctuation; (2) a bag-of-words representation of POS tags can be a distinguishing factor
between Dark Web services and the surface web services Reddit and Amazon. The distinction is less evident between
DWFs and DWMs.</p>
      <p>
        Neural network models based on syntax perform well on this dataset. Here,
KERMIT [
        <xref ref-type="bibr" rid="ref60">60</xref>
        ] works better than Transformers, showing that these tasks are sensitive
to syntactic information that the Transformers cannot transfer to an unseen domain.
Moreover, although KERMIT uses a parser trained on the surface web to parse sentences [
        <xref ref-type="bibr" rid="ref62">62</xref>
        ],
syntactic rules are more restricted than the semantic and discourse-level information captured by
the Transformers. Yet, KERMIT can find the variations among these different domains. However, the
combined “pre-trained” lexical and syntactic model, BoE(GloVe) + KERMIT, does not outperform
the two models separately.
      </p>
      <p>In conclusion, monitoring the activities on the Dark Web and comparing them with
similar surface web services is an ongoing challenge. Using models such as Transformers to solve
text classification tasks is not consistently successful [63]. Possibly, content on the Dark Web
is written with a different style and grammar and requires a different representation
than what pre-trained embeddings offer. Although these models can handle
lexical and syntactic information [64, 65], they can also overfit to their training data. In other
words, they cannot transfer these types of knowledge to a new unseen domain.</p>
      <p>In future work, we plan to propose mechanisms for extracting and validating training data [66] and
to investigate the control mechanisms of neural networks [67], as initiated in [68]. Although
these avenues of research are exciting and compelling, they cannot yet be developed easily
because of the lack of data from obscure and hard-to-find domains.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        In this research, we investigated a new type of activity on the web that emerged due to the global
pandemic. The products and discussions around COVID-19 on two parts of the web, namely the
surface Web and the Dark Web, allowed us to investigate the performance of classification methods over
these two domains. Although national security agencies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and international security agencies
[
        <xref ref-type="bibr" rid="ref5 ref6 ref9">5, 6, 9</xref>
        ] continuously monitor these activities, they are not easily found, and automatic analysis
could produce misleading results.
      </p>
      <p>To this end, we proposed the C-OSINT framework to detect activity related to
COVID-19 in Dark Web Marketplaces and Forums. COSINT-e and COSINT-d are used to
extract and process data from heterogeneous sources such as ’.onion’ services and surface web
pages. COSINT-c provides a set of learning-based classifiers for the corpora extracted
with COSINT-e and COSINT-d.</p>
      <p>Given the success of Transformers in many downstream tasks, we expected similar
results on our extracted dataset. However, the results show that they cannot transfer their
knowledge to an unseen domain. Finally, we observed that other subtle features such as style
and syntactic information could be better clues for finding and distinguishing the activities
on the dark and surface web.</p>
      <p>In summary, our contribution is twofold: (1) we build an Open Source Intelligence
framework for activity recognition in the far reaches of the web around the COVID-19 topic; (2) we reach
the conclusion that adding external knowledge to the classification task in the form of
syntactic and stylistic information is more helpful than relying solely on pre-trained,
automatic Transformer-based classification.</p>
      <p>https://aclanthology.org/P14-5010. doi:10.3115/v1/P14-5010.</p>
      <p>[63] L. Ranaldi, F. Ranaldi, F. Fallucchi, F. M. Zanzotto, Shedding light on the dark web:
Authorship attribution in radical forums, Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/9/435.</p>
      <p>[64] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
2019, pp. 3651–3657.</p>
      <p>[65] J. Hu, J. Gauthier, P. Qian, E. Wilcox, R. Levy, A systematic assessment of syntactic
generalization in neural language models, in: Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, 2020, pp. 1725–1744.</p>
      <p>[66] L. Ranaldi, F. Fallucchi, A. Santilli, F. M. Zanzotto, Kermitviz: Visualizing neural network
activations on syntactic trees, in: E. Garoufallou, M.-A. Ovalle-Perandones, A. Vlachidis
(Eds.), Metadata and Semantic Research, Springer International Publishing, Cham, 2022,
pp. 139–147.</p>
      <p>[67] D. Onorati, P. Tommasino, L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Pat-in-the-loop:
Declarative knowledge for controlling neural networks, Future Internet 12 (2020). URL:
https://www.mdpi.com/1999-5903/12/12/218. doi:10.3390/fi12120218.</p>
      <p>[68] L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Dis-cover AI minds to preserve human knowledge,
Future Internet 14 (2022). URL: https://www.mdpi.com/1999-5903/14/1/10. doi:10.3390/fi14010010.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <sec id="sec-7-1">
        <title>A.1. Example of Listing</title>
        <p>COVID-19</p>
        <p>DWF</p>
        <p>RAID, dread, Nulled,
4chan, The Stock Insiders, Hidden Answers,
Acropolis Forum, torBBS,
Teddit forum, SuprBay, DeaChan</p>
      </sec>
      <sec id="sec-7-2">
        <title>A.3. Dark Web Marketplaces</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Michelle</surname>
          </string-name>
          ,
          <article-title>Covid: Pfizer-biontech vaccine approved for eu states</article-title>
          ,
          <year>2020</year>
          . URL: https://www.bbc.co.uk/news/world-europe-55401136.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <article-title>Moderna: Covid vaccine shows nearly 95% protection</article-title>
          ,
          <year>2020</year>
          . URL: https://www.bbc.com/news/health-54902908.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. Khan</given-names>
            <surname>Burki</surname>
          </string-name>
          ,
          <article-title>The russian vaccine for covid-19</article-title>
          ,
          <source>The Lancet Respiratory Medicine</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ). doi:10.1016/S2213-2600(20)30402-1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Sicurezzanazionale</surname>
          </string-name>
          ,
          <article-title>Relazione al parlamento 2021</article-title>
          ,
          <year>2022</year>
          . URL: https://www.sicurezzanazionale.gov.it/sisr.nsf/relazione-annuale/relazione-al-parlamento-2021.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>Covid-19 sparks upward trend in cybercrime</article-title>
          ,
          <year>2020</year>
          . URL: https://www.europol.europa.eu/media-press/newsroom/news/covid-19-sparks-upward-trend-in-cybercrime.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>Europol predictions correct for fake covid-19 vaccines</article-title>
          ,
          <year>2020</year>
          . URL: https://www.europol.europa.eu/media-press/newsroom/news/europol-predictions-correct-for-fake-covid-19-vaccines.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>B. R</surname>
          </string-name>
          , B. M, Jiang,
          <article-title>Availability of covid-19 related products on tor darknet markets</article-title>
          ,
          <source>Statistical Bulletin</source>
          , Canberra: Australian Institute of Criminology (
          <year>2020</year>
          ). doi:10.52922/sb04534.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Intelligence and Security Committee of Parliament</surname>
          </string-name>
          ,
          <source>Annual report</source>
          ,
          <year>2022</year>
          . URL: https://isc.independent.gov.uk/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>Eu drug markets: Impact of covid-19</article-title>
          ,
          <year>2020</year>
          . URL: https://www.europol.europa.eu/media-press/newsroom/news/eu-drug-markets-impact-of-covid-19.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>How covid-19-related crime infected europe during 2020</article-title>
          ,
          <year>2020</year>
          . URL: https://www.europol.europa.eu/sites/default/files/documents/how_covid-19-related_crime_infected_europe_during_2020.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>Europol warning on the illicit sale of false negative covid-19 test certificates</article-title>
          ,
          <year>2021</year>
          . URL: https://www.europol.europa.eu/media-press/newsroom/news/europol-warning-illicit-sale-of-false-negative-covid-19-test-certificates.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <article-title>Hiding your face is not enough: user identity linkage with image recognition</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>10</volume>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.1007/s13278-020-00673-4. doi:10.1007/s13278-020-00673-4.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dingledine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mathewson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Syverson</surname>
          </string-name>
          ,
          <article-title>Tor: The second-generation onion router</article-title>
          ,
          <source>in: Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, SSYM'04</source>
          ,
          USENIX Association, USA,
          <year>2004</year>
          , p.
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kintis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Antonakakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polychronakis</surname>
          </string-name>
          ,
          <article-title>An empirical study of the i2p anonymity network and its censorship resistance</article-title>
          ,
          <source>in: Proceedings of 2018 Internet Measurement Conference (IMC '18)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Gehl</surname>
          </string-name>
          ,
          <article-title>Weaving the dark web: a trial of legitimacy on freenet, tor, and i2p</article-title>
          , The MIT Press,
          <year>2018</year>
          . ISBN: 9780262038263.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
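          <string-name>
            <given-names>T.</given-names>
            <surname>de Boer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Breider</surname>
          </string-name>
          ,
          <article-title>Invisible Internet Project (Report)</article-title>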
          ,
          <source>Master's thesis</source>
          , University of Amsterdam,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Barratt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Ferris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Winstock</surname>
          </string-name>
          ,
          <article-title>Use of silk road, the online drug marketplace, in the united kingdom, australia and the united states</article-title>
          ,
          <source>Addiction</source>
          <volume>109</volume>
          (
          <year>2014</year>
          )
          <fpage>774</fpage>
          -
          <lpage>783</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Aldridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Décary-Hétu</surname>
          </string-name>
          ,
          <article-title>Not an 'ebay for drugs': The cryptomarket 'silk road' as a paradigm shifting criminal innovation</article-title>
          ,
          <source>SSRN Electron. J.</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Lost on the silk road: Online drug distribution and the 'cryptomarket'</article-title>
          ,
          <source>Criminology &amp; Criminal Justice</source>
          <volume>14</volume>
          (
          <year>2014</year>
          )
          <fpage>351</fpage>
          -
          <lpage>367</lpage>
          . URL: https://doi.org/10.1177/1748895813505234. doi:10.1177/1748895813505234.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Van Hout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bingham</surname>
          </string-name>
          ,
          <article-title>'silk road', the virtual drug marketplace: a single case study of user experiences</article-title>
          ,
          <source>Int. J. Drug Policy</source>
          <volume>24</volume>
          (
          <year>2013</year>
          )
          <fpage>385</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bracci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliapoulios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Teytelboym</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baronchelli</surname>
          </string-name>
          ,
          <article-title>Dark web marketplaces and covid-19: After the vaccines</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102.05470. doi:10.48550/ARXIV.2102.05470.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Dark web forum correlation analysis research</article-title>
          ,
          <source>in: 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1216</fpage>
          -
          <lpage>1220</lpage>
          . doi:10.1109/ITAIC.2019.8785760.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A focused crawler for dark web forums</article-title>
          ,
          <source>J. Assoc. Inf. Sci. Technol</source>
          .
          <volume>61</volume>
          (
          <year>2010</year>
          )
          <fpage>1213</fpage>
          -
          <lpage>1231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tavabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bartley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abeliuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <article-title>Characterizing activity on the deep and dark web</article-title>
          ,
          <source>in: Companion Proceedings of The 2019 World Wide Web Conference</source>
          , WWW '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>213</lpage>
          . URL: https://doi.org/10.1145/3308560.3316502. doi:10.1145/3308560.3316502.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Al Nabki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fidalgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alegre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>de Paz</surname>
          </string-name>
          ,
          <article-title>Classifying illegal activities on tor network based on web textual contents</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Association for Computational Linguistics, Valencia, Spain,
          <year>2017</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>43</lpage>
          . URL: https://aclanthology.org/E17-1004.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Eldad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hershcovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Abend</surname>
          </string-name>
          ,
          <article-title>The language of legal and illegal activity on the darknet</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Biryukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pustogarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Thill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.-P.</given-names>
            <surname>Weinmann</surname>
          </string-name>
          ,
          <article-title>Content and popularity analysis of tor hidden services</article-title>
          ,
          <source>2014 IEEE 34th International Conference on Distributed Computing Systems Workshops (ICDCSW)</source>
          (
          <year>2014</year>
          )
          <fpage>188</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>EUROPOL</surname>
          </string-name>
          ,
          <article-title>Eu terrorism situation trend report (te-sat)</article-title>
          ,
          <year>2021</year>
          . URL: https://www.europol.europa.eu/publications-events/main-reports/tesat-report.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>FBI</surname>
          </string-name>
          ,
          <article-title>Darknet takedown</article-title>
          ,
          <year>2017</year>
          . URL: https://www.fbi.gov/news/stories/alphabay-takedown.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical classification of web content</article-title>
          ,
          <source>in: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '00, Association for Computing Machinery, New York, NY, USA,
          <year>2000</year>
          , pp.
          <fpage>256</fpage>
          -
          <lpage>263</lpage>
          . URL: https://doi.org/10.1145/345508.345593. doi:10.1145/345508.345593.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Web classification using support vector machine</article-title>
          ,
          <source>in: Proceedings of the 4th International Workshop on Web Information and Data Management</source>
          , WIDM '02, Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>99</lpage>
          . URL: https://doi.org/10.1145/584931.584952. doi:10.1145/584931.584952.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Web page classification without the web page</article-title>
          ,
          <source>in: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers &amp; Posters</source>
          , WWW Alt. '04, Association for Computing Machinery, New York, NY, USA,
          <year>2004</year>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>263</lpage>
          . URL: https://doi.org/10.1145/1013367.1013426. doi:10.1145/1013367.1013426.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>W.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lochovsky</surname>
          </string-name>
          ,
          <article-title>Automatic hierarchical classification of structured deep web databases</article-title>
          ,
          <source>in: Proceedings of the 7th International Conference on Web Information Systems, WISE'06</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2006</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>221</lpage>
          . URL: https://doi.org/10.1007/11912873_23. doi:10.1007/11912873_23.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>H.-X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-L.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>A method of deep web classification</article-title>
          ,
          <source>in: 2007 International Conference on Machine Learning and Cybernetics</source>
          , volume
          <volume>7</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>4009</fpage>
          -
          <lpage>4014</lpage>
          . doi:10.1109/ICMLC.2007.4370847.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Overlier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Syverson</surname>
          </string-name>
          ,
          <article-title>Locating hidden servers</article-title>
          ,
          <source>in: 2006 IEEE Symposium on Security and Privacy (S&amp;P'06)</source>
          ,
          <year>2006</year>
          , pp. 15 pp.-114. doi:10.1109/SP.2006.24.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Murdoch</surname>
          </string-name>
          ,
          <article-title>Hot or not: Revealing hidden services by their clock skew</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Computer and Communications Security</source>
          , CCS '06, Association for Computing Machinery, New York, NY, USA,
          <year>2006</year>
          , p.
          <fpage>27</fpage>
          -
          <lpage>36</lpage>
          . URL: https://doi.org/10.1145/1180405.1180410. doi:10.1145/1180405.1180410.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>I.</given-names>
            <surname>Pete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hughes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bada</surname>
          </string-name>
          ,
          <article-title>A social network analysis and comparison of six dark web forums</article-title>
          ,
          <source>in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&amp;PW)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>493</lpage>
          . doi:10.1109/EuroSPW51379.2020.00071.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M.</given-names>
            <surname>Graczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kinningham</surname>
          </string-name>
          ,
          <article-title>Automatic product categorization for anonymous marketplaces</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>G.</given-names>
            <surname>Avarikioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kiayias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wattenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zindros</surname>
          </string-name>
          ,
          <article-title>Structure and content of the visible darknet</article-title>
          , CoRR abs/1811.01348 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1811.01348. arXiv:1811.01348.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>G.</given-names>
            <surname>Avarikioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kiayias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wattenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zindros</surname>
          </string-name>
          ,
          <article-title>Structure and content of the visible darknet</article-title>
          ,
          <year>2018</year>
          . arXiv:1811.01348.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M. W. A.</given-names>
            <surname>Nabki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fidalgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alegre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fernández-Robles</surname>
          </string-name>
          ,
          <article-title>Torank: Identifying the most influential suspicious domains in the tor network</article-title>
          ,
          <source>Expert Syst. Appl</source>
          .
          <volume>123</volume>
          (
          <year>2019</year>
          )
          <fpage>212</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nazah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Abawajy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <article-title>An unsupervised model for identifying and characterizing dark web forums</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>112871</fpage>
          -
          <lpage>112892</lpage>
          . doi:10.1109/ACCESS.2021.3103319.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
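          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>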
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Urtasun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>Aligning books and movies: Towards story-like visual explanations by watching movies and reading books</article-title>
          ,
          <year>2015</year>
          . arXiv:1506.06724.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual BERT?</article-title>
          ,
          <year>2019</year>
          . arXiv:1906.01502.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
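          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,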
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation</article-title>
          ,
          <source>ArXiv</source>
          abs/2107.02137 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>ELECTRA: Pre-training text encoders as discriminators rather than generators</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2020</year>
          . URL: https://openreview.net/pdf?id=r1xMH1BtvB.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , ArXiv abs/1910.01108 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , ArXiv abs/1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brew</surname>
          </string-name>
          ,
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          , ArXiv abs/1910.0 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
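          <string-name>
            <given-names>R.</given-names>
            <surname>van der Goot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Matroos</surname>
          </string-name>
          ,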
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <article-title>Bleaching text: Abstract features for cross-lingual gender prediction</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , Association for Computational Linguistics, Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>389</lpage>
          . URL: https://aclanthology.org/P18-2061. doi:10.18653/v1/P18-2061.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          ,
          <source>Natural language processing with Python: analyzing text with the natural language toolkit</source>
          , O'Reilly Media, Inc.
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>R.</given-names>
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <source>English Gigaword fifth edition LDC2011T07</source>
          , Technical Report, Linguistic Data Consortium
          , Philadelphia,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <article-title>Common Crawl</article-title>
          , URL: http://commoncrawl.org,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <source>ClueWeb09 data set</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Santorini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Marcinkiewicz</surname>
          </string-name>
          ,
          <article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>19</volume>
          (
          <year>1993</year>
          )
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          . URL: https://aclanthology.org/J93-2004.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GloVe: Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Doha, Qatar,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . URL: https://aclanthology.org/D14-1162. doi:10.3115/v1/D14-1162.
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ranaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Onorati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tommasino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fallucchi</surname>
          </string-name>
          ,
          <article-title>KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>256</fpage>
          -
          <lpage>267</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.18. doi:10.18653/v1/2020.emnlp-main.18.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Fast and accurate shift-reduce constituent parsing</article-title>
          ,
          <source>in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Sofia, Bulgaria,
          <year>2013</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>443</lpage>
          . URL: https://aclanthology.org/P13-1043.
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          ,
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          ,
          <source>in: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , Association for Computational Linguistics
          , Baltimore, Maryland,
          <year>2014</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>