<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>J. Tuomas Harviainen</string-name>
          <email>tuomas.harviainen@tuni.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Conference on Technology Ethics - Tethics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Digital Humanities, University of Helsinki, Yliopistokatu 4, 00014 University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Information Technology and Communication Sciences, Tampere University</institution>
          ,
          <addr-line>Kalevantie 4, Tampere, 33014</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Krister Lindén</institution>
        </aff>
      </contrib-group>
      <fpage>114</fpage>
      <lpage>131</lpage>
      <abstract>
        <p>We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a Finnish dark web marketplace website. The site was active from 2017 to 2021 and was during this time one of the most prominent online illegal narcotics markets in Finland. As a result of the presented work, a reduced version of the corpus, the Finnish Dark Web Marketplace Corpus (FINDarC), has been archived in the Language Bank of Finland. Researchers can apply for access rights to the corpus under the CLARIN RES license. The discussion presented in this paper addresses the archiving process, including assessment of the risk and impact of data subject re-identification, assessment and implementation of viable data anonymization/reduction approaches, assessment of privacy and security measures implemented by the Language Bank of Finland, and the future corpus management plan coordinated by the Language Bank of Finland.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>dark web, illegal narcotics, online marketplace, data sharing</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073</p>
      <p>We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a
dark web marketplace website, for research purposes. The site was active from 2017 to 2021
and was during this time one of the most prominent online illegal narcotics markets in Finland.
Functionally, the site consisted of discussion imageboards where vendors and customers were
able to set up instances of face-to-face trading, typically with the assistance of instant messaging
software such as Wickr or Telegram. The original, unmodified data set comprising 3,104,976
posts was collected and handed over to the ENNCODE consortium<sup>1</sup> by the site administration
to be archived and shared for research purposes, as permitted by the site’s Terms of Service.
As a result of the presented work, a reduced version of the corpus comprising 3,104,515 posts,
referred to as the Finnish Dark Web Marketplace Corpus (FINDarC), has been deposited in the
Language Bank of Finland, a language resource service coordinated by the national FIN-CLARIN
consortium formed by Finnish universities and other research organizations. Researchers can
contact the Language Bank and apply for permission to access the corpus under the CLARIN
RES license.<sup>2</sup></p>
      <p>
        While dark web online marketplaces, including Torilauta, emphasize user anonymity,
the posts submitted to such sites can nevertheless contain personal information, such as unique
usernames and personal names, enabling data subject re-identification. Therefore, as described
in this paper, we have made our best effort to assess and identify the type and amount of
personal information in the original unmodified data set, to assess and implement viable data
anonymization/reduction approaches, to assess privacy and security measures implemented by
the Language Bank of Finland, and to put in place a future corpus management plan coordinated
by the Language Bank of Finland. Moreover, any future research based on the corpus is
encouraged to implement appropriate ethical proofreading measures (see e.g. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]) in order to
further mitigate any potential harm from access to the material, to both the researchers and the
studied populations.
      </p>
      <p>The rest of the paper is organized as follows. We first discuss previous studies on Torilauta
and existing related corpora in Section 2. We then provide an overview of the data set in
Section 3. In Section 4, we present our experimental work, the purpose of which was to assess
the amount of personal and sensitive data in the corpus. Subsequently, given the experimental
results, we discuss in detail our data release plan in Section 5. Finally, we provide conclusions
on the work in Section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>In this section, we discuss previously published studies using the Torilauta site as a data source
and related corpora in Sections 2.1 and 2.2, respectively.</p>
      <sec id="sec-3-1">
        <title>2.1. Previous Studies on Torilauta</title>
        <p>
          Prior to the work presented here, Torilauta was utilized as a data source in multiple
linguistic and social science studies [
          <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18 ref21">15, 18, 16, 17, 21</xref>
          ]. In particular, Haasio et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] examined
the information needs of drug users using a sample of 9,300 posts.<sup>3</sup> Harviainen et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] studied
cultural and socioeconomic aspects of drug traders using the same 9,300 post sample.
Hämäläinen and Ruokolainen [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] studied narcotic substance vocabulary based on a sample of 3,000
posts. Hämäläinen et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] studied a sample of 1,654 usernames extracted from posts submitted
to the site. Karjalainen et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] examined the availability of illegal narcotics during the first
wave of the COVID-19 pandemic using a sample of 535 posts.
        </p>
        <p><sup>1</sup>Consortium website: https://research.tuni.fi/enncode/</p>
        <p><sup>2</sup>Permanent link to the corpus: http://urn.fi/urn:nbn:fi:lb-2022062221</p>
        <p>
          <sup>3</sup>In their paper, Haasio et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] refer to Torilauta using its other commonly used name Sipulitori.
        </p>
        <p>It is notable that none of the previous studies attempted to share their data sets with the
research community in a systematic manner. This practice negatively impacts the replication
and verification of the published studies and potentially discourages further research on the
topic. On the other hand, given the sensitive and potentially incriminating nature of the data,
not releasing the data is an understandable approach since preparing and managing such a
resource gives rise to multiple technical, ethical, and legal challenges. The purpose of this paper
is to describe and discuss these challenges and how we approached them.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Related Corpora</title>
        <p>
          To the best of our knowledge, there exist relatively few published dark web corpora or text
data sets. Three notable exceptions include the Dark Net Market archives (2013–2015) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
a collection covering 89 dark net markets and over 37 related forums (1.6TB uncompressed)
scraped during 2013-2015, DUTA [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a set of 7,000 text samples formed by sampling the Tor
network for two months, and CoDa [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], a set of 10,000 web documents tailored towards
text-based dark web analysis. All three corpora comprise primarily English texts and are either
publicly downloadable (Dark Net Market archives) or available to researchers upon request
(DUTA, CoDa).
        </p>
        <p>
          Existing Finnish web forum corpora include texts collected from the Ylilauta imageboard [35]
and the Suomi24 social networking site [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. While emphasizing user anonymity, both Ylilauta and
Suomi24 forums operate on the clear web and strictly forbid illegal content. Both corpora are
available for research purposes via the Language Bank of Finland under a Creative Commons
(CC BY-NC 4.0) license.
        </p>
        <p>
          The Finnish Internet Parsebank [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is a large-scale syntactically analyzed text collection
created using plain text webpage data made available by the Common-Crawl2 Internet crawl
project. Due to the employed web crawling approach to data collection, the corpus is likely to
contain web forum content.
        </p>
        <p>
          Finally, in a recent study, Leedham et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] discussed their work on archiving a
hard-to-access WiSP corpus consisting of texts written by social work professionals describing their
work practices. Due to the potentially sensitive nature of the texts, Leedham et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] created
two versions of the corpus: one for the research project and an anonymized/reduced version for
archiving. In a similar vein, our work presented here aims to provide an extensive discussion
on the process of preparing a corpus of potentially sensitive texts for archiving and sharing.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data</title>
      <p>In this section, we describe the original, unmodified corpus received by the ENNCODE
consortium. We provide an overview of the data as a whole and discuss in more detail the contained
data fields, post lifespans, and thread and message lengths in Sections 3.1, 3.2, 3.3, and 3.4,
respectively.</p>
      <sec id="sec-4-1">
        <title>3.1. Overview</title>
        <p>The original data set received by the consortium included all posts submitted to Torilauta
between 2019-09-11 and 2020-05-20 (1,863,639 posts in 251 days) and 2020-06-17 and 2020-10-31
(1,099,710 posts in 136 days). In addition to the posts collected during these active collection
periods, the data contained “residue” posts submitted between 2017-11-02 and 2019-09-11
(141,627 posts in 678 days). Meanwhile, posts submitted between 2020-05-20 and 2020-06-17
were missing completely. Therefore, the original unmodified corpus consisted of 3,104,976 posts
in total.</p>
        <p>Table 1 presents an overview of the post and thread frequencies in the data grouped by
boards along with brief topic descriptions. Of the 32 boards, the board with the highest activity
measured by the total number of submitted posts and threads was the market board dedicated
to narcotics transactions within the city of Helsinki (/hki). The total number of posts submitted
to this board was 787,459 corresponding to 25.4% of all posts in the data. Meanwhile, in total
96.5% (2,997,624) of all posts were submitted to the 16 boards dedicated to transactions (denoted
with (market) in the topic column of Table 1).</p>
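      <p>The counts reported in this section can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted above; the variable and function names are illustrative and not part of the corpus tooling.</p>
      <preformat><![CDATA[
```python
# Post counts per collection window, as reported in Section 3.1.
WINDOW_COUNTS = {
    "2019-09-11 to 2020-05-20": 1_863_639,
    "2020-06-17 to 2020-10-31": 1_099_710,
    "residue, 2017-11-02 to 2019-09-11": 141_627,
}

# The three windows should add up to the size of the original corpus.
TOTAL_POSTS = sum(WINDOW_COUNTS.values())


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


HKI_SHARE = share(787_459, TOTAL_POSTS)       # posts submitted to /hki
MARKET_SHARE = share(2_997_624, TOTAL_POSTS)  # posts on the 16 market boards

print(TOTAL_POSTS, HKI_SHARE, MARKET_SHARE)
```
]]></preformat>
      <p>The sum reproduces the reported corpus size of 3,104,976 posts, and the two shares round to the reported 25.4% and 96.5%.</p>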
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data Fields</title>
        <p>Each post in the corpus is represented as a data structure with 8 fields as shown in Table 2.
Each field belongs to one of the following three types: a string, an integer, or a date. All dates
are in timezone UTC+0 (GMT+0). Note that throughout this paper, we refer to a set of these
8 data fields as a post, whereas the content of the text data field within a single post is referred
to as the message or text body. Missing values have different meanings depending on the field.
For the deletion field, a missing value means the post was never deleted. The first post of a thread,
referred to as the original post (OP), always has a missing postId value and is instead identified
by its (boardUri, threadId) pair. The subject field is missing for 46.2% of all posts since it was
common practice to omit the subject. Similarly, the poster name field is missing for 54.1% of all
posts. This is in line with the anonymous nature of the site and with the fact that any optional
contact information, such as an instant messenger username, was often included in the message
text body instead.</p>
        <p>Finally, the posts submitted to Torilauta optionally contained an attached image. However,
no images were included in the original data set. Moreover, the data fields comprising a post
did not include information on whether the post contained an image or not.</p>
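        <p>The post structure described above could be modelled roughly as follows. This is a sketch under assumptions: only boardUri, threadId, postId and subject are named in the text, so the remaining field names (name, message, creation, deletion), the integer type of the identifiers and the example values are hypothetical placeholders for the fields listed in Table 2.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Post:
    # boardUri, threadId, postId and subject are named in the text; the
    # other four names are hypothetical stand-ins for the Table 2 fields.
    boardUri: str
    threadId: int                  # assumed to be an integer
    postId: Optional[int]          # missing for the original post (OP)
    subject: Optional[str]         # often omitted by posters
    name: Optional[str]            # poster name, often omitted
    message: str                   # text body
    creation: datetime             # UTC
    deletion: Optional[datetime]   # None means the post was never deleted

    def is_original_post(self) -> bool:
        return self.postId is None

    def key(self) -> tuple:
        """An OP is identified by its (boardUri, threadId) pair."""
        if self.is_original_post():
            return (self.boardUri, self.threadId)
        return (self.boardUri, self.threadId, self.postId)


# Invented example values, for illustration only.
created = datetime(2019, 1, 1, 12, 12, tzinfo=timezone.utc)
op = Post("hki", 1001, None, None, None, "example message", created, None)
reply = Post("hki", 1001, 1002, None, None, "up", created, None)
```
]]></preformat>
        <p>Representing missing values as None keeps the field-specific meanings explicit: a missing postId marks an original post, while a missing deletion time marks a post that was never deleted.</p>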
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Post Lifespans</title>
        <p>Submitted posts were deleted from the site for three main reasons. First, the site hosted a fixed
number of threads on each board at a given time and so inactive threads were regularly removed
by an automatic pruning mechanism to make room for new, active threads. Second, posts which
violated the site rules (e.g. spam) were removed by the site administration. Third, the site
interface did not provide users with means to edit messages and, therefore, the only way to
correct erroneous message content (e.g. typos, updates) was to delete the post and resubmit.
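Lastly, a small portion of posts were “pinned” by the site administration, that is, they were
meant to stay available on the site indefinitely.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Boards of the site with brief topic descriptions. The post and thread frequency columns (# posts, posts (%), # threads, threads (%)) are not reproduced here; the data contains 3,104,976 posts in total.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Board</th><th>Topic</th></tr>
            </thead>
            <tbody>
              <tr><td>hki</td><td>City of Helsinki (market)</td></tr>
              <tr><td>tre</td><td>City of Tampere (market)</td></tr>
              <tr><td>vnt</td><td>City of Vantaa (market)</td></tr>
              <tr><td>oulu</td><td>City of Oulu (market)</td></tr>
              <tr><td>muut</td><td>Other areas (market)</td></tr>
              <tr><td>tku</td><td>City of Turku (market)</td></tr>
              <tr><td>esp</td><td>City of Espoo (market)</td></tr>
              <tr><td>kpo</td><td>City of Kuopio (market)</td></tr>
              <tr><td>jkl</td><td>City of Jyväskylä (market)</td></tr>
              <tr><td>lti</td><td>City of Lahti (market)</td></tr>
              <tr><td>bulk</td><td>Bulk transactions (market)</td></tr>
              <tr><td>vsa</td><td>City of Vaasa (market)</td></tr>
              <tr><td>t</td><td>Dates</td></tr>
              <tr><td>roi</td><td>City of Rovaniemi (market)</td></tr>
              <tr><td>seka</td><td>Miscellaneous (market)</td></tr>
              <tr><td>hm</td><td>Narcotics markets</td></tr>
              <tr><td>b</td><td>Random</td></tr>
              <tr><td>h</td><td>Narcotics</td></tr>
              <tr><td>y</td><td>Jobs</td></tr>
              <tr><td>pm</td><td>Mail orders (market)</td></tr>
              <tr><td>a</td><td>Everyday/mundane</td></tr>
              <tr><td>hox</td><td>Hormones (market)</td></tr>
              <tr><td>hax</td><td>Hacking</td></tr>
              <tr><td>kkk</td><td>Cultivation</td></tr>
              <tr><td>meta</td><td>Meta discussion</td></tr>
              <tr><td>test</td><td>Testing</td></tr>
              <tr><td>spam</td><td>Spamming</td></tr>
              <tr><td>rotta</td><td>Vendor feedback</td></tr>
              <tr><td>tt</td><td>Health</td></tr>
              <tr><td>k</td><td>Getting and staying sober</td></tr>
              <tr><td>fap</td><td>Porn</td></tr>
              <tr><td>pgp</td><td>PGP public keys</td></tr>
            </tbody>
          </table>
        </table-wrap>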
        <p>There are two known caveats related to the deletion timestamps. First, while the data set
included the creation and deletion times of posts, it unfortunately did not include information
about the reason for the deletion. Second, all posts deleted during the pause in collection
from 2020-05-20 to 2020-06-17 had their deletion value marked as missing and, therefore, appear
as if they were never deleted. The number of these potentially erroneous missing values was,
however, relatively small, and 97.22% of all 3,104,976 posts in the data had reliable deletion
time information.</p>
        <p>Finally, we estimated the median lifespans of submissions to the market and non-market
boards to be 23 and 238 hours, respectively. The difference was mainly due to the lower posting
frequency and consequently lower thread pruning frequency of the non-market boards. For
noise filtering purposes, we are mostly interested in the messages with short lifespans. To this
end, we note that 5% of all messages had a lifespan of less than 32 minutes. Since posts with
such short lifespans were likely removed by the user and resubmitted after minor modifications,
they may be discarded as noise.<sup>4</sup></p>
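        <p>The lifespan-based noise filter suggested above can be sketched as follows. The 32-minute threshold is taken from the text; the function names and the representation of a missing deletion time as None are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 5% of all messages had a lifespan shorter than this (Section 3.3).
SHORT_LIFESPAN = timedelta(minutes=32)


def lifespan(creation: datetime,
             deletion: Optional[datetime]) -> Optional[timedelta]:
    """Time a post stayed on the site; None if no deletion time is recorded."""
    if deletion is None:
        return None
    return deletion - creation


def is_short_lived(creation: datetime,
                   deletion: Optional[datetime],
                   threshold: timedelta = SHORT_LIFESPAN) -> bool:
    """True for posts deleted sooner than the threshold after creation."""
    span = lifespan(creation, deletion)
    return span is not None and span < threshold


# Invented timestamps, for illustration only.
created = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
quick_repost = is_short_lived(created, created + timedelta(minutes=10))
pruned_later = is_short_lived(created, created + timedelta(hours=23))
never_deleted = is_short_lived(created, None)
```
]]></preformat>
        <p>Posts without a recorded deletion time are never flagged, which also covers the posts whose deletion value is missing because of the pause in collection.</p>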
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Thread and Message Lengths</title>
        <p>By and large, the threads on the market boards were rather short: the majority (57.9%) consisted
of a single post (that is, the thread had no reply or follow-up messages to the original post) and
99% of the threads had 23 posts or fewer. Similarly, on the non-market boards, 64.18% of the
threads consisted of a single post and 99% of the threads had 35 posts or fewer.</p>
        <p>In order to examine individual message lengths, we tokenized the message bodies using
the tokenizer included in the Finnish Tagtools (v1.5) software developed at the University of
Helsinki.<sup>5</sup> The tokenizer splits running text into sentences and extracts punctuation and special
characters from word bodies into separate tokens. The median lengths of original posts and
reply/follow-up posts to the market boards were 36 and 6 word tokens, respectively. In other
words, the original posts were typically short (a few sentences) and reply/follow-up posts even
shorter (a few words). As for the reply/follow-up posts, this was because of the prevalence of
short update posts, the purpose of which was to keep the thread alive and close to the front
page. These types of posts comprised roughly one fifth of all market board reply/follow-up
posts. Therefore, one can reduce the amount of noise in FINDarC considerably by ignoring
reply/follow-up posts with, e.g., fewer than 5 word tokens (or fewer than 20 characters) in the
message body.<sup>6</sup></p>
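        <p>A length-based filter for short update posts could look like the sketch below. The thresholds of 5 word tokens and 20 characters come from the text. The regular-expression tokenizer is only a stand-in for the Finnish Tagtools tokenizer and will not reproduce its token counts exactly; the function names are illustrative.</p>
        <preformat><![CDATA[
```python
import re
from typing import Optional

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list:
    return TOKEN_RE.findall(text)


def is_noise_reply(post_id: Optional[int], message: str,
                   min_tokens: int = 5,
                   min_chars: Optional[int] = None) -> bool:
    """Flag short reply/follow-up posts; original posts are always kept.

    By default the token criterion is used. Passing min_chars (e.g. 20)
    switches to the alternative character-length criterion.
    """
    if post_id is None:  # original post of a thread
        return False
    if min_chars is not None:
        return len(message.strip()) < min_chars
    return len(tokenize(message)) < min_tokens


# Invented messages, for illustration only.
bump = is_noise_reply(12, "up")
offer = is_noise_reply(13, "myydään hyvää kukkaa Helsingin keskustassa tänään")
original = is_noise_reply(None, "up")
```
]]></preformat>
        <p>Original posts are exempt because the recommendation in the text concerns reply/follow-up posts only.</p>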
        <sec id="sec-4-4-1">
          <p><sup>4</sup>This is in agreement with the recommendation of the site administrator.</p>
          <p><sup>5</sup>Available at http://urn.fi/urn:nbn:fi:lb-2021042102</p>
          <p><sup>6</sup>This is in agreement with the recommendation of the site administrator.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Assessing the Frequency and Nature of Personal Data</title>
      <p>In this section, we present our experimental work, the purpose of which was to assess the
amount and types of personal and sensitive data in the FINDarC corpus. As the size of the corpus
exceeded 3,000,000 messages, it was deemed infeasible to curate all the posts manually
given the consortium resources. In Section 4.1, we examine the types of personal data in the
message bodies using manual annotation. In Section 4.2, we discuss experiments utilizing full-text search
using hand-crafted regular expressions. Finally, the implications of the results on data reduction
and data release are discussed in Section 5.</p>
      <sec id="sec-5-1">
        <title>4.1. Manual Annotation</title>
        <p>In this part of the experiments, we examined the types of personal data contained in the message
bodies of posts using a manual annotation approach. In particular, we were interested in the types
of personal information found in the market board submissions, which contained the majority
of posts in the data set (97%) and represent the primary function of the site as a narcotics
marketplace.</p>
        <p><bold>4.1.1. Definition of Personal Data</bold></p>
        <p>We adopted Article 4(1) of the GDPR,<sup>7</sup> which defines personal data as</p>
        <disp-quote>
          <p>any information relating to an identified or identifiable natural person (data subject);
an identifiable natural person is one who can be identified, directly or indirectly, in
particular by reference to an identifier such as a name, an identification number,
location data, an online identifier or to one or more factors specific to the physical,
physiological, genetic, mental, economic, cultural or social identity of that natural
person;</p>
        </disp-quote>
        <p>and divided potential identifiers into the following 5 classes:</p>
        <list list-type="order">
          <list-item><label>1.</label><p><monospace>person_name</monospace> (e.g. first name, family name)</p></list-item>
          <list-item><label>2.</label><p><monospace>id_number</monospace> (e.g. the Finnish social security number)</p></list-item>
          <list-item><label>3.</label><p><monospace>location</monospace> (e.g. street address, city, city district)</p></list-item>
          <list-item><label>4.</label><p><monospace>online_id</monospace> (e.g. instant messaging username, email address, IP address)</p></list-item>
          <list-item><label>5.</label><p><monospace>other</monospace> (e.g. phone number, identifying physical appearance)</p></list-item>
        </list>
        <p>Moreover, GDPR Article 9(1)<sup>8</sup> defines the special categories of personal data as follows:</p>
        <disp-quote>
          <p>Processing of personal data revealing racial or ethnic origin, political opinions,
religious or philosophical beliefs, or trade union membership, and the processing
of genetic data, biometric data for the purpose of uniquely identifying a natural
person, data concerning health or data concerning a natural person’s sex life or
sexual orientation shall be prohibited.</p>
        </disp-quote>
        <p>For cases belonging to this group, we added a sixth potential identifier class:</p>
        <list list-type="order">
          <list-item><label>6.</label><p><monospace>special_category</monospace></p></list-item>
        </list>
        <p>Given the personal data class specifications, the annotation task then comprised two subtasks:
entity span detection and class assignment on a word token level to continuous, non-overlapping
sequences of tokens.<sup>9</sup> For example, consider</p>
        <disp-quote>
          <p>priimaa kukkaa [Helsingin keskustassa]<sub>location</sub> ! yhteydenotot Wickerillä W // [example-name]<sub>online_id</sub></p>
          <p>excellent bud [in the centre of Helsinki]<sub>location</sub> ! contact Wicker W // [example-name]<sub>online_id</sub></p>
        </disp-quote>
        <p><sup>7</sup>https://gdpr-info.eu/art-4-gdpr/</p>
        <p><sup>8</sup>https://gdpr-info.eu/art-9-gdpr/</p>
        <p><bold>4.1.2. Personal Data versus Named Entities</bold></p>
        <p>
          The definition of personal data and the personal data classes described in the previous section
are closely related to named entities [
          <xref ref-type="bibr" rid="ref14 ref30">14, 30</xref>
          ], in particular, person and location names. However,
while the personal data classes and the named entities overlap, they are not
equivalent, nor is one a subset of the other, for the following two main reasons.
First, a person or location name mentioned in running text is almost invariably identified as a
named entity independent of the context. For example, consider the sentence
        </p>
        <sec id="sec-5-1-1">
          <disp-quote>
            <p>Sauli Niinistö on Suomen presidentti.</p>
            <p>Sauli Niinistö is the President of Finland.</p>
          </disp-quote>
          <p>
            where Sauli Niinistö and Suomen (= Finland) are defined as person and location named entities,
respectively, according to the Finnish NER guidelines [
            <xref ref-type="bibr" rid="ref25 ref29">29, 25</xref>
            ]. In contrast, neither of these
should be interpreted as personal data according to the GDPR, as stating the name and occupation
of a well-known public figure does not breach their privacy. Second, not all pieces of personal
name or location information found in a message will be classified as named entities. This is
because the definition of personal data takes into account a wider context than what is directly
available in the message. In other words, any piece of information found in text relating to an
individual can be considered personal data if it is reasonable to assume that the information in
question could be combined with other information to identify that individual. For example,
consider the modified version of a post in the corpus presented in Table 3. From the information
contained in this example post, one can infer that
          </p>
          <list list-type="order">
            <list-item><label>1.</label><p>at 12:12 (UTC) on January 1st 2019 (according to the post creation timestamp)</p></list-item>
            <list-item><label>2.</label><p>a person using the Wicker username example-name (according to the contact information included in the message)</p></list-item>
            <list-item><label>3.</label><p>was selling illegal narcotics (according to subject and message) in the Helsinki area (according to the post boardUri) and</p></list-item>
            <list-item><label>4.</label><p>more specifically in the Eastern part of Helsinki and</p></list-item>
            <list-item><label>5.</label><p>with a high likelihood near the metro track</p></list-item>
          </list>
          <p>In consequence, we would regard the mentions of East (idässä) and metro (metro) in the message
body as personal location data. In contrast, these mentions would not be considered named
location entities according to the Finnish NER annotation guidelines.</p>
          <p><sup>9</sup>The tokenization of messages was acquired using the Finnish Tagtools (ver. 1.5) toolkit.</p>
          <p><bold>4.1.3. Manual NER annotation</bold></p>
          <p>The results of a preliminary manual annotation showed, rather unsurprisingly, that the majority
of personal information cases consisted of instant messenger usernames and areas posted as
potential settings for face-to-face transactions. On the other hand, the preliminary experiment
also revealed that even native Finnish speakers can struggle with the text domain and that
manually curating large parts of the corpus would be an excessively tedious effort. This also
means that automatic text processing methods are likely to struggle with the
text domain.</p>
          <p><bold>4.1.4. Automated NER annotation</bold></p>
          <p>
            Text anonymization/reduction approaches proposed in the literature commonly utilize automatic
NER as a part of the processing pipelines to varying extents [
            <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref27 ref32 ref7 ref8">8, 32, 1, 27, 7, 11, 13, 12</xref>
            ]. Ideally,
NER tools would also be useful when processing FINDarC as they could, in principle, be
employed to automatically detect some direct personal identifiers, such as names and addresses.
However, we found this approach problematic to implement as the number of posts marked
with person and location entities was in the hundreds of thousands. While substantially smaller than
the original data set of over 3,000,000 posts, this set was still too large to be curated manually
given the available consortium resources. Examining the predictions on the manually annotated
data set suggested that the available tools suffered from a domain mismatch in addition to the
inherent mismatch between personal data and named entity classes. This was not a completely
surprising outcome since the text domain could also cause problems for human annotators.
Because the tools tended both to miss entities of interest (low recall) and to be incorrect when
detecting entities (low precision), we did not consider them efficient curating tools for FINDarC
in their current states and instead continued to the full-text search experiments presented in
Section 4.2.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Full-Text Search</title>
        <p>In this section, we assess the amount and types of personal data in the data set using a full-text
search approach. In particular, we are interested in finding common personal identifiers with
relatively rigid formats, such as social security numbers and phone numbers. We describe
the experiment setup in Section 4.2.1, the applied textual patterns (regular expressions) in
Section 4.2.2, and the obtained results in Section 4.2.3.</p>
        <p><bold>4.2.1. Setup</bold></p>
        <p>We define a target set of textual patterns (regular expressions) and search for matches in message
bodies. Specifically, we are interested in finding expressions matching</p>
        <list list-type="order">
          <list-item><label>1.</label><p>(Finnish) social security numbers</p></list-item>
          <list-item><label>2.</label><p>(Finnish) phone numbers</p></list-item>
          <list-item><label>3.</label><p>Email addresses</p></list-item>
          <list-item><label>4.</label><p>IBAN bank accounts</p></list-item>
          <list-item><label>5.</label><p>IP addresses</p></list-item>
        </list>
        <p>all of which have relatively rigid formats. The employed regular expressions are presented in
Section 4.2.2. We apply the search to all posts in the data and manually classify the matches
as personal data or non-personal data according to post context. The pattern matching is
performed using MongoDB (v.5.0) full text search.<sup>10</sup> In contrast to the manual annotation
experiment discussed in Section 4.1, we do not filter out noise from the data and instead apply
the search to all 3,104,976 posts in the original corpus.</p>
        <p><bold>4.2.2. Regular Expressions</bold></p>
        <p>In what follows, we provide brief descriptions of the applied regular expressions.</p>
        <p><bold>Social security numbers</bold> The Finnish social security number (SSN) is a sequence of 11
characters assigned to individuals by the Finnish government based on their date of birth and
gender. The first 10 characters of the sequence are 6 digits (date of birth), followed by a
hyphen or the letter A, followed by 3 digits. The last character is alphanumeric, i.e., a digit or
a letter. Valid sequences therefore have formats such as “121212-1234” and “121212-123A”. We
detect the sequences using the regular expression <monospace>\d\d\d\d\d\d\-\d\d\d[a-zA-Z0-9]</monospace>. SSNs of persons
born in the 2000s, which would have an “A” instead of the hyphen, were not found in the sample.</p>
        <p><bold>Phone numbers</bold> According to the specification of the Finnish telephone network numbering,
Finnish mobile phone numbers begin with a routing number (04-, 050, or 059) and are followed
by a subscriber number, such as “040 1234567”, “059 1234567”, and so forth.<sup>11</sup> The first zero
(“0”) of the number can optionally be replaced by the country code of Finland +358 (e.g. “+358
40 1234123”, “+358 59 4321432”, etc.). Based on a preliminary examination of the data set, we
detect common phone number formats using two regular expressions:
<monospace>[\+]?358[\-\s]?0[45][\-\s]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d</monospace>, which matches numbers starting with the country code,
and <monospace>0[45]\d[\s\-]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d</monospace>, which detects numbers with the country
code omitted. Moreover, the expressions detect the most commonly used grouping patterns using
hyphens (e.g. “+358-40-12345-567”) and whitespaces (e.g. “059 123 4567”). While the subscriber
part of the number can, in principle, vary in length, the patterns match the most common
length of 7 digits. Landline numbers would be shorter but follow the same principles; none
were, however, found in the data.</p>
        <p><sup>10</sup>https://www.mongodb.com/</p>
        <p><sup>11</sup>Specification of numbers in the Finnish phone network is available at: https://www.finlex.fi/fi/viranomaiset/normi/480001/47180</p>
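        <p>As an illustration, the social security number pattern and the phone number pattern without the country code can be tried out with Python's re module. This is a sketch only: the search in the paper was run in MongoDB, the patterns are transcribed from the text above, and the function name and sample message are invented. The country-code variant is left out of the sketch.</p>
        <preformat><![CDATA[
```python
import re

# Patterns transcribed from Section 4.2.2 (assumed to carry over to Python).
SSN_RE = re.compile(r"\d\d\d\d\d\d\-\d\d\d[a-zA-Z0-9]")
PHONE_NATIONAL_RE = re.compile(r"0[45]\d[\s\-]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d")


def find_candidates(message: str) -> dict:
    """Return raw matches; each one still needs manual verification."""
    return {
        "ssn": SSN_RE.findall(message),
        "phone": PHONE_NATIONAL_RE.findall(message),
    }


# Invented message using the example formats quoted in the text.
sample = "soita 040 1234567 tai 059 123 4567"
candidates = find_candidates(sample)
print(candidates)
```
]]></preformat>
        <p>The matches are candidates rather than confirmed identifiers, which is why every hit was assigned manually to personal or non-personal data according to the post context.</p>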
        <p><bold>Email addresses</bold> According to the RFC 5322 standard,<sup>12</sup> an email address is an identifier
which contains a locally interpreted string followed by the at-character (“@”) followed by an
internet domain, such as “name@domain.com”, “firstname.surname@subdomain.domain.com”,
and “underscore-hyphen-plus+sign@domain.com”. We detect the addresses using the regular
expression <monospace>\S+\@\S+\.\S+</monospace>, which successfully detects all the above examples in running
text.</p>
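        <p>The email pattern can be checked against the three example addresses in the same way. Again a sketch under the assumption that the pattern behaves the same in Python's re module as in the MongoDB search; the surrounding sample text is invented.</p>
        <preformat><![CDATA[
```python
import re

# Pattern transcribed from Section 4.2.2.
EMAIL_RE = re.compile(r"\S+\@\S+\.\S+")

EXAMPLES = [
    "name@domain.com",
    "firstname.surname@subdomain.domain.com",
    "underscore-hyphen-plus+sign@domain.com",
]

# Embed the examples in running text and collect the matches.
running_text = "yhteys " + " tai ".join(EXAMPLES) + " kiitos"
found = EMAIL_RE.findall(running_text)
print(found)
```
]]></preformat>
        <p>Because the pattern accepts any run of non-whitespace characters around the at-character, it is deliberately permissive and can pick up adjacent punctuation, so manual verification of the matches remains necessary.</p>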
        <p><bold>IBAN bank accounts</bold> We search for bank account numbers matching the International
Bank Account Number (IBAN) structure specified by the ISO 13616-1:2020 standard.<sup>13</sup> The
IBAN formatted numbers consist of the Finnish bank account number (14 digits) preceded by
a two-letter country code (“FI” for Finland) and two check digits (e.g. “FI72 1234 5678 1234
12”). We detect the pattern using the regular expression
<monospace>[Ff][Ii]\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d</monospace>, which takes into consideration the letter case of the
country code and the commonly used grouping whitespaces.</p>
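        <p>A corresponding sketch for the IBAN pattern is given below. The pattern is transcribed from the text; the normalize helper, which strips grouping characters so that differently formatted matches can be compared, is an addition made for illustration and is not described in the paper.</p>
        <preformat><![CDATA[
```python
import re

# Pattern transcribed from Section 4.2.2.
IBAN_FI_RE = re.compile(
    r"[Ff][Ii]\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d"
)


def normalize(match: str) -> str:
    """Hypothetical helper: drop grouping characters, upper-case the code."""
    return re.sub(r"[\s\-]", "", match).upper()


# Invented message using the example number quoted in the text.
sample = "tilille FI72 1234 5678 1234 12 kiitos"
hits = IBAN_FI_RE.findall(sample)
normalized = [normalize(h) for h in hits]
print(hits, normalized)
```
]]></preformat>
        <p>The pattern checks only the shape of the number; it does not validate the two check digits.</p>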
        <p>
          <bold>IP addresses</bold> IP (Internet Protocol) addresses are unique addresses which identify devices
on the internet and local networks. We search for IP addresses using the following regular
expression: <monospace>((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)</monospace>, which
matches patterns such as 88.777.66.555 and so forth.
4.2.3. Results
The frequencies of matched social security numbers, phone numbers, email addresses, bank
account numbers, and IP addresses are presented in Table 4. As shown, the most and least
frequent matched types were email addresses and bank account numbers with 1,840 and 12
regular expression matches, respectively. Due to the sufficiently low number of original matches,
we were able to manually verify all the cases; the results are also presented in Table 4. According
to this inspection, the phone numbers and email addresses occurred in two contexts. First,
similarly to the instant messaging usernames, 491 out of 858 phone numbers and 1,622 out of
1,837 email addresses were posted as contact information by
individuals themselves. The remaining cases were posted as a means of targeting people. In
such cases, personal details (e.g., name, relationship information, area of residence) were shared
in connection with one or more usernames, in order to paint the person as a potential target for
violence. Bank account numbers occurred similarly in two contexts. Out of the 16 IP addresses,
10 cases were included as a means of targeting, while the remaining 6 were provided as a type
of contact information. Finally, all 73 social security numbers and all 12 bank
account numbers found were posted for the purpose of targeting. Thus, we identified in total 667 cases
of targeting in 295 posts using this method.
12 The RFC 5322 specification is available at: https://datatracker.ietf.org/doc/html/rfc5322
13 //www.iso.org/standard/81090.html
        </p>
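        <p>The IP address search described above can be sketched as follows. This is a minimal illustration assuming the standard re module; the names OCTET, IPV4_RE and find_ipv4 are hypothetical, and the sample address comes from a range reserved for documentation.</p>
        <preformat>
```python
import re

# One octet in the range 0-255, written as in the pattern from the text.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three octets each followed by a dot, then a final octet.
IPV4_RE = re.compile(r"(?:" + OCTET + r"\.){3}" + OCTET)


def find_ipv4(text):
    """Return all substrings of `text` that match the IPv4 pattern."""
    return [m.group(0) for m in IPV4_RE.finditer(text)]


# The pattern is not anchored, so it can also match part of a longer
# digit sequence; matches therefore still need manual inspection.
print(find_ipv4("server at 192.0.2.17 is down"))
```
        </preformat>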
        <p>
          Finally, we created a second regular expression list using words and prefixes related to the
personal information contained in the identified 295 targeting posts. This list consisted of 77
keywords and parts of person names and addresses.14 After performing a second search with
these patterns and a subsequent manual inspection, we identified an additional set of 166 posts
submitted as a means of targeting.
4.2.4. Discussion
Similarly to automatic NER, rule-based search for text patterns is a commonly used part of text
anonymization pipelines to varying extents [
          <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref27 ref32 ref7 ref8">8, 32, 1, 27, 7, 11, 13, 12</xref>
          ]. For multiple cases, such
as email addresses and social security numbers, it is rather straightforward to write the patterns
as regular expressions. Moreover, one can efficiently find pattern matches from large data
sets using suitable databases, such as MongoDB, or search engines. However, the caveat
of examining the results is, of course, that manually verifying the matches only provides us
with an estimate of the precision of the search while neglecting recall. In other words, one
cannot estimate the number of cases missed by the search in the whole corpus without manually
curating thousands or, preferably, tens of thousands of posts. Unfortunately, an annotation
effort of this scale was not feasible given the available resources of the consortium. Nevertheless,
using the search combined with manual examination of the search results, we were able to
uncover a set of 461 posts submitted with a purpose of targeting individuals. While extremely
rare compared to the total number of posts in the corpus, these posts are particularly interesting
from the point of view of data reduction and will be discussed further in Section 5.
14 We do not present the list here due to obvious privacy issues.
        </p>
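        <p>The second, keyword-based search pass described above can be sketched roughly as follows. The keyword list and posts are hypothetical placeholders (the actual list is not published), and the function names are illustrative assumptions rather than the code used for the corpus.</p>
        <preformat>
```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword or prefix."""
    # Longer alternatives first, so that a keyword is preferred over its prefix.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile("|".join(escaped), re.IGNORECASE)


def candidate_posts(posts, pattern, already_flagged):
    """Return ids of posts that match the pattern and are not yet flagged.

    The result is only a candidate list: each post still needs manual
    inspection before it is treated as a targeting post.
    """
    return [
        post_id
        for post_id, text in posts.items()
        if post_id not in already_flagged and pattern.search(text)
    ]


# Hypothetical placeholder data.
keywords = ["examplename", "examplestreet"]
posts = {
    1: "selling, contact via messenger",
    2: "ExampleName lives near examplestreet 5",
    3: "buying in Tampere",
}
pattern = build_keyword_pattern(keywords)
print(candidate_posts(posts, pattern, already_flagged={1}))
```
        </preformat>
        <p>As with the first pass, a search of this kind only yields candidates whose precision can be checked by hand; it gives no estimate of how many targeting posts were missed.</p>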
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Data Release</title>
      <p>In this section, we draw on the discussion and experimental results presented in the previous
sections and present an outline of the FINDarC data release.</p>
      <p>
        Data Reduction Conventionally, the most direct approach to protect data subjects from
re-identification has been to anonymize the data by removing/obscuring the parts containing
personal information [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. However, it appears evident that, if implemented successfully, this
type of processing would have a profound impact on the usefulness of FINDarC for research
purposes. For example, subsequent to removing usernames from their post contexts or from
the data altogether, one would not be able to replicate the study of Hämäläinen et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] who
examined how sellers and buyers of illegal drugs represent themselves in their usernames. In
turn, subsequent to removing location and/or timestamp data, one would no longer be able to
replicate the study of Karjalainen et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] who studied the availability of drugs specifically in
the city of Tampere during the COVID-19 epidemic in the spring of 2020. From a utility point
of view, therefore, it could be argued that removing personal information from the buy/sell post
threads would quickly degrade, or destroy, the usefulness of the corpus as a data source for
research. This problem is generally referred to as the privacy-utility trade-off within the data
privacy literature [
        <xref ref-type="bibr" rid="ref24 ref3">24, 3</xref>
        ].
      </p>
      <p>Due to the problematic privacy-utility trade-off, we posit here that reducing the FINDarC
extensively would not be appropriate even if sufficient resources could be allocated for
domain-specific tool development and manual labour. Furthermore, we note that Torilauta and other
drug trading sites have also been under observation by other parties, including both criminals
and law enforcement agencies. Therefore, it is our assessment that leaving the sell/buy posts,
which form the majority of the FINDarC, largely intact poses few additional risks to the studied
populations. However, as discussed in Section 4.2, in addition to the sell/buy posts, the data also contains
posts with the intention of doxxing/targeting individuals. Here, our position is that removing
these submissions is warranted from an ethical point of view while not decreasing the value of
the corpus as a data source significantly. This is because these posts are not directly related to
the main functionality of the site as an online marketplace. Accordingly, we removed from the
corpus all 461 posts containing identified doxxing/targeting information described in Section 4.
The reduced corpus, therefore, comprises 3,104,515 posts. A summary of the removed posts by
board is presented in Table 5.</p>
      <p>Finally, as per the Terms of Service of Torilauta, the site users gave consent to data collection
for academic use by using the site. Consequently, site users could opt out of the data collection by
not submitting new posts and/or contacting the site administration about previously submitted
posts. However, it could be argued that by removing a previously submitted post, a user has
withdrawn the permission to use the data. Unfortunately, as discussed in Section 3.2, the original
data set received from the site administration did not include information about the reasons
behind post deletions. Therefore, we were not able to exclude any posts from the corpus based
on the deletion status.</p>
      <p>Table 5. Removed posts by board (columns: board, # posts).</p>
      <p>
Access Restrictions Due to the limited applicability of data reduction as a means of protecting
data subjects from re-identification, we next discuss restricting the access to the corpus. In
general, the Language Bank recommends sharing resources using standard CC-BY or other
open source licenses, in which case only the metadata of the resource needs to be registered
with the Language Bank, although the Language Bank also hosts openly available and publicly
accessible resources (PUB) in its Language Bank Download service. However, since the FINDarC
resource in its current form contains personal data, both copyright and personal data legislation
apply and the corpus cannot be published with open access. Instead, FINDarC is provided via
protected access under the CLARIN RES licence, which means that permission to download and
use the corpus is only granted to researchers based on written applications, including a data protection
impact assessment, which are reviewed by the data controller (principal investigator of the ENNCODE
consortium). The purpose of this limitation is to ensure that the material is accessed
only by verified researchers for legitimate research purposes. It also lessens sharing-related
risks to both the researchers and the subjects of study, as mandated by the consortium’s data
management policy. Restricting access to the corpus as described here is in line with the current
literature on data sharing [
        <xref ref-type="bibr" rid="ref10 ref26 ref28 ref31 ref9">26, 28, 31, 10, 9</xref>
        ] which also acknowledges the limitations of data
anonymization/reduction and encourages the use of user group limitations.
Corpus Version Control Resources deposited in the Language Bank may have several
different variants (i.e. versions) which form a resource group. Typically, a resource group consists of
different annotations (raw data or preprocessed data for a single corpus), accumulated data (the
content is almost identical but one version has more or newer content), or repaired data (flaws
or necessary modifications, e.g. justified requests to remove or stop processing data, which
have been identified and fixed manually or automatically). As for FINDarC, we emphasize the
importance of the third point and note that if the Language Bank receives a notification and/or
request for removal of content on grounds of sensitive data from a user of the corpus, these
requests will be reviewed and acted upon. In particular, the Language Bank may update the
corpus by reducing the data further and only store and share the most recent version.
      </p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>We discussed the archiving procedure of FINDarC, a Finnish dark web marketplace corpus, in
the Language Bank of Finland. The discussion included an overview of the data, assessment of
the risk and impact of data subject re-identification, assessment and implementation of viable
data reduction approaches using manual and automatic text processing, assessment of privacy
and security measures implemented by the Language Bank of Finland, and a future corpus
management plan implemented and coordinated by the Language Bank of Finland. As a result
of the presented work, a reduced version of the corpus has been archived in the Language Bank
of Finland. Researchers can apply for access to the corpus under the CLARIN RES licence.</p>
      <p>As this article shows, the data set was cleared using best practices for ethical proofreading,
which consistently sought to prioritize the protection of the posters on the forum. Given how
even indirect identifiers could be utilized against the site’s users by either law enforcement
or other members of the drug-using community, it was necessary to opt for maximal removal
efficiency whenever possible. Nevertheless, the amount of data is so large that no clearing can
be considered ethically sufficient for the purpose of releasing the data openly, so we opted for a
gatekeeping approach in addition to clearing everything that could be found.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments References</title>
      <p>LVI.</p>
      <p>[33] Virtanen, A., J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo.
2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.</p>
      <p>[34] Weischedel, R., M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor,
J. Kaufman, M. Franchini, and others. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic
Data Consortium, Philadelphia, PA 23.</p>
      <p>[35] Ylilauta. 2016. Ylilauta-korpuksen ladattava versio [tekstikorpus]. Kielipankki. Available
at http://urn.fi/urn:nbn:fi:lb-2016101210.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Aioanei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jonsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mickelsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mikmekova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.F.</given-names>
            <surname>Valencia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wechsler</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>AnonyMate: A toolkit for anonymizing unstructured chat data</article-title>
          .
          <source>In Proceedings of the Workshop on NLP and Pseudonymisation</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Al Nabki</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fidalgo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Alegre</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>De Paz</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Classifying illegal activities on Tor network based on web textual contents</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Alvim</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.E.</given-names>
            <surname>Andrés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatzikokolakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Degano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Palamidessi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Differential privacy: on the trade-off between utility and information leakage</article-title>
          .
          <source>In International Workshop on Formal Aspects in Security and Trust</source>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>54</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Artstein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Inter-annotator Agreement</article-title>
          .
          <source>Handbook of Linguistic Annotation</source>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>313</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Branwen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Christin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Décary-Hétu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.M.</given-names>
            <surname>Andersen</surname>
          </string-name>
          , StExo, E. Presidente, Anonymous,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.K.</given-names>
            <surname>Sohhlz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cakic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Buskirk</surname>
          </string-name>
          , Whom,
          <string-name>
            <given-names>M.</given-names>
            <surname>McKenna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Goode</surname>
          </string-name>
          .
          <year>2015</year>
          , July.
          <article-title>Dark net market archives, 2011-2015</article-title>
          . https://www.gwern.net/DNM-archives. Accessed: 2022-06-28.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>City</given-names>
            <surname>Digital Group</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Suomi24 virkkeet -korpus 2001-2020, Korp-versio [tekstikorpus]</article-title>
          .
          <source>Kielipankki</source>
          . Available at http://urn.fi/urn:nbn:fi:lb-2021101525.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Csányi</surname>
            ,
            <given-names>G.M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vági</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Vadász</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Orosz</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Challenges and Open Problems of Legal Document Anonymization</article-title>
          .
          <source>Symmetry</source>
          <volume>13</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1490</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Di Cerbo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards personal data identification and anonymization using machine learning techniques</article-title>
          .
          <source>In European Conference on Advances in Databases and Information Systems</source>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>126</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Elliot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mackey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Hara</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>The anonymisation decision-making framework 2nd Edition: European practitioners' guide.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Elliot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Hara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.M.</given-names>
            <surname>O'Keefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mackey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dibben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gowans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Purdam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>McCullagh</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Functional anonymisation: Personal data and the data environment</article-title>
          .
          <source>Computer Law &amp; Security Review</source>
          <volume>34</volume>
          (
          <issue>2</issue>
          ):
          <fpage>204</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Francopoulo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>L.P.</given-names>
            <surname>Schaub</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Anonymization for the GDPR in the Context of Citizen and Customer Relationship Management and NLP</article-title>
          .
          <source>In workshop on Legal and Ethical Issues (Legal2020)</source>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          . ELRA.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Garat</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wonsever</surname>
          </string-name>
          .
          <year>2022</year>
          , Jan.
          <article-title>Automatic Curation of Court Documents: Anonymizing Personal Data</article-title>
          .
          <source>Information</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>27</fpage>
          . https://doi.org/10.3390/INFO13010027.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Glaser</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schamberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          .
          <year>2021</year>
          , Jun.
          <article-title>Anonymization of German legal court rulings</article-title>
          .
          <source>Proceedings of the 18th International Conference on Artificial Intelligence and Law, ICAIL 2021</source>
          :
          <fpage>205</fpage>
          -
          <lpage>209</lpage>
          . https://doi.org/10.1145/3462757.3466087.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>B.M.</given-names>
            <surname>Sundheim</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Message understanding conference-6: A brief history</article-title>
          .
          <source>In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Haasio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.T.</given-names>
            <surname>Harviainen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Savolainen</surname>
          </string-name>
          .
          <year>2020</year>
          , Mar.
          <article-title>Information needs of drug users on a local dark Web marketplace</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>57</volume>
          (
          <issue>2</issue>
          ):
          <fpage>102080</fpage>
          . https://doi.org/10.1016/j.ipm.2019.102080.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Hämäläinen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haasio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.T.</given-names>
            <surname>Harviainen</surname>
          </string-name>
          .
          <year>2021</year>
          , Aug.
          <article-title>Usernames on a Finnish Online Marketplace for Illegal Drugs</article-title>
          .
          <source>Names: A Journal of Onomastics</source>
          <volume>69</volume>
          (
          <issue>3</issue>
          ). https://doi.org/10.5195/NAMES.2021.2234.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Hämäläinen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruokolainen</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Kukkaa, amfea, subua ja essoja: Huumausaineiden slanginimitykset Tor-verkon suomalaisella kauppapaikalla</article-title>
          .
          <source>Sananjalka</source>
          <volume>63</volume>
          :
          <fpage>130</fpage>
          -
          <lpage>153</lpage>
          . https://doi.org/10.30673/sja.106615.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Harviainen</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haasio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Hämäläinen</surname>
          </string-name>
          .
          <year>2020</year>
          , Jan.
          <article-title>Drug traders on a local dark web marketplace</article-title>
          .
          <source>ACM International Conference Proceeding Series</source>
          :
          <fpage>20</fpage>
          -
          <lpage>26</lpage>
          . https://doi.org/10.1145/3377290.3377293.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Harviainen</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruokolainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Siuda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamari</surname>
          </string-name>
          .
          <year>2021</year>
          , Jan.
          <article-title>Information protection in dark web drug markets research</article-title>
          .
          <source>Hawaii International Conference on System Sciences.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.W.</given-names>
            <surname>Chung</surname>
          </string-name>
          .
          <year>2022</year>
          .
          <article-title>Shedding new light on the language of the dark web</article-title>
          .
          <source>arXiv preprint arXiv:2204.06885</source>
          (To appear in NAACL 2022).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Karjalainen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nyrhinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gunnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ylöstalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ståhl</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Huumeiden saatavuus, käyttö ja huumausainerikollisuus Tampereella koronakeväänä 2020</article-title>
          .
          <source>Yhteiskuntapolitiikka</source>
          <volume>86</volume>
          (
          <issue>2</issue>
          ):
          <fpage>80</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Laippala</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ginter</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Syntactic n-gram collection from a large-scale corpus of Internet Finnish</article-title>
          .
          <source>In Human Language Technologies-The Baltic Perspective: Proceedings of the Sixth International Conference Baltic HLT</source>
          , Volume
          <volume>268</volume>
          , pp.
          <fpage>184</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Leedham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Twiner</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Creating a corpus of sensitive and hard-to-access texts: Methodological challenges and ethical concerns in the building of the WiSP Corpus</article-title>
          .
          <source>Applied Corpus Linguistics</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>100011</fpage>
          . https://doi.org/10.1016/j.acorp.2021.100011.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>On the tradeoff between privacy and utility in data publishing</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pp.
          <fpage>517</fpage>
          -
          <lpage>526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Luoma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pyykönen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Laippala</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A Broad-coverage Corpus for Finnish Named Entity Recognition</article-title>
          .
          <source>In Proceedings of the 12th Language Resources and Evaluation Conference</source>
          , Marseille, France, pp.
          <fpage>4615</fpage>
          -
          <lpage>4624</lpage>
          . European Language Resources Association.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Ohm</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Broken promises of privacy: Responding to the surprising failure of anonymization</article-title>
          .
          <source>UCLA Law Review</source>
          <volume>57</volume>
          :
          <fpage>1701</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Oksanen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tamper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tuominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hietanen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>AnoPpi: A pseudonymization service for Finnish court documents</article-title>
          .
          <source>In JURIX 2019</source>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>254</lpage>
          . IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Rubinstein</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hartzog</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Anonymization and risk</article-title>
          .
          <source>Wash. L. Rev.</source>
          <volume>91</volume>
          :
          <fpage>703</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Ruokolainen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kauppinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silfverberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lindén</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Finnish news corpus for named entity recognition</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          ):
          <fpage>247</fpage>
          -
          <lpage>272</lpage>
          . https://doi.org/10.1007/S10579-019-09471-7. arXiv:1908.04212.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Sang</surname>
            ,
            <given-names>E.T.K.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>De Meulder</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</source>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Stalla-Bourdillon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Knight</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Anonymous data v. personal data-false debate: an EU perspective on anonymization, pseudonymization and personal data</article-title>
          .
          <source>Wis. Int'l LJ</source>
          <volume>34</volume>
          :
          <fpage>284</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Tamper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oksanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tuominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hietanen</surname>
          </string-name>
          , and others.
          <year>2018</year>
          .
          <article-title>Anonymization Service for Finnish Case Law: Opening Data without Sacrificing Data Protection and Privacy of Citizens</article-title>
          .
          <source>In International Conference on Law via the Internet</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>