Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC)

Krister Lindén¹,∗,†, Teemu Ruokolainen²,†, Lasse Hämäläinen²,† and J. Tuomas Harviainen²,∗,†

¹ Department of Digital Humanities, University of Helsinki, Yliopistokatu 4, 00014 University of Helsinki, Finland
² Faculty of Information Technology and Communication Sciences, Tampere University, Kalevantie 4, Tampere, 33014, Finland

Abstract
We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a Finnish dark web marketplace website. The site was active from 2017 to 2021 and was during this time one of the most prominent online markets for illegal narcotics in Finland. As a result of the presented work, a reduced version of the corpus, the Finnish Dark Web Marketplace Corpus (FINDarC), has been archived in the Language Bank of Finland. Researchers can apply for access rights to the corpus under the CLARIN RES license. The discussion presented in this paper addresses the archiving process, including assessment of the risk and impact of data subject re-identification, assessment and implementation of viable data anonymization/reduction approaches, assessment of the privacy and security measures implemented by the Language Bank of Finland, and the future corpus management plan coordinated by the Language Bank of Finland.

Keywords
dark web, illegal narcotics, online marketplace, data sharing

1. Introduction

We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a dark web marketplace website, for research purposes. The site was active from 2017 to 2021 and was during this time one of the most prominent online markets for illegal narcotics in Finland. Functionally, the site consisted of discussion imageboards where vendors and customers were able to set up instances of face-to-face trading, typically with the assistance of instant messaging software such as Wickr or Telegram.
Conference on Technology Ethics – Tethics, October 18–19, 2023, Turku, Finland
∗ Corresponding author.
† These authors contributed equally.
krister.linden@helsinki.fi (K. Lindén); teemu.ruokolainen@tuni.fi (T. Ruokolainen); lasse.hamalainen@tuni.fi (L. Hämäläinen); tuomas.harviainen@tuni.fi (J. T. Harviainen)
ORCID: 0000-0003-2337-303X (K. Lindén)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

The original, unmodified data set comprising 3,104,976 posts was collected and handed over to the ENNCODE consortium¹ by the site administration to be archived and shared for research purposes, as permitted by the site's Terms of Service. As a result of the presented work, a reduced version of the corpus comprising 3,104,515 posts, referred to as the Finnish Dark Web Marketplace Corpus (FINDarC), has been deposited in the Language Bank of Finland, a language resource service coordinated by the national FIN-CLARIN consortium formed by Finnish universities and other research organizations. Researchers can contact the Language Bank and apply for permission to access the corpus under the CLARIN RES license.² While dark web online marketplaces, including Torilauta, emphasize user anonymity, the posts submitted to such sites can nevertheless contain personal information, such as unique usernames and personal names, enabling data subject re-identification.
Therefore, as described in this paper, we have made our best effort to assess and identify the type and amount of personal information in the original unmodified data set, to assess and implement viable data anonymization/reduction approaches, to assess the privacy and security measures implemented by the Language Bank of Finland, and to put in place a future corpus management plan coordinated by the Language Bank of Finland. Moreover, any future research based on the corpus is encouraged to implement appropriate ethical proofreading measures (see e.g. [19]) in order to further mitigate any potential harm from access to the material, to both the researchers and the studied populations.

The rest of the paper is organized as follows. We first discuss previous studies on Torilauta and existing related corpora in Section 2. We then provide an overview of the data set in Section 3. In Section 4, we present our experimental work, the purpose of which was to assess the amount of personal and sensitive data in the corpus. Subsequently, given the experimental results, we discuss in detail our data release plan in Section 5. Finally, we provide conclusions on the work in Section 6.

2. Related Work

In this section, we discuss previously published studies using the Torilauta site as a data source and related corpora in Sections 2.1 and 2.2, respectively.

2.1. Previous Studies on Torilauta

Prior to the work presented here, Torilauta was utilized as a data source in multiple linguistic and social science studies [15, 18, 16, 17, 21]. In particular, Haasio et al. [15] examined information needs of drug users using a sample of 9,300 posts.³ Harviainen et al. [18] studied cultural and socioeconomic aspects of drug traders using the same 9,300 post sample. Hämäläinen and Ruokolainen [17] studied narcotic substance vocabulary based on a sample of 3,000 posts. Hämäläinen et al. [16] studied a sample of 1,654 usernames extracted from posts submitted to the site. Karjalainen et al. [21] examined the availability of illegal narcotics during the first wave of the COVID-19 pandemic using a sample of 535 posts.

¹ Consortium website: https://research.tuni.fi/enncode/
² Permanent link to the corpus: http://urn.fi/urn:nbn:fi:lb-2022062221
³ In their paper, Haasio et al. [15] refer to Torilauta using its other commonly used name Sipulitori.

It is notable that none of the previous studies attempted to share their data sets with the research community in a systematic manner. This practice hampers the replication and verification of the published studies and potentially discourages further research on the topic. On the other hand, given the sensitive and potentially incriminating nature of the data, not releasing the data is an understandable approach, since preparing and managing such a resource gives rise to multiple technical, ethical, and legal challenges. The purpose of this paper is to describe and discuss these challenges and how we approached them.

2.2. Related Corpora

To the best of our knowledge, there exist relatively few published dark web corpora or text data sets. Three notable exceptions are the Dark Net Market archives [5], a collection covering 89 dark net markets and over 37 related forums (1.6TB uncompressed) scraped during 2013–2015; DUTA [2], a set of 7,000 text samples formed by sampling the Tor network for two months; and CoDa [20], a set of 10,000 web documents tailored towards text-based dark web analysis. All three corpora comprise primarily English texts and are either publicly downloadable (Dark Net Market archives) or available to researchers upon request (DUTA, CoDa).

Existing Finnish web forum corpora include texts collected from the Ylilauta imageboard [35] and the Suomi24 social networking site [6]. While emphasizing user anonymity, both the Ylilauta and Suomi24 forums operate on the clear web and strictly forbid illegal content.
Both corpora are available for research purposes via the Language Bank of Finland under a Creative Commons (CC BY-NC 4.0) license. The Finnish Internet Parsebank [22] is a large-scale syntactically analyzed text collection created using plain text webpage data made available by the Common Crawl Internet crawl project. Due to the employed web crawling approach to data collection, the corpus is likely to contain web forum content.

Finally, in a recent study, Leedham et al. [23] discussed their work on archiving the hard-to-access WiSP corpus, which consists of texts written by social work professionals describing their work practices. Due to the potentially sensitive nature of the texts, Leedham et al. [23] created two versions of the corpus: one for the research project and an anonymized/reduced version for archiving. In a similar vein, our work presented here aims to provide an extensive discussion on the process of preparing a corpus of potentially sensitive texts for archiving and sharing.

3. Data

In this section, we describe the original, unmodified corpus received by the ENNCODE consortium. We provide an overview of the data as a whole and discuss in more detail the contained data fields, post lifespans, and thread and message lengths in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

3.1. Overview

The original data set received by the consortium included all posts submitted to Torilauta between 2019-09-11 and 2020-05-20 (1,863,639 posts in 251 days) and between 2020-06-17 and 2020-10-31 (1,099,710 posts in 136 days). In addition to the posts collected during these active collection periods, the data contained "residue" posts submitted between 2017-11-02 and 2019-09-11 (141,627 posts in 678 days). Meanwhile, posts submitted between 2020-05-20 and 2020-06-17 were missing completely. In total, the original unmodified corpus therefore consisted of 3,104,976 posts.
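The reported period figures can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the post and day counts stated above; the dictionary layout and the function name are illustrative assumptions and not part of the corpus tooling.

```python
# Figures reported in Section 3.1: label -> (posts, days).
# The layout of this mapping is an assumption made for illustration.
PERIODS = {
    "residue (2017-11-02 to 2019-09-11)": (141_627, 678),
    "first collection (2019-09-11 to 2020-05-20)": (1_863_639, 251),
    "second collection (2020-06-17 to 2020-10-31)": (1_099_710, 136),
}


def posts_per_day(periods):
    """Average number of posts per day for each reported period."""
    return {label: posts / days for label, (posts, days) in periods.items()}


# The three periods together account for the whole original corpus.
total_posts = sum(posts for posts, _ in PERIODS.values())
rates = posts_per_day(PERIODS)

for label, rate in rates.items():
    print(f"{label}: {rate:.0f} posts/day")
print(f"total: {total_posts} posts")
```

The per-day averages illustrate why the earliest posts are described as residue: that period averages a few hundred posts per day, against several thousand per day during the two active collection periods.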
Table 1 presents an overview of the post and thread frequencies in the data grouped by boards along with brief topic descriptions. Of the 32 boards, the board with the highest activity measured by the total number of submitted posts and threads was the market board dedicated to narcotics transactions within the city of Helsinki (/hki). The total number of posts submitted to this board was 787,459, corresponding to 25.4% of all posts in the data. Meanwhile, in total 96.5% (2,997,624) of all posts were submitted to the 16 boards dedicated to transactions (denoted with (market) in the topic column of Table 1).

Table 1
An overview of the number of submitted posts by board. Board topics marked with (market) indicate that the board was dedicated to transactions.

board  topic                         # posts  posts (%)  # threads  threads (%)
hki    City of Helsinki (market)     787,459       25.4    239,750         25.9
tre    City of Tampere (market)      435,463       14.0    152,410         16.5
vnt    City of Vantaa (market)       242,848        7.8     74,173          8.0
oulu   City of Oulu (market)         242,240        7.8     86,114          9.3
muut   Other areas (market)          232,513        7.5     29,293          3.2
tku    City of Turku (market)        232,225        7.5     75,121          8.1
esp    City of Espoo (market)        224,707        7.2     57,229          6.2
kpo    City of Kuopio (market)       155,304        5.0     27,903          3.0
jkl    City of Jyväskylä (market)    145,890        4.7     49,079          5.3
lti    City of Lahti (market)        110,358        3.6     35,860          3.9
bulk   Bulk transactions (market)     65,220        2.1     17,070          1.8
vsa    City of Vaasa (market)         60,137        1.9     23,795          2.6
t      Dates                          33,688        1.1     13,694          1.5
roi    City of Rovaniemi (market)     27,850        0.9     11,460          1.2
seka   Miscellaneous (market)         22,208        0.7      9,566          1.0
hm     Narcotics markets              18,280        0.6      2,204          0.2
b      Random                          9,068        0.3      1,249          0.1
h      Narcotics                       7,942        0.3      1,753          0.2
y      Jobs                            7,579        0.2      2,532          0.3
pm     Mail orders (market)            7,023        0.2      2,074          0.2
a      Everyday/mundane                6,772        0.2      1,351          0.1
hox    Hormones (market)               6,180        0.2      2,137          0.2
hax    Hacking                         4,197        0.1      1,203          0.1
kkk    Cultivation                     3,725        0.1      1,349          0.1
meta   Meta discussion                 3,646        0.1        827          0.1
test   Testing                         2,927        0.1      2,240          0.2
spam   Spamming                        2,495        0.1      2,495          0.3
rotta  Vendor feedback                 2,220        0.1        314          0.0
tt     Health                          1,838        0.1        248          0.0
k      Getting and staying sober       1,562        0.1        247          0.0
fap    Porn                            1,323        0.0        343          0.0
pgp    PGP public keys                    89        0.0          2          0.0
total                              3,104,976      100.0    925,085        100.0

3.2. Data Fields

Each post in the corpus is represented as a data structure with 8 fields as shown in Table 2. Each field belongs to one of the following three types: a string, an integer, or a date. All dates are in timezone UTC+0 (GMT+0). Note that throughout this paper, we refer to a set of these 8 data fields as a post, whereas the content of the text data field within a single post is referred to as the message or text body.

Table 2
Data fields comprising a single post. The column titled missing (%) indicates the portion of all posts where the field value is not available.

field     description                   example                    missing (%)
boardUri  board identifier              roi                                0.0
creation  post creation datetime (UTC)  2020-01-14T17:51:24.714Z           0.0
deletion  post deletion datetime (UTC)  2020-01-27T16:49:03.663Z           2.8
threadId  thread identifier             27961                              0.0
postId    post identifier               28069                             29.8
name      poster name                   example-name                      54.1
subject   message subject               Example message subject           46.2
message   message text body             Example message text body          0.0

Missing values have different meanings depending on the field. For the deletion field, a missing value means the post was never deleted. The first post of a thread, referred to as the original post (OP), always has a missing postId value and is instead identified by its (boardUri, threadId) pair. The subject field is missing for 46.2% of all posts since it was common practice to omit the subject. Similarly, the poster name field is missing for 54.1% of all posts. This is in line with the anonymous nature of the site; moreover, any optional contact information, such as an instant messenger username, was often included in the message text body instead.

Finally, the posts submitted to Torilauta optionally contained an attached image. However, no images were included in the original data set. Moreover, the data fields comprising a post did not include information on whether the post contained an image or not.

3.3. Post Lifespans

Submitted posts were deleted from the site for three main reasons. First, the site hosted a fixed number of threads on each board at a given time, and so inactive threads were regularly removed by an automatic pruning mechanism to make room for new, active threads. Second, posts which violated the site rules (e.g. spam) were removed by the site administration. Third, the site interface did not provide users with means to edit messages and, therefore, the only way to correct erroneous message content (e.g. typos, updates) was to delete the post and resubmit. Lastly, a small portion of posts were "pinned" by the site administration, that is, they were meant to stay available on the site indefinitely.

There are two known caveats related to the deletion timestamps. First, while the data set included the creation and deletion times of posts, it unfortunately did not include information about the reason for the deletion. Second, all posts deleted during the pause in collection between 2020-05-20 and 2020-06-17 had their deletion value marked as missing and, therefore, appeared as if they were not deleted. The number of these potentially erroneous missing values was, however, relatively small, and 97.22% of all 3,104,976 posts in the data had reliable deletion time information.

Finally, we estimated the median lifespans of submissions to the market and non-market boards to be 23 and 238 hours, respectively. The difference was mainly due to the lower posting frequency and consequently lower thread pruning frequency of the non-market boards. For noise filtering purposes, we are mostly interested in the messages with short lifespans. To this end, we note that 5% of all messages had a lifespan of less than 32 minutes.
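A post lifespan can be derived from the creation and deletion fields of Table 2. The sketch below is an illustration only: it assumes that a post is available as a Python dictionary keyed by the field names of Table 2, with a missing deletion stored as None, and the function names are hypothetical. The 32-minute default is the 5% threshold reported above.

```python
from datetime import datetime, timedelta

# Timestamp layout taken from the examples in Table 2,
# e.g. 2020-01-14T17:51:24.714Z (UTC).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc(stamp):
    """Parse a timestamp string of the form shown in Table 2."""
    return datetime.strptime(stamp, TIMESTAMP_FORMAT)


def lifespan(post):
    """Return deletion minus creation, or None if deletion is missing."""
    if post.get("deletion") is None:
        return None
    return parse_utc(post["deletion"]) - parse_utc(post["creation"])


def is_short_lived(post, threshold=timedelta(minutes=32)):
    """True if the post was deleted sooner than the threshold after creation."""
    span = lifespan(post)
    return span is not None and span < threshold


# The example values of Table 2 give a lifespan of almost 13 days.
example = {
    "creation": "2020-01-14T17:51:24.714Z",
    "deletion": "2020-01-27T16:49:03.663Z",
}
print(lifespan(example), is_short_lived(example))
```

Posts with a missing deletion value are never flagged by this sketch, which matches the reading that such posts were either never deleted or were deleted during the pause in collection.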
Since posts with such short lifespans were likely removed by the user and resubmitted after minor modifications, they may be discarded as noise.⁴

3.4. Thread and Message Lengths

By and large, the threads on the market boards were rather short: the majority (57.9%) consisted of a single post (that is, the thread had no reply and/or follow-up messages to the original post) and 99% of the threads had 23 posts or fewer. Similarly, on the non-market boards, 64.18% of the threads consisted of a single post and 99% of the threads had 35 posts or fewer.

In order to examine individual message lengths, we tokenized the message bodies using the tokenizer included in the Finnish Tagtools (v1.5) software developed at the University of Helsinki.⁵ The tokenizer splits running text into sentences and extracts punctuation and special characters from word bodies into separate tokens. The median lengths of original posts and reply/follow-up posts on the market boards were 36 and 6 word tokens, respectively. In other words, the original posts were typically short (a few sentences) and reply/follow-up posts even shorter (a few words). As for the reply/follow-up posts, this was because of the prevalence of short update posts, the purpose of which was to keep the thread alive and close to the front page. These types of posts comprised roughly one fifth of all market board reply/follow-up posts. Therefore, one can reduce the amount of noise in FINDarC considerably by ignoring reply/follow-up posts with, e.g., fewer than 5 word tokens (or fewer than 20 characters) in the message body.⁶

⁴ This is in agreement with the recommendation of the site administrator.
⁵ Available at http://urn.fi/urn:nbn:fi:lb-2021042102
⁶ This is in agreement with the recommendation of the site administrator.

4. Assessing the Frequency and Nature of Personal Data

In this section, we present our experimental work, the purpose of which was to assess the amount and types of personal and sensitive data in the FINDarC corpus. As the size of the corpus exceeded 3,000,000 messages, it was deemed infeasible to curate all the posts manually given the consortium resources. In Section 4.1, we examine the types of personal data found in the message bodies using an annotation approach. In Section 4.2, we discuss experiments utilizing full-text search with hand-crafted regular expressions. Finally, the implications of the results for data reduction and data release are discussed in Section 5.

4.1. Manual Annotation

In this part of the experiments, we examined the types of personal data contained in the message bodies of posts using a manual annotation approach. In particular, we were interested in the types of personal information found in the market board submissions, which contain the majority of posts in the data set (97%) and represent the primary function of the site as a narcotics marketplace.

4.1.1. Definition of Personal Data

We adopted Article 4(1) of the GDPR,⁷ which defines personal data as

    any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

and divided potential identifiers into the following 5 classes:

1. person_name (e.g. first name, family name)
2. id_number (e.g. the Finnish social security number)
3. location (e.g. street address, city, city district)
4. online_id (e.g. instant messaging username, email address, IP address)
5. other (e.g. phone number, identifying physical appearance)

Moreover, GDPR Article 9(1)⁸ further defines the special categories of personal data as

    Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person's sex life or sexual orientation shall be prohibited.

For cases belonging to this group, we added a sixth potential identifier class:

6. special_category

⁷ https://gdpr-info.eu/art-4-gdpr/
⁸ https://gdpr-info.eu/art-9-gdpr/

Given the personal data class specifications, the annotation task then comprised two subtasks: entity span detection and class assignment, performed on a word token level over continuous, non-overlapping sequences of tokens.⁹ For example, consider

    priimaa kukkaa [Helsingin keskustassa]_location ! yhteydenotot Wickerillä W // [example-name]_online_id
    excellent bud [in the centre of Helsinki]_location ! contact Wicker W // [example-name]_online_id

4.1.2. Personal Data versus Named Entities

The definition of personal data and the personal data classes described in the previous section are closely related to named entities [14, 30], in particular, person and location names. However, while there is an overlap between the personal data classes and the named entities, they are not equivalent, nor is one a subset of the other. There are two main reasons for this.

First, a person/location name mentioned in running text is almost invariably identified as a named entity independent of the context. For example, consider the sentence

    Sauli Niinistö on Suomen presidentti.
    Sauli Niinistö is the President of Finland.

where Sauli Niinistö and Suomen (= Finland) are defined as person and location named entities, respectively, according to the Finnish NER guidelines [29, 25].
In contrast, neither of these should be interpreted as personal data according to the GDPR, as stating the name and occupation of a well-known public figure does not breach their privacy.

Second, not all pieces of personal name or location information found in a message will be classified as named entities. This is because the definition of personal data takes into account a wider context than what is directly available in the message. In other words, any piece of information found in text relating to an individual can be considered personal data if it is reasonable to assume that the information in question could be combined with other information to identify that individual. For example, consider the modified version of a post in the corpus presented in Table 3. From the information contained in this example post, one can infer that

1. at 12:12 (UTC) on January 1st 2019 (according to the post creation timestamp)
2. a person using the Wicker username example-name (according to the contact information included in the message)
3. was selling illegal narcotics (according to subject and message) in the Helsinki area (according to the post boardUri) and
4. more specifically in the Eastern part of Helsinki and
5. with a high likelihood near the metro track.

In consequence, we would regard the mentions of East (idässä) and metro (metro) in the message body as personal location data. In contrast, these mentions would not be considered named location entities according to the Finnish NER annotation guidelines.

⁹ The tokenization of messages was acquired using the Finnish Tagtools (ver. 1.5) toolkit.

Table 3
A modified example post containing personal data.

field     value
boardUri  hki
creation  2019-01-01T12:12:12.123Z
deletion  2019-02-02T13:13:13.345Z
threadId  12345
postId    23456
name      [missing]
subject   kukkaa myynnissä (selling bud)
message   kukkaa myynnissä idässä. kuljetus metron lähelle. Wicker // example-name
          (selling bud in the east. delivery nearby metro. Wicker // example-name)

4.1.3. Manual NER Annotation

The results of a preliminary manual annotation showed, rather unsurprisingly, that the majority of personal information cases consisted of instant messenger usernames and areas posted as potential settings for face-to-face transactions. On the other hand, the preliminary experiment also revealed that even native Finnish speakers can struggle with the text domain and that manually curating large parts of the corpus would be an excessively tedious effort. In turn, this suggests that automatic text processing methods are also likely to struggle with the text domain.

4.1.4. Automated NER Annotation

Text anonymization/reduction approaches proposed in the literature commonly utilize automatic NER as a part of their processing pipelines to varying extents [8, 32, 1, 27, 7, 11, 13, 12]. Ideally, NER tools would also be useful when processing FINDarC as they could, in principle, be employed to automatically detect some direct personal identifiers, such as names and addresses. However, we found this approach problematic to implement, as the number of posts marked with person and location entities was in the hundreds of thousands. While substantially smaller than the original data set of over 3,000,000 posts, this set was still too large to be curated manually given the available consortium resources.

Examining the predictions on the manually annotated data set suggested that the available tools suffered from a domain mismatch in addition to the inherent mismatch between personal data and named entity classes. This was not a completely surprising outcome, since the text domain could also cause problems for human annotators. Because the tools tended both to miss entities of interest (low recall) and to be incorrect when detecting entities (low precision), we did not consider them efficient curating tools for FINDarC in their current state and instead continued to the full-text search experiments presented in Section 4.2.

4.2. Full-Text Search

In this section, we assess the amount and types of personal data in the data set using a full-text search approach. In particular, we are interested in finding common personal identifiers with relatively rigid formats, such as social security numbers and phone numbers. We describe the experiment setup in Section 4.2.1, the applied textual patterns (regular expressions) in Section 4.2.2, and the obtained results in Section 4.2.3.

4.2.1. Setup

We define a target set of textual patterns (regular expressions) and search for matches in the message bodies. Specifically, we are interested in finding expressions matching

1. (Finnish) social security numbers
2. (Finnish) phone numbers
3. Email addresses
4. IBAN bank accounts
5. IP addresses

all of which have relatively rigid formats. The employed regular expressions are presented in Section 4.2.2. We apply the search to all posts in the data and assign the matches manually to personal data and non-personal data according to the post context. The pattern matching is performed using MongoDB (v.5.0) full text search.¹⁰ In contrast to the manual annotation experiment discussed in Section 4.1, we do not filter out noise from the data and instead apply the search to all 3,104,976 posts in the original corpus.

4.2.2. Regular Expressions

In what follows, we provide brief descriptions of the applied regular expressions.

Social security numbers
The Finnish social security number (SSN) is a sequence of 11 characters assigned to individuals by the Finnish government based on their date of birth and gender. The first 10 characters of the sequence are 6 digits (date of birth) followed by a hyphen or the letter A, followed by 3 digits. The last character is alphanumeric, i.e., a digit or a letter. Valid sequences therefore have formats such as "121212-1234" and "121212-123A". We detect the sequences using the regular expression

    \d\d\d\d\d\d\-\d\d\d[a-zA-Z0-9]
Persons born in the 2000s, whose numbers would have an "A" instead of the hyphen, were not found in the sample.

Phone numbers
According to the specification of the Finnish telephone network numbering, Finnish mobile phone numbers begin with a routing number (04-, 050, or 059) and are followed by a subscriber number, such as "040 1234567", "059 1234567", and so forth.¹¹ The first zero ("0") of the number can optionally be replaced by the country code of Finland +358 (e.g. "+358 40 1234123", "+358 59 4321432", etc.). Based on a preliminary examination of the data set, we detect common phone number formats using two regular expressions:

    [\+]?358[\-\s]?0[45][\-\s]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d

which matches numbers starting with the country code, and

    0[45]\d[\s\-]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d

which detects numbers with the country code omitted. Moreover, the expressions detect the most commonly used grouping patterns using hyphens (e.g. "+358-40-12345-567") and whitespaces (e.g. "059 123 4567"). While the subscriber part of the number can, in principle, vary in length, the patterns match the most common length of 7 digits. Landline numbers would be shorter but follow the same principles; however, none were found in the data.

¹⁰ https://www.mongodb.com/
¹¹ Specification of numbers in the Finnish phone network is available at: https://www.finlex.fi/fi/viranomaiset/normi/480001/47180

Email addresses
According to the RFC 5322 standard,¹² an email address is an identifier which contains a locally interpreted string followed by the at-character ("@") followed by an internet domain, such as "name@domain.com", "firstname.surname@subdomain.domain.com", and "underscore-hyphen-plus+sign@domain.com". We detect the addresses using the regular expression

    \S+\@\S+\.\S+

which successfully detects all the above examples from a running text.
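To illustrate how such patterns behave on running text, the following minimal sketch transcribes the social security number, national-format phone number, and email expressions into Python's re module. It is an illustration under stated assumptions rather than the search setup used in the experiments, which relied on MongoDB: the dictionary, the labels, and the function name find_candidates are hypothetical, repeated \d tokens are abbreviated with counted repetition, and the country-code phone expression is left out of the sketch.

```python
import re

# Transcriptions of three expressions from this section.
# Labels and names are assumptions made for illustration.
PATTERNS = {
    # 6 digits, a hyphen, 3 digits, and one alphanumeric character.
    "hetu": re.compile(r"\d{6}-\d{3}[a-zA-Z0-9]"),
    # National format: 04x/05x routing number followed by 7 digits,
    # with optional hyphen or whitespace grouping.
    "phone": re.compile(r"0[45]\d[\s\-]?\d{3}[\-\s]?\d[\-\s]?\d{3}"),
    # Non-whitespace, an at-character, non-whitespace, a dot, non-whitespace.
    "email": re.compile(r"\S+@\S+\.\S+"),
}


def find_candidates(message):
    """Return (label, matched text) pairs for every pattern match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(message):
            hits.append((label, match.group(0)))
    return hits


print(find_candidates("soita 040 1234567 tai name@domain.com"))
```

Every hit returned by such a search is only a candidate: as described in the setup, each match still has to be assigned manually to personal or non-personal data according to the post context.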
IBAN bank accounts
We search for bank account numbers matching the International Bank Account Number (IBAN) structure specified by the ISO 13616-1:2020 standard.¹³ The IBAN formatted numbers consist of the Finnish bank account number (14 digits) preceded by a two letter country code ("FI" for Finland) and two check digits (e.g. "FI72 1234 5678 1234 12"). We detect the pattern using the regular expression

    [Ff][Ii]\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d

which takes into consideration the letter case of the country code and the commonly used grouping whitespaces.

IP addresses
IP (internet protocol) addresses are unique addresses which identify devices on the internet and local networks. We search for IP addresses using the following regular expression

    ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

which matches patterns such as 88.777.66.555 and so forth.

4.2.3. Results

The frequencies of matched social security numbers, phone numbers, email addresses, bank account numbers, and IP addresses are presented in Table 4. As shown, the most and least frequent matched types were email addresses and bank account numbers with 1,840 and 12 regular expression matches, respectively. Due to the sufficiently low number of original matches, we were able to perform manual verification of all the cases, the results of which are also presented in Table 4.

Table 4
Matched regular expression frequencies. The columns titled matches and verified denote the number of found regular expression matches and the number of manually verified cases, respectively. The column titled posts denotes the number of distinct posts where the verified cases occurred. The row hetu (henkilötunnus) refers to the Finnish social security number.

type        matches  verified  posts
phone           875       858    699
hetu             91        73     65
email         1,840     1,837  1,707
iban             12        12     12
ip_address      121        16     14
total         2,939     2,796  2,261

According to this inspection, the phone numbers and email addresses occurred in two contexts. In the first context, similarly to the instant messaging usernames, 491 of the 858 phone numbers and 1,622 of the 1,837 email addresses were posted as contact information by the individuals themselves. In the second context, the remaining cases were posted as a means of targeting people. In such cases, personal details (e.g., name, relationship information, area of residence) were shared in connection with one or more usernames, in order to paint the person as a potential target for violence. Bank account numbers occurred similarly in two contexts. Out of the 16 IP addresses, 10 cases were included as a means of targeting, while the remaining 6 were provided as a type of contact information. Finally, all 73 and 12 found cases of social security numbers and bank account numbers were posted with a purpose of targeting. Thus, we identified in total 667 cases of targeting in 295 posts using this method.

¹² The RFC 5322 specification is available at: https://datatracker.ietf.org/doc/html/rfc5322
¹³ https://www.iso.org/standard/81090.html

Finally, we created a second regular expression list using words and prefixes related to the personal information contained in the identified 295 targeting posts. This list consisted of 77 keywords and parts of person names and addresses.¹⁴ After performing a second search with these patterns and a subsequent manual inspection, we identified an additional set of 166 posts submitted as a means of targeting.

4.2.4. Discussion

Similarly to automatic NER, rule-based search for text patterns is commonly used as a part of text anonymization pipelines to varying extents [8, 32, 1, 27, 7, 11, 13, 12]. For multiple cases, such as email addresses and social security numbers, it is rather straightforward to write the patterns as regular expressions.
Moreover, one can efficiently find pattern matches in large data sets using suitable databases, such as MongoDB, or search engines. However, the caveat in examining the results is, of course, that manually verifying the matches only provides us with an estimate of the precision of the search while neglecting recall. In other words, one cannot estimate the number of cases missed by the search in the whole corpus without manually curating thousands or, preferably, tens of thousands of posts. Unfortunately, an annotation effort of this scale was not feasible given the available resources of the consortium. Nevertheless, using the search combined with manual examination of the search results, we were able to uncover a set of 461 posts submitted with the purpose of targeting individuals. While extremely rare compared to the total number of posts in the corpus, these posts are particularly interesting from the point of view of data reduction and will be discussed further in Section 5.

14 We do not present the list here due to obvious privacy issues.

5. Data Release
In this section, we draw on the discussion and experimental results presented in the previous sections and present an outline of the FINDarC data release.

Data Reduction
Conventionally, the most direct approach to protect data subjects from re-identification has been to anonymize the data by removing/obscuring the parts containing personal information [26]. However, it appears evident that, if implemented successfully, this type of processing would have a profound impact on the usefulness of FINDarC for research purposes. For example, subsequent to removing usernames from their post contexts or from the data altogether, one would not be able to replicate the study of Hämäläinen et al. [16], who examined how sellers and buyers of illegal drugs represent themselves in their usernames.
In turn, subsequent to removing location and/or timestamp data, one would no longer be able to replicate the study of Karjalainen et al. [21], who studied the availability of drugs specifically in the city of Tampere during the COVID-19 epidemic in the spring of 2020. From a utility point of view, therefore, it could be argued that removing personal information from the buy/sell post threads would quickly degrade, or destroy, the usefulness of the corpus as a data source for research. This problem is generally referred to as the privacy-utility trade-off within the data privacy literature [24, 3].

Due to the problematic privacy-utility trade-off, we posit here that reducing FINDarC extensively would not be appropriate even if sufficient resources could be allocated for domain-specific tool development and manual labour. Furthermore, we note that Torilauta and other drug trading sites have also been under observation by other parties, including both criminals and law enforcement agencies. Therefore, it is our assessment that leaving the sell/buy posts, which form the majority of FINDarC, largely intact poses few additional risks to the studied populations.

However, as discussed in Section 4.2, in addition to the sell/buy posts, the data also contains posts with the intention of doxxing/targeting individuals. Here, our position is that removing these submissions is warranted from an ethical point of view while not significantly decreasing the value of the corpus as a data source. This is because these posts are not directly related to the main functionality of the site as an online marketplace. Accordingly, we removed from the corpus all 461 posts containing identified doxxing/targeting information described in Section 4. The reduced corpus, therefore, comprises 3,104,515 posts. A summary of the removed posts by board is presented in Table 5.
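The reduction step described above amounts to filtering the corpus against the set of identified post identifiers. The following Python snippet is a minimal sketch of that idea, assuming each post is a dictionary with the hypothetical fields post_id and board; the actual FINDarC data schema is not described here, and the function names are illustrative. The final assertions only restate the counts reported in the text.

```python
from collections import Counter


def reduce_corpus(posts, flagged_ids):
    """Split posts into (kept, removed) according to a set of flagged post ids."""
    kept, removed = [], []
    for post in posts:
        target = removed if post["post_id"] in flagged_ids else kept
        target.append(post)
    return kept, removed


def removed_by_board(removed):
    """Count removed posts per board, most affected boards first."""
    return Counter(post["board"] for post in removed).most_common()


if __name__ == "__main__":
    # Toy data; field names and values are illustrative assumptions.
    posts = [
        {"post_id": 1, "board": "hm"},
        {"post_id": 2, "board": "b"},
        {"post_id": 3, "board": "hm"},
        {"post_id": 4, "board": "tre"},
    ]
    kept, removed = reduce_corpus(posts, flagged_ids={1, 2, 3})
    print(len(kept), removed_by_board(removed))

    # Consistency check of the counts reported in the text:
    # 295 posts from the first search plus 166 from the second search,
    # removed from the 3,104,976 posts of the original data set.
    assert 295 + 166 == 461
    assert 3_104_976 - 461 == 3_104_515
```

Keeping the removed posts in a separate list, rather than discarding them during filtering, makes it straightforward to produce a per-board summary of the kind shown in Table 5.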
Finally, as per the Terms of Service of Torilauta, the site users gave consent to data collection for academic use by using the site. Consequently, site users could opt out of the data collection by not submitting new posts and/or contacting the site administration about previously submitted posts. However, it could be argued that by removing a previously submitted post, a user has withdrawn the permission to use the data. Unfortunately, as discussed in Section 3.2, the original data set received from the site administration did not include information about the reasons behind post deletions. Therefore, we were not able to exclude any posts from the corpus based on the deletion status.

Table 5
Number of posts removed from FINDarC prior to archiving due to personal information used for targeting.

board   # posts
hm          113
b           110
esp          62
muut         41
hki          23
t            19
tre          14
a            10
rotta        10
tku           9
jkl           8
kpo           7
bulk          5
vnt           5
spam          4
seka          4
vsa           3
oulu          3
roi           2
hox           2
pm            1
meta          1
k             1
hax           1
h             1
fap           1
lti           1
total       461

Access Restrictions
Due to the limited applicability of data reduction as a means of protecting data subjects from re-identification, we next discuss restricting the access to the corpus. In general, the Language Bank recommends sharing resources using standard CC-BY or other open source licenses, in which case only the metadata of the resource needs to be registered with the Language Bank, although the Language Bank also hosts openly available and publicly accessible resources (PUB) in its Language Bank Download service. However, since the FINDarC resource in its current form contains personal data, both copyright and personal data legislation apply and the corpus cannot be published with open access.
Instead, FINDarC is provided via protected access under the CLARIN RES licence, which means that permission to download and use the corpus is only granted to researchers based on written applications reviewed by the data controller (principal investigator of the ENNCODE consortium), including a data protection impact assessment. The purpose of this limitation is to ensure that the material is accessed only by verified researchers for legitimate research purposes. It also lessens sharing-related risks to both the researchers and the subjects of study, as mandated by the consortium’s data management policy. Restricting access to the corpus as described here is in line with the current literature on data sharing [26, 28, 31, 10, 9], which also acknowledges the limitations of data anonymization/reduction and encourages the use of user group limitations.

Corpus Version Control
Resources deposited in the Language Bank may have several different variants (i.e. versions) which form a resource group. Typically, a resource group consists of different annotations (raw data or preprocessed data for a single corpus), accumulated data (the content is almost identical but one version has more or newer content), or repaired data (flaws or necessary modifications, e.g. justified requests to remove or stop processing data, which have been identified and fixed manually or automatically). As for FINDarC, we emphasize the importance of the third category and note that if the Language Bank receives a notification and/or request from a user of the corpus for removal of content on grounds of sensitive data, these requests will be reviewed and acted upon. In particular, the Language Bank may update the corpus by reducing the data further and only store and share the most recent version.

6. Conclusions
We discussed the archiving procedure of FINDarC, a Finnish dark web marketplace corpus, in the Language Bank of Finland.
The discussion included an overview of the data, assessment of the risk and impact of data subject re-identification, assessment and implementation of viable data reduction approaches using manual and automatic text processing, assessment of privacy and security measures implemented by the Language Bank of Finland, and a future corpus management plan implemented and coordinated by the Language Bank of Finland. As a result of the presented work, a reduced version of the corpus has been archived in the Language Bank of Finland. Researchers can apply for access to the corpus under the CLARIN RES licence.

As this article shows, the data set was cleared using best practices for ethical proofreading, which consistently sought to prioritize the protection of the posters on the forum. Given how even indirect identifiers could be utilized against the site’s users by either law enforcement or other members of the drug-using community, it was necessary to opt for maximal removal efficiency whenever possible. Nevertheless, the amount of data is so large that no clearing can be considered ethically sufficient for the purpose of releasing the data openly, so we opted for a gatekeeping approach in addition to clearing everything that could be found.

Acknowledgments

References
[1] Adams, A., E. Aili, D. Aioanei, R. Jonsson, L. Mickelsson, D. Mikmekova, F. Roberts, J.F. Valencia, and R. Wechsler. 2019. AnonyMate: A toolkit for anonymizing unstructured chat data. In Proceedings of the Workshop on NLP and Pseudonymisation, pp. 1–7.
[2] Al Nabki, M.W., E. Fidalgo, E. Alegre, and I. De Paz. 2017. Classifying illegal activities on Tor network based on web textual contents. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 35–43.
[3] Alvim, M.S., M.E. Andrés, K. Chatzikokolakis, P. Degano, and C. Palamidessi. 2011. Differential privacy: on the trade-off between utility and information leakage. In International Workshop on Formal Aspects in Security and Trust, pp. 39–54. Springer.
[4] Artstein, R. 2017. Inter-annotator Agreement. Handbook of Linguistic Annotation, pp. 297–313. Springer.
[5] Branwen, G., N. Christin, D. Décary-Hétu, R.M. Andersen, StExo, E. Presidente, Anonymous, D. Lau, D.K. Sohhlz, V. Cakic, V. Buskirk, Whom, M. McKenna, and S. Goode. 2015, July. Dark net market archives, 2011-2015. https://www.gwern.net/DNM-archives. Accessed: 2022-06-28.
[6] City Digital Group. 2021. Suomi24 virkkeet -korpus 2001-2020, Korp-versio [tekstikorpus]. Kielipankki. Available at http://urn.fi/urn:nbn:fi:lb-2021101525.
[7] Csányi, G.M., D. Nagy, R. Vági, J.P. Vadász, and T. Orosz. 2021. Challenges and Open Problems of Legal Document Anonymization. Symmetry 13(8): 1490.
[8] Di Cerbo, F. and S. Trabelsi. 2018. Towards personal data identification and anonymization using machine learning techniques. In European Conference on Advances in Databases and Information Systems, pp. 118–126. Springer.
[9] Elliot, M., E. Mackey, and K. O’Hara. 2020. The anonymisation decision-making framework 2nd Edition: European practitioners’ guide.
[10] Elliot, M., K. O’Hara, C. Raab, C.M. O’Keefe, E. Mackey, C. Dibben, H. Gowans, K. Purdam, and K. McCullagh. 2018. Functional anonymisation: Personal data and the data environment. Computer Law & Security Review 34(2): 204–221.
[11] Francopoulo, G. and L.P. Schaub. 2020. Anonymization for the GDPR in the Context of Citizen and Customer Relationship Management and NLP. In workshop on Legal and Ethical Issues (Legal2020), pp. 9–14. ELRA.
[12] Garat, D. and D. Wonsever. 2022, jan. Automatic Curation of Court Documents: Anonymizing Personal Data. Information 13(1): 27. https://doi.org/10.3390/INFO13010027.
[13] Glaser, I., T. Schamberger, and F. Matthes. 2021, jun. Anonymization of German legal court rulings. Proceedings of the 18th International Conference on Artificial Intelligence and Law, ICAIL 2021: 205–209. https://doi.org/10.1145/3462757.3466087.
[14] Grishman, R. and B.M. Sundheim. 1996. Message understanding conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics.
[15] Haasio, A., J.T. Harviainen, and R. Savolainen. 2020, mar. Information needs of drug users on a local dark Web marketplace. Information Processing and Management 57(2): 102080. https://doi.org/10.1016/j.ipm.2019.102080.
[16] Hämäläinen, L., A. Haasio, and J.T. Harviainen. 2021, aug. Usernames on a Finnish Online Marketplace for Illegal Drugs. Names: A Journal of Onomastics 69(3). https://doi.org/10.5195/NAMES.2021.2234.
[17] Hämäläinen, L. and T. Ruokolainen. 2021. Kukkaa, amfea, subua ja essoja: Huumausaineiden slanginimitykset Tor-verkon suomalaisella kauppapaikalla. Sananjalka 63: 130–153. https://doi.org/10.30673/sja.106615.
[18] Harviainen, J.T., A. Haasio, and L. Hämäläinen. 2020, jan. Drug traders on a local dark web marketplace. ACM International Conference Proceeding Series: 20–26. https://doi.org/10.1145/3377290.3377293.
[19] Harviainen, J.T., A. Haasio, T. Ruokolainen, L. Hassan, P. Siuda, and J. Hamari. 2021, 1. Information protection in dark web drug markets research. Hawaii International Conference on System Sciences.
[20] Jin, Y., E. Jang, Y. Lee, S. Shin, and J.W. Chung. 2022. Shedding new light on the language of the dark web. arXiv preprint arXiv:2204.06885 (To appear in NAACL 2022).
[21] Karjalainen, K., R. Nyrhinen, T. Gunnar, T. Ylöstalo, and T. Ståhl. 2021. Huumeiden saatavuus, käyttö ja huumausainerikollisuus Tampereella koronakeväänä 2020. Yhteiskuntapolitiikka 86(2): 80–90.
[22] Laippala, V. and F. Ginter. 2014. Syntactic n-gram collection from a large-scale corpus of Internet Finnish. In Human Language Technologies-The Baltic Perspective: Proceedings of the Sixth International Conference Baltic HLT, Volume 268, pp. 184.
[23] Leedham, M., T. Lillis, and A. Twiner. 2021. Creating a corpus of sensitive and hard-to-access texts: Methodological challenges and ethical concerns in the building of the WiSP Corpus. Applied Corpus Linguistics 1(3): 100011. https://doi.org/10.1016/j.acorp.2021.100011.
[24] Li, T. and N. Li. 2009. On the tradeoff between privacy and utility in data publishing. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 517–526.
[25] Luoma, J., M. Oinonen, M. Pyykönen, V. Laippala, and S. Pyysalo. 2020. A Broad-coverage Corpus for Finnish Named Entity Recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4615–4624. European Language Resources Association.
[26] Ohm, P. 2009. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review 57: 1701.
[27] Oksanen, A., M. Tamper, J. Tuominen, A. Hietanen, and E. Hyvönen. 2019. AnoPpi: A pseudonymization service for Finnish court documents. In JURIX 2019, pp. 251–254. IOS Press.
[28] Rubinstein, I.S. and W. Hartzog. 2016. Anonymization and risk. Wash. L. Rev. 91: 703.
[29] Ruokolainen, T., P. Kauppinen, M. Silfverberg, and K. Lindén. 2019, aug. A Finnish news corpus for named entity recognition. Language Resources and Evaluation 54(1): 247–272. https://doi.org/10.1007/S10579-019-09471-7. arXiv:1908.04212.
[30] Sang, E.T.K. and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.
[31] Stalla-Bourdillon, S. and A. Knight. 2016. Anonymous data v. personal data-false debate: an EU perspective on anonymization, pseudonymization and personal data. Wis. Int’l LJ 34: 284.
[32] Tamper, M., A. Oksanen, J. Tuominen, E. Hyvönen, A. Hietanen, and Others. 2018. Anonymization Service for Finnish Case Law: Opening Data without Sacrificing Data Protection and Privacy of Citizens. In International Conference on Law via the Internet, LVI.
[33] Virtanen, A., J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
[34] Weischedel, R., M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, and Others. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA 23.
[35] Ylilauta. 2016. Ylilauta-korpuksen ladattava versio [tekstikorpus]. Kielipankki. Available at http://urn.fi/urn:nbn:fi:lb-2016101210.