<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Krister</forename><surname>Lindén</surname></persName>
							<email>krister.linden@helsinki.fi</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Digital Humanities</orgName>
								<orgName type="institution">University of Helsinki</orgName>
								<address>
									<addrLine>Yliopistokatu 4</addrLine>
									<postCode>00014</postCode>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Helsinki</orgName>
								<address>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Teemu</forename><surname>Ruokolainen</surname></persName>
							<email>teemu.ruokolainen@tuni.fi</email>
							<affiliation key="aff2">
								<orgName type="department">Faculty of Information Technology and Communication Sciences</orgName>
								<orgName type="institution">Tampere University</orgName>
								<address>
									<addrLine>Kalevantie 4</addrLine>
									<postCode>33014</postCode>
									<settlement>Tampere</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lasse</forename><surname>Hämäläinen</surname></persName>
							<email>lasse.hamalainen@tuni.fi</email>
							<affiliation key="aff2">
								<orgName type="department">Faculty of Information Technology and Communication Sciences</orgName>
								<orgName type="institution">Tampere University</orgName>
								<address>
									<addrLine>Kalevantie 4</addrLine>
									<postCode>33014</postCode>
									<settlement>Tampere</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">J</forename><forename type="middle">Tuomas</forename><surname>Harviainen</surname></persName>
							<email>tuomas.harviainen@tuni.fi</email>
							<affiliation key="aff2">
								<orgName type="department">Faculty of Information Technology and Communication Sciences</orgName>
								<orgName type="institution">Tampere University</orgName>
								<address>
									<addrLine>Kalevantie 4</addrLine>
									<postCode>33014</postCode>
									<settlement>Tampere</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Conference on Technology Ethics -Tethics</orgName>
								<address>
									<addrLine>October 18-19</addrLine>
									<postCode>2023</postCode>
									<settlement>Turku</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">63D4335E8B3EB91D1AC0B36B92E59844</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-12-29T07:40+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>dark web</term>
					<term>illegal narcotics</term>
					<term>online marketplace</term>
					<term>data sharing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a Finnish dark web marketplace website. The site was active from 2017 to 2021 and was, during this time, one of the most prominent online illegal narcotics markets in Finland. As a result of the presented work, a reduced version of the corpus, the Finnish Dark Web Marketplace Corpus (FINDarC), has been archived in the Language Bank of Finland. Researchers can apply for access rights to the corpus under the CLARIN RES licence. The discussion presented in this paper addresses the archiving process, including assessment of the risk and impact of data subject re-identification, assessment and implementation of viable data anonymization/reduction approaches, assessment of privacy and security measures implemented by the Language Bank of Finland, and the future corpus management plan coordinated by the Language Bank of Finland.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We discuss the archiving procedure of a corpus comprising posts submitted to Torilauta, a dark web marketplace website, for research purposes. The site was active from 2017 to 2021 and was, during this time, one of the most prominent online illegal narcotics markets in Finland. Functionally, the site consisted of discussion imageboards where vendors and customers were able to set up instances of face-to-face trading, typically with the assistance of instant messaging software such as Wickr or Telegram. The original, unmodified data set comprising 3,104,976 posts was collected and handed over to the ENNCODE consortium <ref type="foot" target="#foot_0">1</ref> by the site administration to be archived and shared for research purposes, as permitted by the site's Terms of Service. As a result of the presented work, a reduced version of the corpus comprising 3,104,515 posts, referred to as the Finnish Dark Web Marketplace Corpus (FINDarC), has been deposited in the Language Bank of Finland, a language resource service coordinated by the national FIN-CLARIN consortium formed by Finnish universities and other research organizations. Researchers can contact the Language Bank and apply for permission to access the corpus under the CLARIN RES license. <ref type="foot" target="#foot_1">2</ref> While dark web online marketplaces, including Torilauta, emphasize user anonymity, the posts submitted to such sites can nevertheless contain personal information, such as unique usernames and personal names, enabling data subject re-identification.
Therefore, as described in this paper, we have made our best effort to assess and identify the type and amount of personal information in the original unmodified data set, to assess and implement viable data anonymization/reduction approaches, to assess privacy and security measures implemented by the Language Bank of Finland, and to put in place a future corpus management plan coordinated by the Language Bank of Finland. Moreover, any future research based on the corpus is encouraged to implement appropriate ethical proofreading measures (see e.g. <ref type="bibr" target="#b19">[19]</ref>) in order to further mitigate any potential harm from access to the material, to both the researchers and the studied populations.</p><p>The rest of the paper is organized as follows. We first discuss previous studies on Torilauta and existing related corpora in Section 2. We then provide an overview of the data set in Section 3. In Section 4, we present our experimental work, the purpose of which was to assess the amount of personal and sensitive data in the corpus. Subsequently, given the experimental results, we discuss in detail our data release plan in Section 5. Finally, we provide conclusions on the work in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, we discuss previously published studies using the Torilauta site as a data source and related corpora in Sections 2.1 and 2.2, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Previous Studies on Torilauta</head><p>Prior to the work presented here, Torilauta was utilized as a data source in multiple linguistic and social science studies <ref type="bibr" target="#b15">[15,</ref><ref type="bibr" target="#b18">18,</ref><ref type="bibr" target="#b16">16,</ref><ref type="bibr" target="#b17">17,</ref><ref type="bibr" target="#b21">21]</ref>. In particular, Haasio et al. <ref type="bibr" target="#b15">[15]</ref> examined information needs of drug users using a sample of 9,300 posts. <ref type="foot" target="#foot_2">3</ref> Harviainen et al. <ref type="bibr" target="#b18">[18]</ref> studied cultural and socioeconomic aspects of drug traders using the same 9,300 post sample. Hämäläinen and Ruokolainen <ref type="bibr" target="#b17">[17]</ref> studied narcotic substance vocabulary based on a sample of 3,000 posts. Hämäläinen et al. <ref type="bibr" target="#b16">[16]</ref> studied a sample of 1,654 usernames extracted from posts submitted to the site. Karjalainen et al. <ref type="bibr" target="#b21">[21]</ref> examined the availability of illegal narcotics during the first wave of the COVID-19 pandemic using a sample of 535 posts.</p><p>It is notable that none of the previous studies attempted to share their data sets with the research community in a systematic manner. This practice negatively impacts the replication and verification of the published studies and potentially discourages further research on the topic. On the other hand, given the sensitive and potentially incriminating nature of the data, not releasing the data is an understandable approach since preparing and managing such a resource gives rise to multiple technical, ethical, and legal challenges. The purpose of this paper is to describe and discuss these challenges and how we approached them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Related Corpora</head><p>To the best of our knowledge, there exist relatively few published dark web corpora or text data sets. Three notable exceptions are the Dark Net Market archives (2013-2015) <ref type="bibr" target="#b4">[5]</ref>, a collection covering 89 dark net markets and over 37 related forums (1.6TB uncompressed) scraped during 2013-2015; DUTA <ref type="bibr" target="#b1">[2]</ref>, a set of 7,000 text samples formed by sampling the Tor network for two months; and CoDa <ref type="bibr" target="#b20">[20]</ref>, a set of 10,000 web documents tailored towards text-based dark web analysis. All three corpora comprise primarily English texts and are either publicly downloadable (Dark Net Market archives) or available to researchers upon request (DUTA, CoDa).</p><p>Existing Finnish web forum corpora include texts collected from the Ylilauta imageboard <ref type="bibr" target="#b35">[35]</ref> and the Suomi24 social networking site <ref type="bibr" target="#b5">[6]</ref>. While emphasizing user anonymity, both Ylilauta and Suomi24 forums operate on the clear web and strictly forbid illegal content. Both corpora are available for research purposes via the Language Bank of Finland under a Creative Commons (CC BY-NC 4.0) license.</p><p>The Finnish Internet Parsebank <ref type="bibr" target="#b22">[22]</ref> is a large-scale syntactically analyzed text collection created using plain text webpage data made available by the Common-Crawl2 Internet crawl project. Due to the employed web crawling approach to data collection, the corpus is likely to contain web forum content.</p><p>Finally, in a recent study, Leedham et al. <ref type="bibr" target="#b23">[23]</ref> discussed their work on archiving a hard-to-access WiSP corpus consisting of texts written by social work professionals describing their work practices. Due to the potentially sensitive nature of the texts, Leedham et al.
<ref type="bibr" target="#b23">[23]</ref> created two versions of the corpus: one for the research project and an anonymized/reduced version for archiving. In a similar vein, our work presented here aims to provide an extensive discussion on the process of preparing a corpus of potentially sensitive texts for archiving and sharing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data</head><p>In this section, we describe the original, unmodified corpus received by the ENNCODE consortium. We provide an overview of the data as a whole and discuss in more detail the contained data fields, post lifespans, and thread and message lengths in Sections 3.1, 3.2, 3.3, and 3.4, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Overview</head><p>The original data set received by the consortium included all posts submitted to Torilauta between 2019-09-11 and 2020-05-20 (1,863,639 posts in 251 days) and 2020-06-17 and 2020-10-31 (1,099,710 posts in 136 days). In addition to the posts collected during these active collection periods, the data contained "residue" posts submitted between 2017-11-02 and 2019-09-11 (141,627 posts in 678 days). Meanwhile, posts submitted between 2020-05-20 and 2020-06-17 were missing completely. Therefore, the original unmodified corpus consisted of 3,104,976 posts in total.</p><p>Table <ref type="table" target="#tab_0">1</ref> presents an overview of the post and thread frequencies in the data grouped by boards along with brief topic descriptions. Of the 32 boards, the board with the highest activity measured by the total number of submitted posts and threads was the market board dedicated to narcotics transactions within the city of Helsinki (/hki). The total number of posts submitted to this board was 787,459 corresponding to 25.4% of all posts in the data. Meanwhile, in total 96.5% (2,997,624) of all posts were submitted to the 16 boards dedicated to transactions (denoted with (market) in the topic column of Table <ref type="table" target="#tab_0">1</ref>).</p></div>
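<div xmlns="http://www.tei-c.org/ns/1.0"><p>The totals and shares quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that only reuses the counts stated in this section; the names PERIOD_COUNTS, TOTAL_POSTS, and share are illustrative and not part of the corpus tooling.</p>
```python
# Post counts per collection period, copied from Section 3.1.
PERIOD_COUNTS = {
    "2017-11-02..2019-09-11 (residue)": 141_627,
    "2019-09-11..2020-05-20": 1_863_639,
    "2020-06-17..2020-10-31": 1_099_710,
}

# Size of the original, unmodified corpus.
TOTAL_POSTS = sum(PERIOD_COUNTS.values())


def share(part: int, whole: int = TOTAL_POSTS) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    print(TOTAL_POSTS)        # 3104976
    print(share(787_459))     # share of the /hki board: 25.4
    print(share(2_997_624))   # share of the 16 market boards: 96.5
```
</div>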
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Fields</head><p>Each post in the corpus is represented as a data structure with 8 fields as shown in Table <ref type="table" target="#tab_1">2</ref>. Each field belongs to one of the following three types: a string, an integer, or a date. All dates are in timezone UTC+0 (GMT+0). Note that throughout this paper, we refer to a set of these 8 data fields as a post, whereas the content of the text data field within a single post is referred to as the message or text body. Missing values have different meanings depending on the field. For the deletion field, a missing value means the post was never deleted. The first post of a thread, referred to as the original post (OP), always has a missing postId value and is instead identified by its (boardUri, threadId) pair. The subject field is missing for 46.2% of all posts since it was common practice to omit the subject. Similarly, the poster name field is missing for 54.1% of all posts. This is in line with the anonymous nature of the site and with the fact that any optional contact information, such as an instant messenger username, was often included in the message text body instead.</p><p>Finally, the posts submitted to Torilauta optionally contained an attached image. However, no images were included in the original data set. Moreover, the data fields comprising a post did not include information on whether the post contained an image or not.</p></div>
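<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the description of the post structure concrete, a post could be modelled roughly as follows. This is a sketch under stated assumptions: only boardUri, threadId, and postId are field names taken from the text, while the remaining attribute names and the choice of integer versus string types are hypothetical stand-ins for the fields listed in Table 2.</p>
```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Post:
    board_uri: str                 # boardUri in the source data
    thread_id: int                 # threadId in the source data (type assumed)
    post_id: Optional[int]         # postId; missing (None) for the original post
    subject: Optional[str]         # hypothetical name; often omitted by posters
    name: Optional[str]            # hypothetical name for the poster name field
    message: str                   # hypothetical name for the text body
    creation: datetime             # hypothetical name; UTC timestamp
    deletion: Optional[datetime]   # hypothetical name; None = never deleted

    @property
    def is_original_post(self) -> bool:
        """The first post of a thread (OP) has no postId."""
        return self.post_id is None

    @property
    def thread_key(self) -> Tuple[str, int]:
        """An OP is identified by its (boardUri, threadId) pair."""
        return (self.board_uri, self.thread_id)


# Illustrative record with made-up values.
example = Post("hki", 1001, None, None, None, "example text",
               datetime(2020, 1, 1, 12, 12, tzinfo=timezone.utc), None)
```
</div>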
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Post Lifespans</head><p>Submitted posts were deleted from the site for three main reasons. First, the site hosted a fixed number of threads on each board at a given time and so inactive threads were regularly removed by an automatic pruning mechanism to make room for new, active threads. Second, posts which violated the site rules (e.g. spam) were removed by the site administration. Third, the site interface did not provide users with means to edit messages and, therefore, the only way to correct erroneous message content (e.g. typos, updates) was to delete the post and resubmit. In addition, a small portion of posts were "pinned" by the site administration, that is, they were meant to stay available on the site indefinitely. There are two known caveats related to the deletion timestamps. First, while the data set included the creation and deletion times of posts, it did not include information about the reason for the deletion. Second, all posts deleted during the pause in collection (2020-05-20 to 2020-06-17) had their deletion value marked as missing and, therefore, appeared as if they were not deleted. The number of these potentially erroneous missing values was, however, relatively small, and 97.22% of all 3,104,976 posts in the data had reliable deletion time information. Finally, we estimated the median lifespans of submissions to the market and non-market boards to be 23 and 238 hours, respectively. The difference was mainly due to the lower posting frequency and consequently lower thread pruning frequency of the non-market boards. For noise filtering purposes, we are mostly interested in the messages with short lifespans. To this end, we note that 5% of all messages had a lifespan of less than 32 minutes.
Since posts with such short lifespans were likely removed by the user and resubmitted after minor modifications, they may be discarded as noise. <ref type="foot" target="#foot_3">4</ref></p></div>
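<div xmlns="http://www.tei-c.org/ns/1.0"><p>A lifespan-based noise filter of the kind suggested above might look as follows. This is a minimal sketch, not the procedure used for the released corpus; the function names are hypothetical, and the 32-minute threshold is simply the 5% quantile reported in this section.</p>
```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the text: 5% of messages lived less than 32 minutes.
MIN_LIFESPAN = timedelta(minutes=32)


def lifespan(creation: datetime, deletion: Optional[datetime]) -> Optional[timedelta]:
    """Return the time a post stayed online, or None if it was never deleted."""
    if deletion is None:
        return None
    return deletion - creation


def is_short_lived(creation: datetime,
                   deletion: Optional[datetime],
                   threshold: timedelta = MIN_LIFESPAN) -> bool:
    """True if the post was deleted sooner than the threshold after creation."""
    span = lifespan(creation, deletion)
    return span is not None and threshold > span
```
<p>Posts with a missing deletion value are never flagged, which also covers the posts whose deletion time was lost during the pause in collection.</p></div>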
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Thread and Message Lengths</head><p>By and large, the threads on the market boards were rather short: the majority (57.9%) consisted of a single post (i.e., the thread had no reply or follow-up messages to the original post) and 99% of the threads had 23 posts or fewer. Similarly, on the non-market boards, 64.18% of the threads consisted of a single post and 99% of the threads had 35 posts or fewer.</p><p>In order to examine individual message lengths, we tokenized the message bodies using the tokenizer included in the Finnish Tagtools (v1.5) software developed at the University of Helsinki. <ref type="foot" target="#foot_4">5</ref> The tokenizer splits running text into sentences and extracts punctuation and special characters from word bodies into separate tokens. The median lengths of original posts and reply/follow-up posts on the market boards were 36 and 6 word tokens, respectively. In other words, the original posts were typically short (a few sentences) and reply/follow-up posts even shorter (a few words). As for the reply/follow-up posts, this was because of the prevalence of short update posts, the purpose of which was to keep the thread alive and close to the front page. These types of posts comprised roughly one fifth of all market board reply/follow-up posts. Therefore, one can reduce the amount of noise in FINDarC considerably by ignoring reply/follow-up posts with, e.g., fewer than 5 word tokens (or fewer than 20 characters) in the message body. <ref type="foot" target="#foot_5">6</ref></p></div>
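<div xmlns="http://www.tei-c.org/ns/1.0"><p>The length-based filtering of short update posts can be sketched as below. Two assumptions are made here: whitespace splitting stands in for the Finnish Tagtools tokenizer (which separates punctuation and would therefore give slightly different counts), and the two thresholds are combined with a logical OR, although the text presents them as alternatives. The function name is hypothetical.</p>
```python
def is_short_update(message: str,
                    is_reply: bool,
                    min_tokens: int = 5,
                    min_chars: int = 20) -> bool:
    """Flag reply/follow-up posts whose body is too short to be informative.

    Original posts (is_reply=False) are never flagged.
    """
    if not is_reply:
        return False
    text = message.strip()
    tokens = text.split()  # crude stand-in for a proper tokenizer
    return min_tokens > len(tokens) or min_chars > len(text)
```
<p>Applying the filter only to replies keeps every original post, so no thread loses its opening message.</p></div>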
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Assessing the Frequency and Nature of Personal Data</head><p>In this section, we present our experimental work, the purpose of which was to assess the amount and types of personal and sensitive data in the FINDarC corpus. As the size of the corpus exceeded 3,000,000 messages, it was deemed infeasible to curate all the posts manually given the consortium resources. In Section 4.1, we discuss experiments based on manual annotation of message bodies. In Section 4.2, we discuss experiments utilizing full-text search using hand-crafted regular expressions. Finally, the implications of the results on data reduction and data release are discussed in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Manual Annotation</head><p>In this part of the experiments, we examined the types of personal data contained in the message bodies of posts using a manual annotation approach. In particular, we were interested in the types of personal information found in the market board submissions, which contain the majority of posts in the data set (97%) and represent the primary function of the site as a narcotics marketplace.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Definition of Personal Data</head><p>We adopted Article 4(1) of the GDPR (https://gdpr-info.eu/art-4-gdpr/), which defines personal data as "any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person", and divided potential identifiers into the following 5 classes: 1. person_name (e.g. first name, family name); 2. id_number (e.g. the Finnish social security number); 3. location (e.g. street address, city, city district); 4. online_id (e.g. instant messaging username, email address, IP address); 5. other (e.g. phone number, identifying physical appearance).</p><p>Moreover, GDPR Article 9(1) (https://gdpr-info.eu/art-9-gdpr/) further defines the special categories of personal data as follows: "Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person's sex life or sexual orientation shall be prohibited."</p><p>For cases belonging to this group, we added a sixth potential identifier class: 6. special_category.</p><p>Given the personal data class specifications, the annotation task then comprised two subtasks: entity span detection and class assignment on a word token level to continuous, non-overlapping sequences of tokens. <ref type="foot" target="#foot_6">9</ref> For example, consider</p><p>priimaa kukkaa [Helsingin keskustassa]_location ! yhteydenotot Wickerillä W // [example-name]_online_id</p><p>excellent bud [in the centre of Helsinki]_location ! contact Wicker W // [example-name]_online_id</p></div>
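<div xmlns="http://www.tei-c.org/ns/1.0"><p>The annotation scheme, with its six classes and its continuous, non-overlapping token spans, can be represented with a small data structure. The following is an illustrative sketch only; the Span type, the validate function, and the tokenization of the example are assumptions and do not describe the annotation tool actually used.</p>
```python
from dataclasses import dataclass
from typing import Iterable, List

# The six personal data classes defined in Section 4.1.1.
CLASSES = {"person_name", "id_number", "location",
           "online_id", "other", "special_category"}


@dataclass(frozen=True)
class Span:
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    label: str   # one of CLASSES


def validate(spans: Iterable[Span], n_tokens: int) -> List[Span]:
    """Return spans sorted by position, or raise ValueError if any span has
    an unknown label, is empty, is out of range, or overlaps another span."""
    ordered = sorted(spans, key=lambda s: s.start)
    prev_end = 0
    for s in ordered:
        if s.label not in CLASSES:
            raise ValueError(f"unknown class: {s.label}")
        if not (s.end > s.start >= prev_end and n_tokens >= s.end):
            raise ValueError(f"invalid or overlapping span: {s}")
        prev_end = s.end
    return ordered


# The example post from the text, with an illustrative tokenization.
tokens = ["priimaa", "kukkaa", "Helsingin", "keskustassa", "!",
          "yhteydenotot", "Wickerillä", "W", "//", "example-name"]
annotation = validate([Span(9, 10, "online_id"), Span(2, 4, "location")],
                      len(tokens))
```
</div>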
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Personal Data versus Named Entities</head><p>The definition of personal data and the personal data classes described in the previous section are closely related to named entities <ref type="bibr" target="#b14">[14,</ref><ref type="bibr" target="#b30">30]</ref>, in particular, person and location names. However, while there is an overlap between the personal data classes and the named entities, they are not equivalent, nor is one a subset of the other. There are two main reasons for this. First, a person/location name mentioned in running text is almost invariably identified as a named entity independent of the context. For example, consider the sentence</p><p>Sauli Niinistö on Suomen presidentti.</p><p>Sauli Niinistö is the President of Finland.</p><p>where Sauli Niinistö and Suomen (= Finland) are defined as person and location named entities, respectively, according to the Finnish NER guidelines <ref type="bibr" target="#b29">[29,</ref><ref type="bibr" target="#b25">25]</ref>. In contrast, neither of these should be interpreted as personal data according to the GDPR as stating the name and occupation of a well-known public figure does not breach their privacy. Second, not all pieces of personal name or location information found in a message will be classified as named entities. This is because the definition of personal data takes into account a wider context than what is directly available in the message. In other words, any piece of information found in text relating to an individual can be considered personal data if it is reasonable to assume that the information in question could be combined with other information to identify that individual. For example, consider the modified version of a post in the corpus presented in Table <ref type="table" target="#tab_2">3</ref>. From the information contained in this exemplary post, one can infer that: 1. at 12:12 (UTC) on January 1st 2019 (according to the post creation timestamp); 2. a person using the Wicker username example-name (according to the contact information included in the message); 3. was selling illegal narcotics (according to subject and message) in the Helsinki area (according to the post boardUri); 4. more specifically in the Eastern part of Helsinki; and 5. with a high likelihood near the metro track. In consequence, we would regard the mentions of East (idässä) and metro (metro) in the message body as personal location data. In contrast, these mentions would not be considered named location entities according to the Finnish NER annotation guidelines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">Manual NER annotation</head><p>The results of a preliminary manual annotation showed, rather unsurprisingly, that the majority of personal information cases consisted of instant messenger usernames and areas posted as potential settings for face-to-face transactions. On the other hand, the preliminary experiment also revealed that even native Finnish speakers can struggle with the text domain and that manually curating large parts of the corpus would be an excessively tedious effort. This in turn suggests that automatic text processing methods are also likely to struggle with the text domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.4.">Automated NER annotation</head><p>Text anonymization/reduction approaches proposed in the literature commonly utilize automatic NER as a part of the processing pipelines to varying extents <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b32">32,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b27">27,</ref><ref type="bibr" target="#b7">7,</ref><ref type="bibr" target="#b11">11,</ref><ref type="bibr" target="#b13">13,</ref><ref type="bibr" target="#b12">12]</ref>. Ideally, NER tools would also be useful when processing FINDarC as they could, in principle, be employed to automatically detect some direct personal identifiers, such as names and addresses. However, we found this approach problematic to implement as the number of posts marked with person and location entities was in the hundreds of thousands. While substantially smaller than the original data set of over 3,000,000 posts, this set was still too large to be curated manually given the available consortium resources. Examining the predictions on the manually annotated data set suggested that the available tools suffered from a domain mismatch in addition to the inherent mismatch between personal data and named entity classes. This was not a completely surprising outcome since the text domain could also cause problems for human annotators.</p><p>Because the tools tended both to miss entities of interest (low recall) and to be incorrect when detecting entities (low precision), we did not consider them efficient curation tools for FINDarC in their current state and instead continued to the full-text search experiments presented in Section 4.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Full-Text Search</head><p>In this section, we assess the amount and types of personal data in the data set using a full-text search approach. In particular, we are interested in finding common personal identifiers with relatively rigid formats, such as social security numbers and phone numbers. We describe the experiment setup in Section 4.2.1, the applied textual patterns (regular expressions) in Section 4.2.2, and the obtained results in Section 4.2.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Setup</head><p>We define a target set of textual patterns (regular expressions) and search for matches in message bodies. Specifically, we are interested in finding expressions matching: 1. (Finnish) social security numbers; 2. (Finnish) phone numbers; 3. email addresses; 4. IBAN bank accounts; 5. IP addresses; all of which have relatively rigid formats. The employed regular expressions are presented in Section 4.2.2. We apply the search to all posts in the data and assign the matches manually to personal data and non-personal data according to post context. The pattern matching is performed using MongoDB (v.5.0) full text search. <ref type="foot" target="#foot_7">10</ref> In contrast to the manual annotation experiment discussed in Section 4.1, we do not filter out noise from the data and instead apply the search to all 3,104,976 posts in the original corpus.</p></div>
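<div xmlns="http://www.tei-c.org/ns/1.0"><p>The search setup can be illustrated with a small stand-in for the database query. The sketch below uses the Python re module instead of MongoDB, so it shows the general idea (scan every message body with every pattern, collect candidates for manual review) rather than the actual implementation; the function name, the (key, message) input format, and the demo pattern are hypothetical.</p>
```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def find_candidates(posts: Iterable[Tuple[str, str]],
                    patterns: Dict[str, str]
                    ) -> Iterator[Tuple[str, str, str]]:
    """Yield (post_key, pattern_name, matched_text) for every regex match.

    posts    -- iterable of (post_key, message_body) pairs
    patterns -- mapping from a pattern name to a regular expression string
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    for post_key, message in posts:
        for name, rx in compiled.items():
            for match in rx.finditer(message):
                yield post_key, name, match.group(0)


# Demo with a toy pattern; the real expressions are given in Section 4.2.2.
demo_posts = [("p1", "abc 123456 x"), ("p2", "no digits here")]
demo_hits = list(find_candidates(demo_posts, {"six_digits": r"\d{6}"}))
```
<p>Every match is only a candidate: as described above, assigning matches to personal and non-personal data remains a manual step.</p></div>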
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Regular Expressions</head><p>In what follows, we provide brief descriptions of the applied regular expressions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Social security numbers</head><p>The Finnish social security number (SSN) is a sequence of 11 characters assigned to individuals by the Finnish government based on their date of birth and gender. The first 10 characters of the sequence are 6 digits (date of birth) followed by a hyphen or the letter A, followed by 3 digits. The last character is alphanumeric, i.e., a digit or a letter. Valid sequences therefore have formats such as "121212-1234" and "121212-123A". We detect the sequences using the regular expression</p><formula xml:id="formula_0">\d\d\d\d\d\d\-\d\d\d[a-zA-Z0-9]</formula><p>Persons born in the 2000s, who would have an "A" instead of a hyphen, were not found in the sample.</p></div>
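<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the social security number expression can be tried out with the Python re module, assuming the expression behaves the same way there as in the database search. The names SSN_RE and find_ssn_candidates are hypothetical.</p>
```python
import re

# The expression from the text: six digits, a hyphen, three digits,
# and a final alphanumeric character.
SSN_RE = re.compile(r"\d\d\d\d\d\d\-\d\d\d[a-zA-Z0-9]")


def find_ssn_candidates(text: str) -> list:
    """Return all substrings that look like a hyphen-form Finnish SSN."""
    return SSN_RE.findall(text)


if __name__ == "__main__":
    print(find_ssn_candidates("hetu 121212-123A ok"))   # ['121212-123A']
    print(find_ssn_candidates("born 121212A123B"))      # []
```
<p>The expression checks the format only. It does not validate the date or the final check character, and it covers the hyphen form but not the "A" form, so matches still need manual inspection.</p></div>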
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Phone numbers</head><p>According to the specification of the Finnish telephone network numbering, Finnish mobile phone numbers begin with a routing number (04-, 050, or 059) and are followed by a subscriber number, such as "040 1234567", "059 1234567", and so forth. <ref type="foot" target="#foot_8">11</ref> The first zero ("0") of the number can optionally be replaced by the country code of Finland +358 (e.g. "+358 40 1234123", "+358 59 4321432", etc.). Based on a preliminary examination of the data set, we detect common phone number formats using two regular expressions:</p><formula xml:id="formula_1">[\+]?358[\-\s]?0[45][\-\s]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d</formula><p>which matches numbers starting with the country code and</p><formula xml:id="formula_2">0[45]\d[\s\-]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d</formula><p>which detects numbers with the country code omitted. Moreover, the expressions detect the most commonly used grouping patterns using hyphens (e.g. "+358-40-12345-567") and whitespaces (e.g. "059 123 4567"). While the subscriber part of the number can, in principle, vary in length, the patterns match the most common length of 7 digits. Landline numbers would be shorter but follow the same principles; none were, however, found in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Email addresses</head><p>According to the RFC 5322 standard <ref type="foot" target="#foot_9">12</ref>, an email address is an identifier which contains a locally interpreted string followed by the at-character ("@") followed by an internet domain, such as "name@domain.com", "firstname.surname@subdomain.domain.com", and "underscore-hyphen-plus+sign@domain.com". We detect the addresses using the regular expression \S+\@\S+\.\S+ which successfully detects all the above examples in running text.</p></div>
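<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration, the expression for phone numbers without the country code and the email expression can be exercised with the Python re module. Only these two expressions are shown; the constant names are hypothetical, and behaviour in the database search may differ in details from Python.</p>
```python
import re

# Phone numbers with the country code omitted, as given in the text.
PHONE_LOCAL_RE = re.compile(r"0[45]\d[\s\-]?\d\d\d[\-\s]?\d[\-\s]?\d\d\d")

# Email addresses, as given in the text.
EMAIL_RE = re.compile(r"\S+\@\S+\.\S+")

if __name__ == "__main__":
    print(PHONE_LOCAL_RE.findall("soita 040 1234567 tai 059 123 4567"))
    # ['040 1234567', '059 123 4567']
    print(EMAIL_RE.findall("mail name@domain.com now"))
    # ['name@domain.com']
```
<p>Because the email expression is built from runs of non-whitespace characters, a match also swallows any punctuation attached to the address, which is one more reason why the matches were verified manually.</p></div>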
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IBAN bank accounts</head><p>We search for bank account numbers matching the International Bank Account Number (IBAN) structure specified by the ISO 13616-1:2020 standard <ref type="foot" target="#foot_10">13</ref>. IBAN-formatted numbers consist of the Finnish bank account number (14 digits) preceded by a two-letter country code ("FI" for Finland) and two check digits (e.g. "FI72 1234 5678 1234 12"). We detect the pattern using the regular expression</p><formula xml:id="formula_3">[Ff][Ii]\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d\d\d[\s\-]?\d\d</formula><p>which takes into consideration the letter case of the country code and the commonly used grouping whitespace.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>IP addresses</head><p>IP (Internet Protocol) addresses are unique addresses which identify devices on the internet and local networks. We search for IP addresses using the following regular expression </p></div>
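<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, dotted-quad IP address matching can be sketched as below. The octet pattern is a commonly used IPv4 expression and is an assumption here; it may differ from the exact expression used by the authors. The function name and sample text are hypothetical.</p>
```python
import re

# Assumed octet pattern: values 0-255, optionally with leading zeros.
# This is a widely used IPv4 idiom, not confirmed to be the authors' expression.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three octets each followed by a dot, then a final octet.
IPV4_PATTERN = re.compile(r"(?:" + OCTET + r"\.){3}" + OCTET)


def find_ip_candidates(text):
    """Return substrings of `text` that look like dotted-quad IPv4 addresses."""
    return [m.group(0) for m in IPV4_PATTERN.finditer(text)]


print(find_ip_candidates("server 192.168.1.10 runs version 1.2.3.4"))
```
<p>A pattern of this kind also matches other dotted number sequences, such as four-part software version strings, which is one plausible reason why matches of this type need manual verification.</p></div>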
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3.">Results</head><p>The frequencies of matched social security numbers, phone numbers, email addresses, bank account numbers, and IP addresses are presented in Table <ref type="table" target="#tab_3">4</ref>. As shown, the most and least frequently matched types were email addresses and bank account numbers, with 1,840 and 12 regular expression matches, respectively. Because the number of matches was sufficiently low, we were able to manually verify all the cases; the results are also presented in Table <ref type="table" target="#tab_3">4</ref>. According to this inspection, the phone numbers and email addresses occurred in two contexts. First, similarly to the instant messaging usernames, 491 out of 858 phone numbers and 1,622 out of 1,837 email addresses were posted as contact information by the individuals themselves. The remaining cases were posted as a means of targeting people. In such cases, personal details (e.g., name, relationship information, area of residence) were shared in connection with one or more usernames, in order to paint the person as a potential target for violence. IP addresses occurred similarly in two contexts. Out of the 16 IP addresses, 10 cases were included as a means of targeting, while the remaining 6 were provided as a type of contact information. Finally, all 73 and 12 found cases of social security numbers and bank account numbers, respectively, were posted with the purpose of targeting. Thus, we identified in total 667 cases of targeting in 295 posts using this method. Subsequently, we created a second regular expression list using words and prefixes related to the personal information contained in the identified 295 targeting posts. This list consisted of 77 keywords and parts of person names and addresses. 
<ref type="foot" target="#foot_11">14</ref> After performing a second search with these patterns and a subsequent manual inspection, we identified an additional set of 166 posts submitted as a means of targeting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.4.">Discussion</head><p>Similarly to automatic NER, rule-based search for text patterns is a commonly used part of text anonymization pipelines to varying extents <ref type="bibr" target="#b8">[8,</ref><ref type="bibr" target="#b32">32,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b27">27,</ref><ref type="bibr" target="#b7">7,</ref><ref type="bibr" target="#b11">11,</ref><ref type="bibr" target="#b13">13,</ref><ref type="bibr" target="#b12">12]</ref>. For multiple cases, such as email addresses and social security numbers, it is rather straightforward to write the patterns as regular expressions. Moreover, one can efficiently find pattern matches in large data sets using suitable databases, such as MongoDB, or search engines. However, the caveat of examining the results is, of course, that manually verifying the matches only provides us with an estimate of the precision of the search while neglecting recall. In other words, one cannot estimate the number of cases missed by the search in the whole corpus without manually curating thousands or, preferably, tens of thousands of posts. Unfortunately, an annotation effort of this scale was not feasible given the available resources of the consortium. Nevertheless, using the search combined with manual examination of the search results, we were able to uncover a set of 461 posts submitted with the purpose of targeting individuals. While extremely rare compared to the total number of posts in the corpus, these posts are particularly interesting from the point of view of data reduction and will be discussed further in Section 5.</p></div>
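<div xmlns="http://www.tei-c.org/ns/1.0"><p>The precision estimate mentioned above can be worked out directly from the counts in Table 4. The sketch below assumes that the verified column counts the matches confirmed as true instances of each type; the dictionary and function names are hypothetical. Recall cannot be computed in the same way, since the number of missed cases is unknown.</p>
```python
# Counts copied from Table 4: (regular expression matches, manually verified cases).
TABLE_4 = {
    "phone": (875, 858),
    "hetu": (91, 73),
    "email": (1840, 1837),
    "iban": (12, 12),
    "ip_address": (121, 16),
}


def precision(matches, verified):
    """Share of regular expression matches confirmed by manual verification."""
    return verified / matches


for kind, (matches, verified) in TABLE_4.items():
    print(f"{kind}: {precision(matches, verified):.3f}")
```
<p>Under this assumption, the per-type ratios make the difference between the pattern types explicit: nearly every email match was confirmed, whereas only a small share of the IP address matches was.</p></div>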
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Data Release</head><p>In this section, we draw on the discussion and experimental results presented in the previous sections and present an outline of the FINDarC data release.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>Data Reduction</head><p>Conventionally, the most direct approach to protect data subjects from re-identification has been to anonymize the data by removing/obscuring the parts containing personal information <ref type="bibr" target="#b26">[26]</ref>. However, it appears evident that, if implemented successfully, this type of processing would have a profound impact on the usefulness of FINDarC for research purposes. For example, after removing usernames from their post contexts or from the data altogether, one would not be able to replicate the study of Hämäläinen et al. <ref type="bibr" target="#b16">[16]</ref> who examined how sellers and buyers of illegal drugs represent themselves in their usernames. In turn, after removing location and/or timestamp data, one would no longer be able to replicate the study of Karjalainen et al. <ref type="bibr" target="#b21">[21]</ref> who studied the availability of drugs specifically in the city of Tampere during the COVID-19 epidemic in the spring of 2020. From a utility point of view, therefore, it could be argued that removing personal information from the buy/sell post threads would quickly degrade, or destroy, the usefulness of the corpus as a data source for research. This problem is generally referred to as the privacy-utility trade-off within the data privacy literature <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>Due to the problematic privacy-utility trade-off, we posit here that reducing FINDarC extensively would not be appropriate even if sufficient resources could be allocated for domain-specific tool development and manual labour. 
Furthermore, we note that Torilauta and other drug trading sites have also been under observation by other parties, including both criminals and law enforcement agencies. Therefore, it is our assessment that leaving the sell/buy posts, which form the majority of FINDarC, largely intact poses few additional risks to the studied populations. However, as discussed in Section 4.2, in addition to the sell/buy posts, the data also contains posts with the intention of doxxing/targeting individuals. Here, our position is that removing these submissions is warranted from an ethical point of view while not significantly decreasing the value of the corpus as a data source. This is because these posts are not directly related to the main functionality of the site as an online marketplace. Accordingly, we removed from the corpus all 461 posts containing identified doxxing/targeting information described in Section 4. The reduced corpus, therefore, comprises 3,104,515 posts. A summary of the removed posts by board is presented in Table <ref type="table" target="#tab_4">5</ref>.</p><p>Finally, as per the Terms of Service of Torilauta, the site users gave consent to data collection for academic use by using the site. Consequently, site users could opt out of the data collection by not submitting new posts and/or contacting the site administration about previously submitted posts. However, it could be argued that by removing a previously submitted post, a user has withdrawn the permission to use the data. Unfortunately, as discussed in Section 3.2, the original data set received from the site administration did not include information about the reasons behind post deletions. Therefore, we were not able to exclude any posts from the corpus based on the deletion status.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>Access Restrictions</head><p>Due to the limited applicability of data reduction as a means of protecting data subjects from re-identification, we next discuss restricting the access to the corpus. 
In general, the Language Bank recommends sharing resources using standard CC-BY or other open source licenses, in which case only the metadata of the resource needs to be registered with the Language Bank, although the Language Bank also hosts openly available and publicly accessible resources (PUB) in its Language Bank Download service. However, since the FINDarC resource in its current form contains personal data, both copyright and personal data legislation apply and the corpus cannot be published with open access. Instead, FINDarC is provided via protected access under the CLARIN RES licence, which means that permission to download and use the corpus is only granted to researchers based on written applications, including a data protection impact assessment, reviewed by the data controller (principal investigator of the ENNCODE consortium). The purpose of this limitation is to ensure that the material is accessed only by verified researchers for legitimate research purposes. It also lessens sharing-related risks to both the researchers and the subjects of study, as mandated by the consortium's data management policy. Restricting access to the corpus as described here is in line with the current literature on data sharing <ref type="bibr" target="#b26">[26,</ref><ref type="bibr" target="#b28">28,</ref><ref type="bibr" target="#b31">31,</ref><ref type="bibr" target="#b10">10,</ref><ref type="bibr" target="#b9">9]</ref> which also acknowledges the limitations of data anonymization/reduction and encourages the use of user group limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corpus Version Control</head><p>Resources deposited in the Language Bank may have several different variants (i.e. versions) which form a resource group. Typically, a resource group consists of different annotations (raw data or preprocessed data for a single corpus), accumulated data (the content is almost identical but one version has more or newer content), or repaired data (flaws or necessary modifications, e.g. justified requests to remove or stop processing data, which have been identified and fixed manually or automatically). As for FINDarC, we emphasize the importance of the third point and note that if the Language Bank receives a notification and/or request for removal of content on grounds of sensitive data from a user of the corpus, these requests will be reviewed and acted upon. In particular, the Language Bank may update the corpus by reducing the data further and only store and share the most recent version.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>We discussed the archiving procedure of FINDarC, a Finnish dark web marketplace corpus, in the Language Bank of Finland. The discussion included an overview of the data, assessment of the risk and impact of data subject re-identification, assessment and implementation of viable data reduction approaches using manual and automatic text processing, assessment of privacy and security measures implemented by the Language Bank of Finland, and a future corpus management plan implemented and coordinated by the Language Bank of Finland. As a result of the presented work, a reduced version of the corpus has been archived in the Language Bank of Finland. Researchers can apply for access to the corpus under the CLARIN RES licence. As this article shows, the data set was cleared using best practices for ethical proofreading, which consistently sought to prioritize the protection of the posters on the forum. Given how even indirect identifiers could be utilized against the site's users by either law enforcement or other members of the drug-using community, it was necessary to opt for maximal removal efficiency whenever possible. Nevertheless, the amount of data is so large that no clearing can be considered ethically sufficient for the purpose of releasing the data openly, so we opted for a gatekeeping approach in addition to clearing everything that could be found.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><figDesc>((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) which matches patterns such as 88.777.66.555 and so forth.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>An overview of the number of submitted posts by board. Board topics marked with (market) indicate that the board was dedicated to transactions.</figDesc><table><row><cell>board</cell><cell>topic</cell><cell># posts</cell><cell>posts (%)</cell><cell># threads</cell><cell>threads (%)</cell></row><row><cell>hki</cell><cell>City of Helsinki (market)</cell><cell>787,459</cell><cell>25.4</cell><cell>239,750</cell><cell>25.9</cell></row><row><cell>tre</cell><cell>City of Tampere (market)</cell><cell>435,463</cell><cell>14.0</cell><cell>152,410</cell><cell>16.5</cell></row><row><cell>vnt</cell><cell>City of Vantaa (market)</cell><cell>242,848</cell><cell>7.8</cell><cell>74,173</cell><cell>8.0</cell></row><row><cell>oulu</cell><cell>City of Oulu (market)</cell><cell>242,240</cell><cell>7.8</cell><cell>86,114</cell><cell>9.3</cell></row><row><cell>muut</cell><cell>Other areas (market)</cell><cell>232,513</cell><cell>7.5</cell><cell>29,293</cell><cell>3.2</cell></row><row><cell>tku</cell><cell>City of Turku (market)</cell><cell>232,225</cell><cell>7.5</cell><cell>75,121</cell><cell>8.1</cell></row><row><cell>esp</cell><cell>City of Espoo (market)</cell><cell>224,707</cell><cell>7.2</cell><cell>57,229</cell><cell>6.2</cell></row><row><cell>kpo</cell><cell>City of Kuopio (market)</cell><cell>155,304</cell><cell>5.0</cell><cell>27,903</cell><cell>3.0</cell></row><row><cell>jkl</cell><cell>City of Jyväskylä (market)</cell><cell>145,890</cell><cell>4.7</cell><cell>49,079</cell><cell>5.3</cell></row><row><cell>lti</cell><cell>City of Lahti (market)</cell><cell>110,358</cell><cell>3.6</cell><cell>35,860</cell><cell>3.9</cell></row><row><cell>bulk</cell><cell>Bulk transactions (market)</cell><cell>65,220</cell><cell>2.1</cell><cell>17,070</cell><cell>1.8</cell></row><row><cell>vsa</cell><cell>City of Vaasa 
(market)</cell><cell>60,137</cell><cell>1.9</cell><cell>23,795</cell><cell>2.6</cell></row><row><cell>t</cell><cell>Dates</cell><cell>33,688</cell><cell>1.1</cell><cell>13,694</cell><cell>1.5</cell></row><row><cell>roi</cell><cell>City of Rovaniemi (market)</cell><cell>27,850</cell><cell>0.9</cell><cell>11,460</cell><cell>1.2</cell></row><row><cell>seka</cell><cell>Miscellaneous (market)</cell><cell>22,208</cell><cell>0.7</cell><cell>9,566</cell><cell>1.0</cell></row><row><cell>hm</cell><cell>Narcotics markets</cell><cell>18,280</cell><cell>0.6</cell><cell>2,204</cell><cell>0.2</cell></row><row><cell>b</cell><cell>Random</cell><cell>9,068</cell><cell>0.3</cell><cell>1,249</cell><cell>0.1</cell></row><row><cell>h</cell><cell>Narcotics</cell><cell>7,942</cell><cell>0.3</cell><cell>1,753</cell><cell>0.2</cell></row><row><cell>y</cell><cell>Jobs</cell><cell>7,579</cell><cell>0.2</cell><cell>2,532</cell><cell>0.3</cell></row><row><cell>pm</cell><cell>Mail orders (market)</cell><cell>7,023</cell><cell>0.2</cell><cell>2,074</cell><cell>0.2</cell></row><row><cell>a</cell><cell>Everyday/mundane</cell><cell>6,772</cell><cell>0.2</cell><cell>1,351</cell><cell>0.1</cell></row><row><cell>hox</cell><cell>Hormones (market)</cell><cell>6,180</cell><cell>0.2</cell><cell>2,137</cell><cell>0.2</cell></row><row><cell>hax</cell><cell>Hacking</cell><cell>4,197</cell><cell>0.1</cell><cell>1,203</cell><cell>0.1</cell></row><row><cell>kkk</cell><cell>Cultivation</cell><cell>3,725</cell><cell>0.1</cell><cell>1,349</cell><cell>0.1</cell></row><row><cell>meta</cell><cell>Meta discussion</cell><cell>3,646</cell><cell>0.1</cell><cell>827</cell><cell>0.1</cell></row><row><cell>test</cell><cell>Testing</cell><cell>2,927</cell><cell>0.1</cell><cell>2,240</cell><cell>0.2</cell></row><row><cell>spam</cell><cell>Spamming</cell><cell>2,495</cell><cell>0.1</cell><cell>2,495</cell><cell>0.3</cell></row><row><cell>rotta</cell><cell>Vendor 
feedback</cell><cell>2,220</cell><cell>0.1</cell><cell>314</cell><cell>0.0</cell></row><row><cell>tt</cell><cell>Health</cell><cell>1,838</cell><cell>0.1</cell><cell>248</cell><cell>0.0</cell></row><row><cell>k</cell><cell>Getting and staying sober</cell><cell>1,562</cell><cell>0.1</cell><cell>247</cell><cell>0.0</cell></row><row><cell>fap</cell><cell>Porn</cell><cell>1,323</cell><cell>0.0</cell><cell>343</cell><cell>0.0</cell></row><row><cell>pgp</cell><cell>PGP public keys</cell><cell>89</cell><cell>0.0</cell><cell>2</cell><cell>0.0</cell></row><row><cell>total</cell><cell></cell><cell>3,104,976</cell><cell>100.0</cell><cell>925,085</cell><cell>100.0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Data fields comprising a single post. The column titled missing (%) indicates the portion of all posts where the field value is not available.</figDesc><table><row><cell>field</cell><cell>description</cell><cell>example</cell><cell>missing (%)</cell></row><row><cell>boardUri</cell><cell>board identifier</cell><cell>roi</cell><cell>0.0</cell></row><row><cell>creation</cell><cell>post creation datetime (UTC)</cell><cell>2020-01-14T17:51:24.714Z</cell><cell>0.0</cell></row><row><cell>deletion</cell><cell>post deletion datetime (UTC)</cell><cell>2020-01-27T16:49:03.663Z</cell><cell>2.8</cell></row><row><cell>threadId</cell><cell>thread identifier</cell><cell>27961</cell><cell>0.0</cell></row><row><cell>postId</cell><cell>post identifier</cell><cell>28069</cell><cell>29.8</cell></row><row><cell>name</cell><cell>poster name</cell><cell>example-name</cell><cell>54.1</cell></row><row><cell>subject</cell><cell>message subject</cell><cell>Example message subject</cell><cell>46.2</cell></row><row><cell>message</cell><cell>message text body</cell><cell>Example message text body</cell><cell>0.0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>A modified example post containing personal data.</figDesc><table><row><cell>field</cell><cell>value</cell></row><row><cell>boardUri</cell><cell>hki</cell></row><row><cell>creation</cell><cell>2019-01-01T12:12:12.123Z</cell></row><row><cell>deletion</cell><cell>2019-02-02T13:13:13.345Z</cell></row><row><cell>threadId</cell><cell>12345</cell></row><row><cell>postId</cell><cell>23456</cell></row><row><cell>name</cell><cell>[missing]</cell></row><row><cell>subject</cell><cell>kukkaa myynnissä (selling bud)</cell></row><row><cell>message</cell><cell>kukkaa myynnissä idässä. kuljetus metron lähelle. Wicker // example-name</cell></row><row><cell></cell><cell>(selling bud in the east. delivery nearby metro. Wicker // example-name)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Matched regular expression frequencies. The columns titled matches and verified denote the number of found regular expression matches and the number of manually verified cases, respectively. The columns titled posts and threads denote the number of distinct posts and threads where the verified cases occurred.</figDesc><table><row><cell></cell><cell>matches</cell><cell>verified</cell><cell>posts</cell></row><row><cell>phone</cell><cell>875</cell><cell>858</cell><cell>699</cell></row><row><cell>hetu</cell><cell>91</cell><cell>73</cell><cell>65</cell></row><row><cell>email</cell><cell>1,840</cell><cell>1,837</cell><cell>1,707</cell></row><row><cell>iban</cell><cell>12</cell><cell>12</cell><cell>12</cell></row><row><cell>ip_address</cell><cell>121</cell><cell>16</cell><cell>14</cell></row><row><cell>total</cell><cell>2,939</cell><cell>2,796</cell><cell>2,261</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Number of posts removed from FINDarC prior to archiving due to personal information used for targeting.</figDesc><table><row><cell>board</cell><cell># posts</cell></row><row><cell>hm</cell><cell>113</cell></row><row><cell>b</cell><cell>110</cell></row><row><cell>esp</cell><cell>62</cell></row><row><cell>muut</cell><cell>41</cell></row><row><cell>hki</cell><cell>23</cell></row><row><cell>t</cell><cell>19</cell></row><row><cell>tre</cell><cell>14</cell></row><row><cell>a</cell><cell>10</cell></row><row><cell>rotta</cell><cell>10</cell></row><row><cell>tku</cell><cell>9</cell></row><row><cell>jkl</cell><cell>8</cell></row><row><cell>kpo</cell><cell>7</cell></row><row><cell>bulk</cell><cell>5</cell></row><row><cell>vnt</cell><cell>5</cell></row><row><cell>spam</cell><cell>4</cell></row><row><cell>seka</cell><cell>4</cell></row><row><cell>vsa</cell><cell>3</cell></row><row><cell>oulu</cell><cell>3</cell></row><row><cell>roi</cell><cell>2</cell></row><row><cell>hox</cell><cell>2</cell></row><row><cell>pm</cell><cell>1</cell></row><row><cell>meta</cell><cell>1</cell></row><row><cell>k</cell><cell>1</cell></row><row><cell>hax</cell><cell>1</cell></row><row><cell>h</cell><cell>1</cell></row><row><cell>fap</cell><cell>1</cell></row><row><cell>lti</cell><cell>1</cell></row><row><cell>total</cell><cell>461</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Consortium website: https://research.tuni.fi/enncode/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Permanent link to the corpus: http://urn.fi/urn:nbn:fi:lb-2022062221</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">In their paper, Haasio et al.<ref type="bibr" target="#b15">[15]</ref> refer to Torilauta using its other commonly used name Sipulitori.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">This is in agreement with the recommendation of the site administrator.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">Available at http://urn.fi/urn:nbn:fi:lb-2021042102</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">This is in agreement with the recommendation of the site administrator.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">The tokenization of messages was acquired using the Finnish Tagtools (ver. 1.5) toolkit.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">https://www.mongodb.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">Specification of numbers in the Finnish phone network is available at: https://www.finlex.fi/fi/viranomaiset/normi/480001/47180</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">The RFC 5322 specification is available at: https://datatracker.ietf.org/doc/html/rfc5322</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">https://www.iso.org/standard/81090.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_11">We do not present the list here due to obvious privacy issues.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">AnonyMate: A toolkit for anonymizing unstructured chat data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Adams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Aili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Aioanei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jonsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mickelsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mikmekova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Valencia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wechsler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on NLP and Pseudonymisation</title>
				<meeting>the Workshop on NLP and Pseudonymisation</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Classifying illegal activities on tor network based on web textual contents</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Al Nabki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fidalgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Alegre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>de Paz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 15th Conference of the European Chapter of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="35" to="43" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Differential privacy: on the trade-off between utility and information leakage</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Alvim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Andrés</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chatzikokolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Degano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Palamidessi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Formal Aspects in Security and Trust</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="39" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Inter-annotator Agreement. Handbook of Linguistic Annotation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Artstein</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="297" to="313" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Branwen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Christin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Décary-Hétu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Andersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stexo</surname></persName>
		</author>
		<author>
			<persName><surname>Presidente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anonymous</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sohhlz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cakic</surname></persName>
		</author>
		<author>
			<persName><surname>Buskirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Whom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>McKenna</surname></persName>
		</author>
		<author>
			<persName><surname>Goode</surname></persName>
		</author>
		<ptr target="https://www.gwern.net/DNM-archives" />
		<title level="m">Dark net market archives, 2011-2015</title>
				<imprint>
			<date type="published" when="2015-07">July 2015; accessed 2022-06-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m">Suomi24 virkkeet -korpus 2001-2020</title>
				<imprint>
			<publisher>City Digital Group</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>Korp-versio. tekstikorpus</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title/>
		<author>
			<persName><surname>Kielipankki</surname></persName>
		</author>
		<ptr target="http://urn.fi/urn:nbn:fi:lb-2021101525" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Challenges and Open Problems of Legal Document Anonymization</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Csányi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vági</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Vadász</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Orosz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Symmetry</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page">1490</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Towards personal data identification and anonymization using machine learning techniques</title>
		<author>
			<persName><forename type="first">F</forename><surname>Di Cerbo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Trabelsi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Advances in Databases and Information Systems</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="118" to="126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Elliot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mackey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>O'Hara</surname></persName>
		</author>
		<title level="m">The anonymisation decision-making framework 2nd Edition: European practitioners&apos; guide</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Functional anonymisation: Personal data and the data environment</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elliot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>O'Hara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>O'Keefe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mackey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dibben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gowans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Purdam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>McCullagh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Law &amp; Security Review</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="204" to="221" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Anonymization for the GDPR in the Context of Citizen and Customer Relationship Management and NLP</title>
		<author>
			<persName><forename type="first">G</forename><surname>Francopoulo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Schaub</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Legal and Ethical Issues (Legal2020)</title>
				<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Automatic Curation of Court Documents: Anonymizing Personal Data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Garat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wonsever</surname></persName>
		</author>
		<idno type="DOI">10.3390/INFO13010027</idno>
		<ptr target="https://doi.org/10.3390/INFO13010027" />
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">27</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Anonymization of German legal court rulings</title>
		<author>
			<persName><forename type="first">I</forename><surname>Glaser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schamberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Matthes</surname></persName>
		</author>
		<idno type="DOI">10.1145/3462757.3466087</idno>
		<ptr target="https://doi.org/10.1145/3462757.3466087" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th International Conference on Artificial Intelligence and Law, ICAIL 2021</title>
				<meeting>the 18th International Conference on Artificial Intelligence and Law, ICAIL 2021</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="205" to="209" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Message understanding conference-6: A brief history</title>
		<author>
			<persName><forename type="first">R</forename><surname>Grishman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">M</forename><surname>Sundheim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 16th International Conference on Computational Linguistics</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
	<note>COLING</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Information needs of drug users on a local dark Web marketplace</title>
		<author>
			<persName><forename type="first">A</forename><surname>Haasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Harviainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Savolainen</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ipm.2019.102080</idno>
		<ptr target="https://doi.org/10.1016/j.ipm.2019.102080" />
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">102080</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Usernames on a Finnish Online Marketplace for Illegal Drugs</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hämäläinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Haasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Harviainen</surname></persName>
		</author>
		<idno type="DOI">10.5195/NAMES.2021.2234</idno>
		<ptr target="https://doi.org/10.5195/NAMES.2021.2234" />
	</analytic>
	<monogr>
		<title level="j">Names: A Journal of Onomastics</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Kukkaa, amfea, subua ja essoja: Huumausaineiden slanginimitykset Tor-verkon suomalaisella kauppapaikalla</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hämäläinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<idno type="DOI">10.30673/sja.106615</idno>
		<ptr target="https://doi.org/10.30673/sja.106615" />
	</analytic>
	<monogr>
		<title level="j">Sananjalka</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="page" from="130" to="153" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Drug traders on a local dark web marketplace</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Harviainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Haasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hämäläinen</surname></persName>
		</author>
		<idno type="DOI">10.1145/3377290.3377293</idno>
		<ptr target="https://doi.org/10.1145/3377290.3377293" />
	</analytic>
	<monogr>
		<title level="m">ACM International Conference Proceeding Series</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="20" to="26" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Information protection in dark web drug markets research</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Harviainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Haasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Siuda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hamari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Hawaii International Conference on System Sciences</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Shedding new light on the language of the dark web</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.06885</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>To appear in NAACL 2022</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Huumeiden saatavuus, käyttö ja huumausainerikollisuus Tampereella koronakeväänä 2020</title>
		<author>
			<persName><forename type="first">K</forename><surname>Karjalainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nyrhinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gunnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ylöstalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ståhl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Yhteiskuntapolitiikka</title>
		<imprint>
			<biblScope unit="volume">86</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="80" to="90" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Syntactic n-gram collection from a large-scale corpus of Internet Finnish</title>
		<author>
			<persName><forename type="first">V</forename><surname>Laippala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ginter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth International Conference Baltic HLT</title>
				<meeting>the Sixth International Conference Baltic HLT</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">268</biblScope>
			<biblScope unit="page">184</biblScope>
		</imprint>
	</monogr>
	<note>Human Language Technologies-The Baltic Perspective</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Creating a corpus of sensitive and hard-to-access texts: Methodological challenges and ethical concerns in the building of the WiSP Corpus</title>
		<author>
			<persName><forename type="first">M</forename><surname>Leedham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lillis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Twiner</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.acorp.2021.100011</idno>
		<ptr target="https://doi.org/10.1016/j.acorp.2021.100011" />
	</analytic>
	<monogr>
		<title level="j">Applied Corpus Linguistics</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">100011</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">On the tradeoff between privacy and utility in data publishing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="517" to="526" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A Broad-coverage Corpus for Finnish Named Entity Recognition</title>
		<author>
			<persName><forename type="first">J</forename><surname>Luoma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oinonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pyykönen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Laippala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pyysalo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Language Resources and Evaluation Conference</title>
				<meeting>the 12th Language Resources and Evaluation Conference<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>European Language Resources Association</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4615" to="4624" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Broken promises of privacy: Responding to the surprising failure of anonymization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ohm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">UCLA Law Review</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page">1701</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">AnoPpi: A pseudonymization service for Finnish court documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Oksanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tamper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tuominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hietanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hyvönen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">JURIX 2019</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="251" to="254" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Anonymization and risk</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">S</forename><surname>Rubinstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hartzog</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wash. L. Rev</title>
		<imprint>
			<biblScope unit="volume">91</biblScope>
			<biblScope unit="page">703</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">A Finnish news corpus for named entity recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ruokolainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kauppinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Silfverberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lindén</surname></persName>
		</author>
		<idno type="DOI">10.1007/S10579-019-09471-7</idno>
		<idno type="arXiv">arXiv:1908.04212</idno>
		<ptr target="https://doi.org/10.1007/S10579-019-09471-7" />
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="247" to="272" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Tjong Kim Sang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Meulder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</title>
				<meeting>the Seventh Conference on Natural Language Learning at HLT-NAACL 2003</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="142" to="147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Anonymous data v. personal data-false debate: an EU perspective on anonymization, pseudonymization and personal data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Stalla-Bourdillon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Knight</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wis. Int&apos;l LJ</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page">284</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Anonymization Service for Finnish Case Law: Opening Data without Sacrificing Data Protection and Privacy of Citizens</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tamper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oksanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tuominen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hyvönen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hietanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Others</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Law via the Internet</title>
				<imprint>
			<publisher>LVI</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Virtanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kanerva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ilo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luoma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Luotolahti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salakoski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ginter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pyysalo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.07076</idno>
		<title level="m">Multilingual is not enough: BERT for Finnish</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Weischedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pradhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ramshaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaufman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Franchini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Others</forename></persName>
		</author>
		<title level="m">OntoNotes Release 5.0 LDC2013T19</title>
				<meeting><address><addrLine>Philadelphia, PA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note>Linguistic Data Consortium</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Ylilauta-korpuksen ladattava versio</title>
		<author>
			<persName><surname>Ylilauta</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>tekstikorpus</note>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title/>
		<author>
			<persName><surname>Kielipankki</surname></persName>
		</author>
		<ptr target="http://urn.fi/urn:nbn:fi:lb-2016101210" />
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
