An Anonymization Tool for Open Data Publication of Legal Documents

Arttu Oksanen¹, Eero Hyvönen¹,², Minna Tamper¹,², Jouni Tuominen¹,²,³, Henna Ylimaa⁴, Katja Löytynoja⁴, Matti Kokkonen⁴ and Aki Hietanen⁵

¹ Aalto University, Dept. of Computer Science
² University of Helsinki, Helsinki Centre for Digital Humanities (HELDIG)
³ University of Helsinki, Helsinki Institute for Social Sciences and Humanities (HSSH)
⁴ Statistics Finland
⁵ Ministry of Justice in Finland

Abstract

The EU General Data Protection Regulation (GDPR) requires that documents containing personal data, such as court decisions, be anonymized before they are made available for public use. Doing this manually is costly and time-consuming, but the task can be automated by applying Natural Language Processing (NLP) methods. This paper introduces the ANOPPI tool developed for (semi-)automatic anonymization of Finnish texts. The tool can be used both as a web application and programmatically through a REST API. Evaluation shows that ANOPPI performs well with different types of documents; however, further improving the performance of the named entity recognition and disambiguation methods would enhance the usefulness of the software. The tool is being published as open source for public use by the Ministry of Justice in Finland. A use case of ANOPPI is to publish court decisions on the Web in the LawSampo semantic portal for human close reading and as Linked Open Data for data analysis in legal informatics.

Keywords

anonymization, pseudonymization, case law, named entity recognition

1. Introduction

Many texts and legal documents of interest to a wider audience, such as interviews of people and court decisions, contain sensitive personal data. This makes it difficult to publish and use them under the EU General Data Protection Regulation (GDPR)¹ unless the personal data they contain is disguised, which is costly and time-consuming to do manually.
This paper presents the software architecture and first evaluation results of the ANOPPI tool and web service² for automatic and semi-automatic anonymization of Finnish documents, extending our earlier short paper [1] about the tool. Utilizing both machine learning (ML) and rule-based named entity recognition and linking methods (NER/NEL) [2] and morphological analysis, ANOPPI is able to automatically or semi-automatically anonymize documents written in Finnish while preserving their readability and layout. ANOPPI is the first anonymization tool developed for Finnish. The tool was developed for automatic anonymization of court decisions [1] in the Anoppi project funded by the Ministry of Justice in Finland, but it can be, and has been, used in other contexts, too. The source code will be published under an open license once the ongoing deployment of the service is finished.

¹ https://europa.eu/youreurope/business/dealing-with-customers/data-protection/data-protection-gdpr/
² Project homepage: https://seco.cs.aalto.fi/projects/anoppi/

Joint Proceedings of ISWC2022 Workshops: the International Workshop on Artificial Intelligence Technologies for Legal Documents (AI4LEGAL) and the International Workshop on Knowledge Graph Summarization (KGSum) (2022)
arttu.oksanen@aalto.fi (A. Oksanen); eero.hyvonen@aalto.fi (E. Hyvönen); minna.tamper@aalto.fi (M. Tamper); jouni.tuominen@helsinki.fi (J. Tuominen); katja.loytynoja@stat.fi (K. Löytynoja); matti.kokkonen@stat.fi (M. Kokkonen); aki.hietanen@om.fi (A. Hietanen)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

This paper first introduces the ANOPPI tool and the underlying method for automatic anonymization from an end-user point of view (Section 2).
We then present the underlying technical ideas of the tool, starting with a description of the software architecture, the language analysis components, and the workflow in Section 3. After this, the first results of evaluating the performance of ANOPPI in fully automatic anonymization are presented. In conclusion, related work and future directions of development are discussed.

2. ANOPPI Client User Interface

Figure 1: ANOPPI client user interface

The web-based user interface (WUI) of ANOPPI is depicted in Fig. 1. It is an HTML5/JavaScript application implemented with the React user interface library. The NPM (Node package manager) modules redux and redux-observable are used for handling the application state and asynchronous logic.

The general idea of the tool is to automatically find named entities in the document and mark them in different colors and with different symbols based on their type (person, place, organization, etc.). A numeric ID is used to identify, mark, and disambiguate entities; multiple occurrences of the same entity have the same ID. The marked document is shown on the left in Fig. 1.

After the named entities and their occurrences have been found in the text, the occurrences are replaced with referent identifiers. To assign a reasonable identifier to a given named entity, its category must also be resolved. For example, we must be able to differentiate towns from corporations so that an identifier can be correctly determined as either “town A” or “corporation A”. Categorical disambiguation is based on a scoring scheme that weighs the results obtained from the different named entity recognizers.

On the right-hand side of Fig. 1, each entity found is listed in a table that can be used for editing the entities and for specifying how the entity label is replaced in the anonymized document. For example, wrongly identified entities can be deleted, missing ones can be inserted, the type of an entity can be changed, and the replacement string can be edited.
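The replacement scheme described in this section, where all occurrences of an entity share one ID and receive a category-specific referent identifier, could be sketched roughly as follows. This is only an illustration under stated assumptions and not ANOPPI's implementation: the `Occurrence` record, the helper names, and the label templates are hypothetical, and Finnish inflection of the identifiers is ignored here.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass(frozen=True)
class Occurrence:
    """One mention of a named entity (hypothetical record, not ANOPPI's data model)."""
    start: int      # character offset where the mention begins
    end: int        # character offset just past the mention
    entity_id: int  # shared by all mentions of the same entity
    category: str   # e.g. "person", "town", "corporation"


def occurrences_of(text: str, surface: str, entity_id: int, category: str) -> list[Occurrence]:
    """Helper for the example: find every literal occurrence of a surface form."""
    return [Occurrence(m.start(), m.end(), entity_id, category)
            for m in re.finditer(re.escape(surface), text)]


def referent(category: str, index: int) -> str:
    """Assumed label template: a bare letter for persons, 'category X' otherwise."""
    letter = ascii_uppercase[index]  # sketch only: at most 26 entities per category
    return letter if category == "person" else f"{category} {letter}"


def anonymize(text: str, occurrences: list[Occurrence]) -> str:
    labels: dict[int, str] = {}
    counters: dict[str, int] = defaultdict(int)
    # Assign labels in reading order so the first entity of each category gets "A".
    for occ in sorted(occurrences, key=lambda o: o.start):
        if occ.entity_id not in labels:
            labels[occ.entity_id] = referent(occ.category, counters[occ.category])
            counters[occ.category] += 1
    # Replace from the end of the text so that earlier offsets stay valid.
    for occ in sorted(occurrences, key=lambda o: o.start, reverse=True):
        text = text[:occ.start] + labels[occ.entity_id] + text[occ.end:]
    return text


if __name__ == "__main__":
    sample = "Virtanen sued Acme Oy in Tampere. Acme Oy paid Virtanen."
    found = (occurrences_of(sample, "Virtanen", 1, "person")
             + occurrences_of(sample, "Acme Oy", 2, "corporation")
             + occurrences_of(sample, "Tampere", 3, "town"))
    print(anonymize(sample, found))
```

Keeping a separate counter per category is what allows “town A” and “corporation A” to coexist in the same document, which is why the category has to be resolved before an identifier can be assigned.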
By clicking on a “preview” button, the user can see the final result and then save it with its original formatting.

The court orders to be anonymized are available in electronic format as plain text, XML, HTML, or DOCX files. Based on the Natural Language Processing (NLP) tooling discussed in [3], we have developed a tool that is able to find the named entities in these documents and annotate their occurrences with special tags.

3. ANOPPI Architecture

ANOPPI is a web-based tool for automatic or semi-automatic anonymization. It comes in two versions:

1. A Web application version that has a user interface; anonymization is done with manual checks and editing performed by a human user. The interface was described in the previous section.
2. A standalone REST API version that can be used programmatically for automatic or semi-automatic anonymization.

3.1. Software Architecture

The Web application version depicted in Fig. 2 consists of a WUI (ANOPPI Client, on the left) and a back-end ANOPPI Server. The ANOPPI Server consists of language analysis components (LAC) that perform named entity recognition (NER) and anonymization of the text. The NER process is carried out by the Nelli Tagger Service [3, 4], which performs morphological analysis, NER, and disambiguation on the text using different types of NER tools that apply rule-based, machine-learning-based, and vocabulary-based methods. The results are then displayed in the ANOPPI Client, where the user can modify the annotations. After the modifications, the document is anonymized by the ANOPPI Server. All of the individual components are packaged as Docker containers. The whole system can run on a single machine and in a single closed network.

The Web application version of the ANOPPI Server is implemented using Scala and the Play web framework³.
The back-end uses the Nelli Tagger Service to find the mentions of named entities in the text when a new document is uploaded to ANOPPI. In addition, LAS (SeCo Lexical Analysis Services) [5] and UralicNLP [6] are used to morphologically analyze and lemmatize a piece of text, e.g., when creating a new named entity or generating the correctly inflected form of a referent identifier.

³ https://www.playframework.com

Figure 2: Components of the ANOPPI Web application

The back-end stores the uploaded documents in a relational database (PostgreSQL⁴). The service generates UUID identifiers for the uploaded documents. These identifiers are stored in the database as well as in the local memory of the user’s web browser. Currently, the uploaded documents are connected to the user’s session, and there is no separate user authentication mechanism. After the browser cache is emptied, the uploaded documents can no longer be accessed.

The standalone REST API version is similar to Fig. 2 but does not include a user interface client and does not store any user data in a database. The API can be used programmatically for either automatic or semi-automatic anonymization. It takes the document as input and produces as output the annotated or anonymized document together with a separate list of all the named entities found in the document. A public test instance of this API⁵ runs on the CSC Rahti container cloud. API documentation, based on Swagger, is available on the Web⁶. The public API includes, for example, the following endpoints consumed by the ANOPPI Client:

POST /project
Create a new project (a document to be anonymized). This endpoint is consumed when a new document is uploaded to the web application. It sends the document content to the tagger, which finds mentions of named entities in the text. The tagged text is then returned to the client.
POST /project/preview
This endpoint is consumed by the client when previewing the anonymized version of the document. It replaces the selected named entities in the text with their referent identifiers.

POST /project/export
This endpoint is consumed by the client when exporting the finished anonymized version of the document. It prepares the final anonymized version of the document and sends it back to the client.

POST /project/tag-entity-occurrences
Find mentions of a specific entity in the text. This endpoint is used when the user manually selects from the text a new named entity to be anonymized.

GET /text/analyze
Analyze text. This endpoint is invoked after a new name (a surface form) has been selected from the document text. It resolves the lemma and case of the surface form. The required morphological analysis is done using LAS.

⁴ https://www.postgresql.org/
⁵ The API can be found at https://nlp.ldf.fi/anoppi.
⁶ Documentation: https://app.swaggerhub.com/apis-docs/apoksane/open-api_nlp_ldf_fi/1.0.0.

3.2. Language Analysis Components

To produce the automatically anonymized document, the ANOPPI Server first calls the Nelli Tagger Service to perform NER and find the named entities in the text. After the NER results have been obtained, the anonymized version of the document can be produced using the anonymization component. The Nelli Tagger and the anonymization process are presented in more detail in the following subsections.

3.2.1. Nelli Tagger Service

The identification of named entities in Finnish-language documents can be a challenging task. The documents can contain not only regular names but also unusual names and registry numbers that are often not picked up by rule-based or machine-learning-based NER tools. For example, both types of NER tools can have difficulty extracting foreign person names from documents related to, for example, immigration or refugees.
Similarly, ready-made tools typically do not pick up registry numbers such as vehicle registration plates or insurance numbers.

To find the named entities, the Nelli Tagger Service [3, 4] is used. It uses an ensemble method that combines the results obtained from multiple named entity recognizers. First, we use ready-made statistics- and rule-based NER software, such as FinBERT’s NER model⁷ [7], a rule-based named entity recognizer for the Finnish language, and Stanford NER [8]. Second, we have developed our own set of regular expression patterns to recognize things such as vehicle registration plates and property identifiers [9]. In addition, we use an all-inclusive Finnish person name ontology [9], based on the open data published by the Population Register Centre⁸, to look up person names appearing in the court cases. Finally, we use the Finnish Turku Neural Parser [10, 11] to help decide whether a term appearing in the text is a name. The FinBERT NER model and the Finnish Turku Neural Parser were selected for their performance. The other NER tools complement the FinBERT NER model’s results by adding entities, such as rare names, that cannot be identified with the pretrained models.

⁷ https://turkunlp.org/fin-ner.html
⁸ https://vrk.fi/en/

After NER, the results are disambiguated by scoring each NER method’s interpretation of the text; the most popular interpretation wins. The results are then converted to a JSON format that consists of a list of entities and their features, the original text annotated with named entities using HTML span tags, and metadata about the run of the Nelli Tagger Service (e.g., timestamp, success or error codes). The Nelli Tagger Service’s API is available as a REST API⁹.
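The ensemble idea described above, in which several recognizers each propose a category and the best-scoring interpretation wins, might look roughly like the following weighted vote. This is a minimal sketch under stated assumptions, not the Nelli Tagger Service's actual code: the recognizer names, the weights, and the plate pattern (two or three letters, a hyphen, one to three digits) are all hypothetical, and mentions are keyed by their surface string rather than by text spans for brevity.

```python
import re
from collections import defaultdict

# Assumed shape of a typical Finnish plate; the paper does not give its patterns.
PLATE = re.compile(r"\b[A-ZÅÄÖ]{2,3}-\d{1,3}\b")

# Hypothetical recognizer weights; the real scoring scheme is not published here.
WEIGHTS = {"finbert": 2.0, "rule_based": 1.0, "name_ontology": 1.0, "regex": 1.5}


def regex_recognizer(text: str) -> dict[str, str]:
    """Tag every plate-like string as a vehicle registration."""
    return {m.group(0): "vehicle_registration" for m in PLATE.finditer(text)}


def disambiguate(votes: dict[str, dict[str, str]]) -> dict[str, str]:
    """votes maps recognizer name -> {mention: category}.

    Returns the category with the highest accumulated weight per mention.
    """
    scores: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for recognizer, tagged in votes.items():
        for mention, category in tagged.items():
            scores[mention][category] += WEIGHTS.get(recognizer, 1.0)
    # Highest score wins; ties are broken alphabetically to stay deterministic.
    return {
        mention: min(by_category, key=lambda c: (-by_category[c], c))
        for mention, by_category in scores.items()
    }


if __name__ == "__main__":
    text = "Karhu parked the car ABC-123 in Nokia."
    votes = {
        "finbert": {"Karhu": "person", "Nokia": "place"},
        "rule_based": {"Karhu": "organization", "Nokia": "organization"},
        "name_ontology": {"Karhu": "person"},
        "regex": regex_recognizer(text),
    }
    print(disambiguate(votes))
```

In this sketch a recognizer that sees nothing simply casts no vote, so a mention found by only one tool (such as the plate) is still kept, which is in line with the recall-oriented use of complementary recognizers described in the text.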
The Nelli Tagger Service was previously evaluated to perform with 86 % accuracy [3]. Since that evaluation, however, the tools have been updated: the FinBERT NER model, which can perform NER with 93 % accuracy [4, 7], has been added.

3.2.2. Anonymization

Anonymization is done by replacing the recognized names with grouped sequential identifiers, such as ‘A’ and ‘B’ for persons or ‘place A’ and ‘place B’ for places, keeping track of the entities throughout the text. Capital letters were chosen as referent identifiers to mimic the way anonymization is currently carried out manually.

As Finnish is a highly inflected language, the correctly inflected form of the referent identifier has to be derived to preserve the readability of the anonymized text. For this reason, morphological analysis is used to determine, for example, the case and possessive suffix of nouns. The Turku Neural Parser [11] is used to perform morphological analysis of the original text, and UralicNLP¹⁰ [12] is used to inflect the referent identifiers to match the original text. Inflection is done using a finite state transducer (FST) model.

3.3. Workflow

The flow diagram in Fig. 3 illustrates how the data is processed in the ANOPPI system during the different steps of the anonymization process. As presented in Fig. 3, the process starts on the left when the user uploads a new document to ANOPPI. First, the document is loaded to the ANOPPI Server and the original document is stored in the database. Then the text is extracted from the document and sent to the Nelli Tagger Service, which performs named entity recognition and named entity disambiguation (NED) on the document. Here, the Nelli Tagger Service aims to maximize recall in the named entity recognition phase, because it is easier for the user of the WUI to delete suggested entities than to manually pick from the text new entities that the LAC did not recognize automatically.
It is also possible to restrict anonymization to certain types of entities (for example, only person names). Further fine-tuning can be done by whitelisting certain names and contexts, for example to keep the names of judges visible in court judgments. Once the Tagger Service returns the entities to the server, the annotated result is recorded in the database for later use and sent to the client in HTML format. The user checks the result in the WUI and can correct and preview it. Once the user is satisfied with the result, they can export the anonymized document. As the original DOCX file was stored initially, the anonymized document can be created by fetching the original file from the database and replacing its content with the anonymized text, preserving the original layout of the file.

The database holds the records of the document; if the same document is requested again, it can be retrieved from the database instead of being reprocessed with the LAC tagger component. However, old documents are scheduled to be deleted from the database after a configurable period of time.

⁹ https://app.swaggerhub.com/apis-docs/SeCo/nlp.ldf.fi/1.0.0
¹⁰ https://uralicnlp.com/

Figure 3: ANOPPI Workflow overview

4. Evaluation

To test the performance of the LACs in the anonymization of persons, we created a tailor-made test dataset that consists of Finnish text with person names added in grammatically appropriate places. The names were handpicked from the Population Information System data, with emphasis on selecting both common and rare names. Care was taken to include both traditional Finnish names and foreign names, as well as two-part names and names that are also common nouns, for example Karhu (bear). Investigation reports from the Safety Investigation Authority of Finland (SIAF) were used as the base text for the test data.
These reports do not originally contain any names, only references to people in the form of pronouns and job titles, which were randomly replaced by the selected names and their combinations. The resulting test data contained 152 added names and name combinations, of which ANOPPI identified and anonymized 136 (89.5 %) correctly. In total, ANOPPI identified 141 names or name combinations in the test data, but five of them were false positives, which corresponds to a precision of about 96.5 % (136/141). A total of 16 names remained unidentified. However, most of the unidentified names were located in parts of the text where names do not commonly occur, which would not be an issue with real data.

In addition to evaluating the performance of the LAC, the ANOPPI tool as a whole has been evaluated by measuring and comparing the time it takes to anonymize a court decision semi-automatically using the WUI and manually using only word processor software. The results obtained so far show that, on average, anonymizing a court order with ANOPPI takes about half the time of manual anonymization. ANOPPI makes some mistakes, especially by confusing person names with place names and vice versa, and spotting and correcting all of these incorrect categories in the WUI slows down the process. Moreover, to verify the correctness of the anonymization result, a human expert still has to become acquainted with the content of the document, even when the anonymization is automatic.

5. Related Works and Discussion

Automatic or semi-automatic anonymization and pseudonymization¹¹ methods are already in use in several European judicial systems [13]. For example, in Denmark an anonymization tool for court orders was implemented using solely manually crafted grammar rules to find the named entities in the texts [14].
Recent projects similar to ours, focusing on automatic anonymization of court orders using ML methods, possibly in combination with rule-based ones, have been conducted, for example, in Poland, Austria, Germany, Latvia and France¹². Regarding the NER/NEL services embedded in ANOPPI, there is a large body of related research, surveyed in [2]; our approach and system were discussed together with related work in Section 3.2.

Evaluation of ANOPPI shows promising results in locating the names of persons, organizations, places, and different types of identifiers of specific form. Still, it is difficult to build a general solution for anonymization, as the sufficiency of de-identification varies from case to case. The category-based selection of named entities used in the current model is not sufficient if, for example, the names of small companies should be anonymized but those of large ones should not. Another issue in the ANOPPI project is the lack of task-specific training data, as we are not able to store and make use of real production data to continuously train ML models due to restrictions imposed by the GDPR. That is why we ended up using a general NER model for the Finnish language along with configurable case-based rules.

The ANOPPI service is currently in pilot testing at the Ministry of Justice of Finland for anonymization of Finnish court decisions, in order to make them available on the Web and for data analysis in the forthcoming public LawSampo data service and portal for publishing and studying Finnish legislation and case law. Future work focuses on further improving the performance of the named entity recognition algorithm and on identifying new entity types, such as rare diseases or unique job titles, that make re-identification of people straightforward.

¹¹ See https://tietosuoja.fi/en/pseudonymised-and-anonymised-data for the difference between the notions of anonymization and pseudonymization.
¹² Based on webinar presentations at: https://ec.europa.eu/info/policies/justice-and-fundamental-rights/digitalisation-justice/conferences-and-events_en#webinarsontheuseofartificialintelligenceinthejusticefield

Acknowledgments

This work is part of the Finnish AI special funding program by the Ministry of Finance, for experiments that promote productivity. CSC – IT Center for Science, Finland provided computational resources.

References

[1] A. Oksanen, M. Tamper, J. Tuominen, A. Hietanen, E. Hyvönen, Anoppi: A pseudonymization service for Finnish court documents, in: Legal Knowledge and Information Systems. JURIX 2019: The Thirty-second Annual Conference (Araszkiewicz, M. and Rodríguez-Doncel, V. (eds.)), IOS Press, 2019, pp. 251–254.
[2] J. L. Martinez-Rodriguez, A. Hogan, I. Lopez-Arevalo, Information extraction meets the semantic web: A survey, Semantic Web – Interoperability, Usability, Applicability 11 (2020) 255–335. doi:10.3233/SW-180333.
[3] M. Tamper, E. Hyvönen, P. Leskinen, Visualizing and analyzing networks of named entities in biographical dictionaries for digital humanities research, in: Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICling 2019), Springer-Verlag, 2019. Forthcoming.
[4] M. Tamper, A. Oksanen, J. Tuominen, A. Hietanen, E. Hyvönen, Automatic annotation service appi: Named entity linking in legal domain, in: A. Harth, V. Presutti, R. Troncy, M. Acosta, A. Polleres, J. D. Fernández, J. Xavier Parreira, O. Hartig, K. Hose, M. Cochez (Eds.), The Semantic Web: ESWC 2020 Satellite Events, volume 12124 of Lecture Notes in Computer Science, Springer-Verlag, 2020, pp. 208–213. URL: https://doi.org/10.1007/978-3-030-62327-2_36. doi:10.1007/978-3-030-62327-2_36.
[5] E. Mäkelä, LAS: an integrated language analysis tool for multiple languages, The Journal of Open Source Software 1 (2016).
[6] M. Hämäläinen, UralicNLP: An NLP library for Uralic languages, Journal of Open Source Software 4 (2019) 1345. doi:10.21105/joss.01345.
[7] J. Luoma, M. Oinonen, M. Pyykönen, V. Laippala, S. Pyysalo, A broad-coverage corpus for Finnish named entity recognition, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 4615–4624. URL: https://aclanthology.org/2020.lrec-1.567.
[8] J. Rose Finkel, T. Grenager, C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, in: ACL-05 - 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2005, pp. 363–370.
[9] M. Tamper, P. Leskinen, J. Tuominen, E. Hyvönen, Modeling and publishing Finnish person names as a linked open data ontology, in: 3rd Workshop on Humanities in the Semantic Web (WHiSe 2020), CEUR Workshop Proceedings, vol. 2695, 2020, pp. 3–14. URL: http://ceur-ws.org/Vol-2695/paper1.pdf.
[10] K. Haverinen, J. Nyblom, T. Viljanen, V. Laippala, S. Kohonen, A. Missilä, S. Ojala, T. Salakoski, F. Ginter, Building the essential resources for Finnish: the Turku Dependency Treebank, Language Resources and Evaluation 48 (2014) 493–531. doi:10.1007/s10579-013-9244-1, open access.
[11] J. Kanerva, F. Ginter, N. Miekka, A. Leino, T. Salakoski, Turku neural parser pipeline: An end-to-end system for the CoNLL 2018 shared task, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, 2018.
[12] M. Hämäläinen, UralicNLP: An NLP library for Uralic languages, Journal of Open Source Software 4 (2019) 1345. doi:10.21105/joss.01345.
[13] M. van Opijnen, G. Peruginelli, E. Kefali, M. Palmirani, On-Line Publication of Court Decisions in the EU: Report of the Policy Group of the Project ’Building on the European Case Law Identifier’, 2017. Available at SSRN: https://ssrn.com/abstract=3088495.
[14] C. Povlsen, B. Jongejan, D. H. Hansen, B. K. Simonsen, Anonymization of court orders, in: 11th Iberian Conference on Information Systems and Technologies (CISTI), IEEE, Las Palmas, Spain, 2016. doi:10.1109/CISTI.2016.7521611.