<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>D. Dosso); alberto.testolin@unipd.it (A. Testolin)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simone De Renzis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis Dosso</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Testolin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff2">
          <institution>Siav S.p.A.</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of General Psychology, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This paper describes a machine learning system designed to identify sensitive data within Italian text documents, aligning with the definitions and regulations outlined in the General Data Protection Regulation (GDPR). To overcome the lack of suitable training datasets, which would require the disclosure of sensitive data from real users, the proposed system exploits a Large Language Model (LLM) to generate synthetic documents that can be used to train supervised classifiers to detect the target sensitive data. We show that “artificial” sensitive data can be generated using both proprietary or open source LLMs, demonstrating that the proposed approach can be implemented either using external services or by relying on locally runnable models. We focus on the detection of six key domains of sensitive data, by training supervised classifiers based on the BERT Transformer architecture adapted to carry out text classification and Named-Entity Recognition (NER) tasks. We evaluate the performance of the system using fine-grained metrics, and show that the NER model can achieve a remarkable detection performance (over 90% F1 score), thus confirming the quality of the synthetic datasets generated with both proprietary and open source LLMs. The dataset we generated using the open source model is made publicly available for download.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative Artificial Intelligence</kwd>
        <kwd>Sensitive data detection</kwd>
        <kwd>NER</kwd>
        <kwd>BERT</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today’s digital era, safeguarding personal data has become a priority, especially with the
advent of the GDPR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For digital archives, it is essential to identify documents containing
sensitive data, ensuring compliance and effective information management. The GDPR details
two main categories of personal data: the first one includes information that can directly lead to
the identification of an individual, while the second one includes a broader range of expressions
that disclose sensitive aspects of a person’s life. This second category is the focus of the present
work and will be referred to as sensitive data. In particular, we deal with six key categories
of sensitive data: (i) Health: Physical and mental well-being of individuals, with details
regarding existing diagnoses, medical conditions, and disabilities; (ii) Political: Individual’s
political beliefs, political orientation, specific party affiliation, as well as membership in
trade unions or similar organizations; (iii) Sexuality: Individual’s sexual orientation, habits,
and gender identity; (iv) Judicial: Legal matters, such as offenses, crimes, charges, pending
criminal proceedings, accusations, and trial proceedings involving an individual; (v) Philosophy:
Individual’s philosophical and religious beliefs and affiliations; (vi) Ethnic: Individual’s ethnic
origin and heritage.
      </p>
      <p>
        The present article describes an original approach to implement a system based on machine
learning classifiers to automatically detect sensitive data in text documents. The proposed
method relies on Large Language Models (LLMs) to generate synthetic documents with
“artificial” sensitive data, which can then be used to train Transformer-based text classifiers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our
empirical investigations show that a neural model based on the Bidirectional Encoder
Representations from Transformers (BERT) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] architecture adapted for Named Entity Recognition (NER)
achieves the best detection performance, both when trained on data generated by proprietary
LLMs like GPT-4 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and when trained on synthetic data generated using open source LLMs
such as OpenLLaMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The dataset generated using the open source LLM is made publicly
available for download to promote further research in this domain.
      </p>
      <p>The paper is structured as follows: Section 2 presents the current state of research on sensitive
data detection, Section 3 details the process of automated generation and labeling of synthetic
corpora, and our method based on BERT. Section 4 reports the experimental results and Section
5 discusses some limitations of our method and possible directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        While the problem of detecting Personally Identifiable Information (PII) has been extensively
studied in both academic and industrial settings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the task of identifying sensitive data has
been much less explored [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Training corpora with sensitive data</title>
        <p>
          The nature of this topic makes it difficult to find real-world documents containing sensitive
data, since organizations are generally unwilling to grant access to private documents due to
concerns regarding proper data handling protocols [
          <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
          ]. This is especially true in the Italian
scenario, which is the specific focus of our inquiry, where research on sensitive data detection
is primarily based on manually curated datasets that are not released for public use [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Some publicly available datasets involve classifying emails from the Enron corpus [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
detecting privacy leaks in Tweets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and health-related information [14]. One approach,
employed by Petrolini et al. [15], involves extracting conversations from specific subsections
of the Reddit forum that deal with sensitive topics. Although collecting datasets from scraped
tweets or Reddit messages is a cost-effective way to obtain sensitive data, their lack of diversity
may hinder their effectiveness in training models for various types of documents. Gambarelli et
al. [16] manually curated two datasets containing various categories of sensitive data. Such
corpora are undoubtedly of higher quality, but are also more expensive to build due to the need
for manual labeling, often requiring the involvement of domain experts.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Machine learning models</title>
        <p>A variety of machine learning models and deep learning architectures have been employed to
perform Natural Language Processing (NLP) tasks, such as text classification and NER. Various
architectures are involved in the domain of PII and sensitive data detection, from Convolutional
Neural Networks (CNN) [17] to Transformer-based models like BERT [14]. In Karl and Scherp
[18] a comparative investigation is carried out to evaluate the performance of various methods
in the domain of short text classification, highlighting Transformer-based models as the best
performing in terms of accuracy and speed. The BERT model is also used by Petrolini et al.
[15] and Gambarelli et al. [16]. The former work proposes a method that relies on identifying a
“sensitive topic” and a PII that can be linked to it. However, personal data is often mentioned
separately from the related sensitive topic or may not be actually related to it. Our approach
aims to make detection more robust by feeding the entire document to the classifier: this enables
the model to consider the complete context and develop an understanding of the relationship
between a person and the disclosure of their sensitive data. The latter work instead introduces a
multi-step inference pipeline in which a first prediction distinguishes between sensitive
and non-sensitive sentences, and a finer inference then classifies the category of each
sensitive sentence. Our approach uses a single BERT model that discriminates
between the six sensitive categories and a non-sensitive one, thus speeding up
inference and decreasing the memory load.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Our proposal involves leveraging the generation capabilities of recent LLM architectures to
generate documents and perform automatic labeling, reducing the data acquisition costs.</p>
      <sec id="sec-3-1">
        <title>3.1. Document generation and data labeling</title>
        <p>
          The procedure we propose for creating synthetic training data involves two distinct phases:
document generation, which consists in creating documents of specific types, and span
labeling, which requires explicitly detecting and categorizing the sensitive data spans within the
generated documents. We use the term span to denote a segment of text, varying in size, that is
of particular interest, specifically one that reveals sensitive information. In our experiments,
we used two families of LLMs: BingAI, a chat interface integrated into Microsoft's browser
and powered by GPT-4 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]; and OpenLLaMA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], an open and commercially permissive
reimplementation of LLaMA<sup>1</sup>.
        </p>
        <p>For document generation, we defined a list of document types (e.g., clinical records, medical
prescriptions, and criminal records) that might contain sensitive data belonging to one of
the six categories mentioned in Section 1. An automated script was devised to prompt the
LLMs to generate such documents containing sensitive data. The prompt template is the
following: "Puoi generare un documento di finzione ma realistico riguardante NAME del tipo
“DOCUMENT_TITLE”, che includa informazioni riguardo SENSITIVE_INFO di NAME?" (in English:
"Can you generate a fictional but realistic document about NAME of type “DOCUMENT_TITLE”
that includes information about the SENSITIVE_INFO of NAME?"). The
procedure was similar for BingAI and OpenLLaMA, but for the latter model a custom
system prompt was instantiated, asking it to act as a document generator: this trick belongs to
a set of prompt engineering techniques, based on carefully crafted prompts, that have been shown to improve
the quality of text generation [19].</p>
        <p><sup>1</sup> https://github.com/facebookresearch/llama/blob/main/LICENSE (Last visited: June 2023)</p>
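        <p>As an illustration of the prompt construction step, the following minimal Python sketch fills the generation template for one document. It is not the script used in this work: the brace-style placeholders, the function name build_generation_prompt and the example values are assumptions introduced only for this example.</p>
        <preformat>
```python
# Generation template quoted in the text, rewritten with brace placeholders.
# The brace syntax is an assumption of this sketch, not the original script.
GENERATION_TEMPLATE = (
    "Puoi generare un documento di finzione ma realistico riguardante {name} "
    "del tipo “{document_title}”, che includa informazioni riguardo "
    "{sensitive_info} di {name}?"
)


def build_generation_prompt(name, document_title, sensitive_info):
    """Return the generation prompt for one (person, document type, topic) triple."""
    return GENERATION_TEMPLATE.format(
        name=name,
        document_title=document_title,
        sensitive_info=sensitive_info,
    )


# Hypothetical values, used only to show the call.
prompt = build_generation_prompt(
    name="Mario Rossi",
    document_title="cartella clinica",
    sensitive_info="lo stato di salute",
)
print(prompt)
```
        </preformat>
        <p>Iterating such a function over lists of names, document types and sensitive topics is one simple way to obtain a varied set of prompts.</p>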
        <p>The span labeling phase has been approached in two distinct ways. For the BingAI model, a
prompt was built based on the type of sensitive data the document is supposed to contain: the
prompt asks the model to reproduce the document provided as input, but with the sensitive information
spans “censored”, i.e., replaced with a specific tag. To guide the model in detecting specific
types of sensitive information, the prompt is automatically constructed based on the known
sensitive data category associated with the given document. This approach has been found to
be more effective than simply asking the model to return the sensitive spans themselves. Similar to the
prompt used for document generation, the labeling prompt also follows a structured format with
specific variable words that are filled based on the document type and the associated sensitive
information: "Puoi censurare tutte e sole le porzioni di frasi che contengono informazioni o possono
ricondursi a SENSITIVE_INFO di NAME? Fornisci il documento con sole frasi che non hanno niente
a che fare con SENSITIVE_INFO di NAME. Leggendo il documento non devo essere in grado di
ricostruire alcun’informazione relativa a SENSITIVE_INFO di NAME. Usa l’etichetta [LABEL] per
sostituire le porzioni di frase che contengono informazioni relative a SENSITIVE_INFO di NAME."
(in English: "Can you censor all and only the portions of sentences that contain information about, or
can be traced back to, the SENSITIVE_INFO of NAME? Provide the document with only sentences that
have nothing to do with the SENSITIVE_INFO of NAME. Reading the document, I must not be able to
reconstruct any information about the SENSITIVE_INFO of NAME. Use the label [LABEL] to replace
the portions of sentences that contain information about the SENSITIVE_INFO of NAME.")</p>
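        <p>The text does not specify how the censored output is mapped back to labeled spans. One possible way, sketched below under that assumption, is to align the original and the censored document at the token level with the standard difflib module and to collect the original tokens that were replaced by the tag. The function name recover_spans and the example sentence are hypothetical.</p>
        <preformat>
```python
from difflib import SequenceMatcher


def recover_spans(original, censored, tag="[LABEL]"):
    """Return the original text segments that were replaced by `tag`.

    Both documents are split on whitespace and aligned token by token.
    """
    orig_tokens = original.split()
    cens_tokens = censored.split()
    matcher = SequenceMatcher(a=orig_tokens, b=cens_tokens, autojunk=False)
    spans = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        # A "replace" block whose censored side contains the tag marks a span.
        if op == "replace" and any(tag in token for token in cens_tokens[b0:b1]):
            spans.append(" ".join(orig_tokens[a0:a1]))
    return spans


# Hypothetical example document and its censored version.
original = "Mario Rossi soffre di diabete di tipo 2 e vive a Padova ."
censored = "Mario Rossi [LABEL] e vive a Padova ."
print(recover_spans(original, censored))
```
        </preformat>
        <p>A real pipeline would also need to handle outputs in which the model paraphrases the uncensored text, which this simple alignment does not address.</p>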
        <p>The OpenLLaMA model, being a much smaller (13 billion parameters) and less capable model,
required a few-shot learning approach to get the best results. A predefined set of sentences, each
with corresponding labels, is incorporated into the prompt, tailored to the type of sensitive data
to be labeled. Subsequently, the document to be labeled is tokenized into sentences, maintaining
a consistent format with the provided examples. This approach proves effective in guiding the
model both to comprehend the nuances of sensitive data and to adhere to a programmatically
exploitable format for document labeling.</p>
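        <p>The few-shot labeling procedure can be sketched as follows. The "Sentence:" / "Label:" layout, the naive sentence splitter and the function names are assumptions made for illustration; the actual prompt format and example sentences used in this work are not reported here.</p>
        <preformat>
```python
import re


def split_sentences(text):
    """Naive splitter: a run of characters ended by '.', '!' or '?'."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]+", text)]


def build_few_shot_prompt(examples, sentence):
    """Build a prompt from (sentence, label) examples plus the sentence to label."""
    lines = []
    for example_sentence, example_label in examples:
        lines.append("Sentence: " + example_sentence)
        lines.append("Label: " + example_label)
        lines.append("")
    # The query uses the same layout, leaving the label for the model to fill in.
    lines.append("Sentence: " + sentence)
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical examples for the Health category.
examples = [
    ("Il paziente soffre di asma cronica.", "soffre di asma cronica"),
    ("La visita si terrà lunedì.", "NONE"),
]
document = "Mario ha il diabete. Abita a Padova."
prompts = [build_few_shot_prompt(examples, s) for s in split_sentences(document)]
print(prompts[0])
```
        </preformat>
        <p>Keeping the query in exactly the same layout as the examples is what makes the model output easy to parse programmatically.</p>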
        <p>Supplementary documents, consisting of paragraphs extracted from Wikipedia and covering
specific categories related to sensitive data, were also included in the dataset. The addition
of text addressing sensitive topics without disclosing sensitive information about individuals
(e.g., general articles about politics or illnesses) was intended to enhance the robustness of the
models. In particular, this strategy helps prevent models from incorrectly associating the
vocabulary of sensitive topics with the actual disclosure of sensitive information.</p>
        <p>For comparison, our dataset generated by OpenLLaMA comprises 26,821 data points if split
at the sentence level, largely exceeding the dataset proposed by Gambarelli et al. [16], which
contains 5,562 sentences in its fine-grained version. In particular, our open dataset features
370 documents related to the categories health and sexuality, 191 judicial, 96 political, 132
philosophical, 134 ethnic, 638 non-sensitive and 490 of mixed categories, for a total of 2,051
documents. The dataset is freely available for download along with a detailed description of its
structure<sup>2</sup>.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sensitive data detection</title>
        <p>We tested three different classification models, each based on a different variation of the
basic BERT architecture. Due to the imbalanced distribution in the training data, where over
70% of tokens correspond to non-sensitive spans, we employed a weighted softmax loss for
all classification models. This approach assigns higher weights to the sensitive data classes,
mitigating the bias in favor of the majority class, as discussed in [20]. To evaluate the
models, a test set composed of 50 documents generated with BingAI was created. Notably, the
test set was built to also include document types that were not present in the training set,
to further test the robustness of the sensitive data detection models.</p>
        <p><sup>2</sup> https://github.com/SimoDR/sensitive-data-detection</p>
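        <p>To make the idea of a weighted softmax loss concrete, the sketch below computes per-class weights and the weighted cross-entropy of a single token in plain Python. The inverse-frequency weighting scheme and the function names are assumptions of this example; the text only states that the sensitive classes receive higher weights.</p>
        <preformat>
```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


def weighted_cross_entropy(logits, target_index, weight):
    """Cross-entropy of one example under a softmax, scaled by its class weight."""
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    log_prob = logits[target_index] - log_sum
    return -weight * log_prob


# Hypothetical token labels: 70% non-sensitive, 30% sensitive.
labels = ["O"] * 7 + ["I-Health"] * 3
weights = inverse_frequency_weights(labels)
loss = weighted_cross_entropy([0.2, 1.5], target_index=1, weight=weights["I-Health"])
print(weights, loss)
```
        </preformat>
        <p>In PyTorch, a comparable effect can be obtained by passing a tensor of class weights to the weight argument of torch.nn.CrossEntropyLoss.</p>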
        <p>The results were evaluated in terms of precision, recall and F1 score on the categories of
sensitive data in the task of span detection. The evaluation metrics are based on the methodology
of Segura-Bedmar et al. [21] for NER evaluation, with individual tokens serving as the unit for counting
True Positives, False Negatives, and False Positives. We also tested the BingAI model as a
zero-shot detection model, i.e., we prompted it to perform NER on a document without providing any
example.</p>
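        <p>A minimal sketch of token-level scoring for one category is given below. It only illustrates how True Positives, False Positives and False Negatives are counted when tokens are the unit of evaluation; the function name and the toy tag sequences are hypothetical, and the full evaluation scheme of [21] is not reproduced.</p>
        <preformat>
```python
def token_level_scores(gold, predicted, category):
    """Precision, recall and F1 for one category, counting individual tokens."""
    tp = fp = fn = 0
    for gold_tag, predicted_tag in zip(gold, predicted):
        if predicted_tag == category and gold_tag == category:
            tp += 1
        elif predicted_tag == category:
            fp += 1
        elif gold_tag == category:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative.
gold = ["O", "I-Health", "I-Health", "O", "I-Health"]
predicted = ["O", "I-Health", "O", "I-Health", "I-Health"]
print(token_level_scores(gold, predicted, "I-Health"))
```
        </preformat>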
        <sec id="sec-3-2-1">
          <title>3.2.1. Sentence Classification (SC)</title>
          <p>We used BERT as a text classifier, where each sentence is classified into one of six sensitive
categories plus a non-sensitive one. This corresponds to a multi-class text classification task,
where each sentence serves as a distinct data point in the dataset. As discussed in Section 2,
determining whether a sentence is sensitive or not is also dependent on the context in which
the sentence is embedded.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Sentence Classification with Context (SCC)</title>
          <p>To address the limitations of the SC model, in this version we included contextual information
from the surrounding text along with each sentence to improve the classification task. Therefore,
as input to this model we used two chunks of text: the one to be classified and the surrounding
text, forming the context. They are separated by the special token [SEP], here used to help BERT
consider the difference between the two chunks. Notably, the chunks are of fixed length, thereby
obviating the need for sentence tokenization. The context also adheres to a predetermined
length, ensuring consistency across the training examples. To generate the training examples
for the SCC model, a sliding window approach is used. By using a stride, the training examples
are partially overlapping, effectively introducing a form of data augmentation. Although
this approach resolves the issue encountered in the SC model by incorporating contextual
information within each chunk, the sliding window approach requires the model to perform
inference on a significantly larger number of inputs, limiting its computational efficiency. As a
result, this limitation has led us to treat the task as a token classification problem instead of a
sequence classification problem.</p>
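          <p>The sliding window construction described above can be sketched as follows. The chunk, context and stride sizes are hypothetical (no values are given in the text), and taking the context from both sides of the chunk is an assumption of this example.</p>
          <preformat>
```python
def sliding_windows(tokens, chunk_size, context_size, stride):
    """Yield (chunk, context) pairs of fixed maximum length over a token list.

    With stride smaller than chunk_size, consecutive chunks overlap.
    Trailing tokens that do not fill a whole step are ignored in this sketch.
    """
    examples = []
    last_start = max(len(tokens) - chunk_size, 0)
    for start in range(0, last_start + 1, stride):
        end = start + chunk_size
        chunk = tokens[start:end]
        left = tokens[max(start - context_size, 0):start]
        right = tokens[end:end + context_size]
        examples.append((chunk, left + right))
    return examples


def to_bert_input(chunk, context):
    """Join the two text chunks with the special tokens of a BERT sentence pair."""
    return "[CLS] " + " ".join(chunk) + " [SEP] " + " ".join(context) + " [SEP]"


# Toy document of eight single-letter tokens.
tokens = list("abcdefgh")
for chunk, context in sliding_windows(tokens, chunk_size=4, context_size=2, stride=2):
    print(to_bert_input(chunk, context))
```
          </preformat>
          <p>Since every window is a separate model input, halving the stride roughly doubles the number of inference calls, which is the efficiency cost discussed above.</p>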
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Named Entity Recognition (NER)</title>
          <p>This approach involves the identification and categorization of significant information, known
as named entities, within a given text. By classifying each token and identifying consecutive
tokens with the same label, we can concatenate them to form spans that represent specific
categories. In this case, we used the BERT model with a linear layer that performs classification
for each token, using a softmax function to determine the most probable label for each token.
For labeling, we adopted a variation of the BIO format, in which tokens are tagged as either B (beginning),
I (inside), or O (outside) of an entity [22]. In our implementation, we do not use the B tag, as
chunk beginnings are relatively rare compared to tokens inside and outside
of chunks. We also split the documents into fixed-length chunks with a specified stride. This
approach augments the data and allows the model to focus on shorter paragraphs within the
text, as opposed to processing the entire document. In the final dataset, each
token is assigned a label in the format “I-” followed by one of the six sensitive categories, or
“O” for the non-sensitive one.</p>
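          <p>The step that turns per-token labels into spans can be illustrated with the short sketch below, which groups consecutive tokens sharing the same label. The function name and the example sentence are hypothetical.</p>
          <preformat>
```python
from itertools import groupby


def tags_to_spans(tokens, tags):
    """Merge consecutive tokens with the same non-O tag into labeled spans.

    Returns tuples (category, start, end, text), with end exclusive.
    """
    spans = []
    position = 0
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = [token for token, _ in group]
        if tag != "O":
            # Strip the "I-" prefix to obtain the category name.
            spans.append((tag[2:], position, position + len(words), " ".join(words)))
        position += len(words)
    return spans


# Hypothetical tagged sentence.
tokens = ["Mario", "soffre", "di", "diabete", "da", "anni"]
tags = ["O", "I-Health", "I-Health", "I-Health", "O", "O"]
print(tags_to_spans(tokens, tags))
```
          </preformat>
          <p>A consequence of dropping the B tag is that two adjacent spans of the same category cannot be told apart and are merged into one.</p>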
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The first two rows of Table 1 present the performance metrics of the three classification models
when trained on either the proprietary dataset or the open synthetic dataset. All classifiers
significantly outperformed the “Zero-shot BingAI” model, and the NER model achieved superior
performance compared to the other classifiers on both datasets. This might be attributed to
the fact that the training dataset for the NER model incorporates documents that were not
specifically generated and labeled by BingAI. In the zero-shot setting, a single generic prompt was used
for BingAI, unlike the labeling stage, where each category of sensitive data had a
distinct, tailored prompt. It is also worth mentioning that providing reference examples
to BingAI as part of the prompt might have considerably improved the results.
However, in this experimental setting, our objective was to evaluate the zero-shot capabilities
of the model as an out-of-the-box tool.</p>
      <p>The lower quality of the OpenLLaMA dataset results in a slightly lower detection accuracy,
although the difference is almost negligible. The graph in Figure 1 further investigates this issue by comparing
how the performance of the NER model changes with different sizes of the artificial training
datasets. The lower quality of the OpenLLaMA dataset makes it necessary to generate a significantly larger
number of artificial samples to achieve a similar classification accuracy (2K documents vs. 860).</p>
      <p>The second two rows of Table 1 show the same comparison, but the metrics are applied at
document level. Since our primary goal is to detect whether a document contains sensitive
data or not, this task evaluates the models’ capability to classify documents into one of the six
sensitive classes or the non-sensitive class. In this assessment, each document is assigned one or
more labels based on the presence of at least one span corresponding to each sensitive class in its
text. Results show that the NER model still significantly outperforms all the other approaches,
reaching a weighted F1 score above 90% when trained on either of the artificial datasets.</p>
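      <p>The document-level evaluation can be sketched as follows: a document receives every category for which at least one token is tagged, and per-class F1 scores are averaged with weights proportional to class support. The function names, the "Non-sensitive" label string and the numbers in the example are assumptions made for illustration.</p>
      <preformat>
```python
def document_labels(token_tags):
    """Return the set of categories with at least one tagged token."""
    labels = {tag[2:] for tag in token_tags if tag != "O"}
    # A document without any sensitive token falls in the non-sensitive class.
    return labels or {"Non-sensitive"}


def weighted_f1(per_class_f1, support):
    """Average per-class F1 scores, weighting each class by its support."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total


# Hypothetical predictions for one document and hypothetical per-class scores.
print(document_labels(["O", "I-Health", "I-Judicial", "O"]))
print(weighted_f1({"Health": 1.0, "Judicial": 0.5}, {"Health": 3, "Judicial": 1}))
```
      </preformat>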
      <p>As a final analysis, we compared the execution time and the throughput of the three classifiers
by collecting data from 10 distinct runs. The SC model exhibited the lowest latency in terms of
average time per document (2.09 ± 0.15 s) and the highest throughput (0.46 ± 0.03 q/s). The NER
model lagged slightly behind, both in terms of average time per document (2.30 ± 0.19 s) and
throughput (0.42 ± 0.04 q/s). The slowest model was SCC, both for average time (13.35 ± 1.01
s) and throughput (0.08 ± 0.01 q/s). This evaluation does not include the BingAI solution, since
its inference speed depends on various external factors, such as network connection quality and
current traffic conditions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper introduced a novel approach to identify sensitive data in text documents, in line with
the legal foundation of the GDPR. The proposed method relies on LLMs to generate artificial datasets
containing sensitive data: our results show that smaller, open source LLMs running in local
environments (OpenLLaMA) can produce text of sufficient quality to train classification models
that perform nearly as well as those trained on higher-quality data generated by proprietary
LLMs (BingAI). Among the considered models, the NER-based model achieved a remarkable
performance, with a weighted F1 score above 70% for the sentence classification task and an
F1 score above 90% for the document classification task.</p>
      <p>Future research should explore the performance of more recent open source LLM
architectures, potentially yielding superior performance in text generation and labeling accuracy. In
addition, it might be interesting to test the zero-shot learning capabilities of open source LLMs:
such an investigation would make it possible to include an assessment of resource utilization, including
considerations of speed, weight, and overall performance.</p>
      <p>Y. Chen, J. Vaidya (Eds.), Proceedings of the 10th annual ACM workshop on Privacy in
the electronic society, WPES, ACM, 2011, pp. 1–12. URL: https://doi.org/10.1145/2046556.2046558. doi:10.1145/2046556.2046558.</p>
      <p>[14] A. G. Pablos, N. Pérez, M. Cuadros, Sensitive data detection and classification in Spanish
clinical text: Experiments with BERT, CoRR abs/2003.03106 (2020). URL: https://arxiv.org/abs/2003.03106.</p>
      <p>[15] M. Petrolini, S. Cagnoni, M. Mordonini, Automatic detection of sensitive data using
transformer-based classifiers, Future Internet 14 (2022) 228. URL: https://doi.org/10.3390/fi14080228. doi:10.3390/fi14080228.</p>
      <p>[16] G. Gambarelli, A. Gangemi, R. Tripodi, Is your model sensitive? SPEDAC: A new resource
for the automatic classification of sensitive personal data, IEEE Access 11 (2023) 10864–10880. URL: https://doi.org/10.1109/ACCESS.2023.3240089. doi:10.1109/ACCESS.2023.3240089.</p>
      <p>[17] C. Pearson, N. Seliya, R. Dave, Named entity recognition in unstructured medical
text documents, CoRR abs/2110.15732 (2021). URL: https://arxiv.org/abs/2110.15732.
arXiv:2110.15732.</p>
      <p>[18] F. Karl, A. Scherp, Transformers are short text classifiers: A study of inductive
short text classifiers on benchmarks and real-world datasets, CoRR abs/2211.16878
(2022). URL: https://doi.org/10.48550/arXiv.2211.16878. doi:10.48550/ARXIV.2211.16878. arXiv:2211.16878.</p>
      <p>[19] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C.
Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, CoRR
abs/2302.11382 (2023). URL: https://doi.org/10.48550/arXiv.2302.11382. doi:10.48550/arXiv.2302.11382. arXiv:2302.11382.</p>
      <p>[20] H. Zhu, Y. Yuan, G. Hu, X. Wu, N. Robertson, Imbalance robust softmax for deep embedding
learning, in: Proceedings of the Asian Conference on Computer Vision, 2020.</p>
      <p>[21] I. Segura-Bedmar, P. Martínez, M. Herrero-Zazo, SemEval-2013 task 9: Extraction of
drug-drug interactions from biomedical texts (DDIExtraction 2013), in: M. T. Diab, T. Baldwin,
M. Baroni (Eds.), Proceedings of the 7th International Workshop on Semantic Evaluation,
SemEval@NAACL-HLT 2013, The Association for Computer Linguistics, 2013, pp. 341–350.
URL: https://aclanthology.org/S13-2056/.</p>
      <p>[22] L. A. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in:
D. Yarowsky, K. Church (Eds.), Third Workshop on Very Large Corpora, VLC@ACL 1995,
1995. URL: https://aclanthology.org/W95-0107/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>European Commission</collab>
          ,
          <article-title>Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance)</article-title>
          ,
          <year>2016</year>
          . URL: https://eur-lex.europa.eu/eli/reg/2016/679/oj.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          (
          <year>2019</year>
          )
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>CoRR</source>
          abs/2303.08774 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/ARXIV.2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          , H. Liu,
          <source>OpenLLaMA: An open reproduction of LLaMA</source>
          ,
          <year>2023</year>
          . URL: https://github.com/openlm-research/open_llama, online, last accessed 2023-06-19.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Aprosio</surname>
          </string-name>
          ,
          <article-title>REDIT: A tool and dataset for extraction of personal data in documents of the public administration domain</article-title>
          , in: E. Fersini, M. Passarotti, V. Patti (Eds.),
          <source>Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021</source>
          , Milan, Italy, January 26-28, 2022, volume
          <volume>3033</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          . URL: https://ceur-ws.org/Vol-3033/paper58.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Automated identification of sensitive data from implicit user specification</article-title>
          ,
          <source>Cybersecurity</source>
          <volume>1</volume>
          (
          <year>2018</year>
          )
          13. URL: https://doi.org/10.1186/s42400-018-0011-x. doi:10.1186/S42400-018-0011-X.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wilms</surname>
          </string-name>
          ,
          <article-title>Guide on good data protection practice in research</article-title>
          , European University Institute (
          <year>2019</year>
          ). URL: https://www.eui.eu/documents/servicesadmin/deanofstudies/researchethics/guide-data-protection-research.pdf, online, last accessed 2023-11-24.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pigeot</surname>
          </string-name>
          ,
          <article-title>Consent and confidentiality in the light of recent demands for data sharing</article-title>
          ,
          <source>Biometrical Journal</source>
          <volume>59</volume>
          (
          <year>2017</year>
          )
          <fpage>240</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Borgerud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Borglund</surname>
          </string-name>
          ,
          <article-title>Open research data, an archival challenge?</article-title>
          ,
          <source>Archival Science</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>279</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lorè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Appice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Gemmis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <article-title>An AI framework to support decisions on GDPR compliance</article-title>
          ,
          <source>J. Intell. Inf. Syst.</source>
          <volume>61</volume>
          (
          <year>2023</year>
          )
          <fpage>541</fpage>
          -
          <lpage>568</lpage>
          . URL: https://doi.org/10.1007/s10844-023-00782-4. doi:10.1007/S10844-023-00782-4.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Klimt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>The Enron corpus: A new dataset for email classification research</article-title>
          , in: J. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.),
          <source>Machine Learning: ECML 2004, 15th European Conference on Machine Learning</source>
          , volume
          <volume>3201</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2004</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>226</lpage>
          . URL: https://doi.org/10.1007/978-3-540-30115-8_22. doi:10.1007/978-3-540-30115-8_22.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shuai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kapadia</surname>
          </string-name>
          ,
          <article-title>Loose tweets: an analysis of privacy leaks on Twitter</article-title>
          , in:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>