Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data

Simone De Renzis¹, Dennis Dosso² and Alberto Testolin¹,³
¹ Department of Mathematics, University of Padova, Italy
² Siav S.p.A., Italy
³ Department of General Psychology, University of Padova, Italy

Abstract
This paper describes a machine learning system designed to identify sensitive data within Italian text documents, aligning with the definitions and regulations outlined in the General Data Protection Regulation (GDPR). To overcome the lack of suitable training datasets, which would require the disclosure of sensitive data from real users, the proposed system exploits a Large Language Model (LLM) to generate synthetic documents that can be used to train supervised classifiers to detect the target sensitive data. We show that "artificial" sensitive data can be generated using both proprietary and open source LLMs, demonstrating that the proposed approach can be implemented either by using external services or by relying on locally runnable models. We focus on the detection of six key domains of sensitive data by training supervised classifiers based on the BERT Transformer architecture, adapted to carry out text classification and Named-Entity Recognition (NER) tasks. We evaluate the performance of the system using fine-grained metrics and show that the NER model can achieve a remarkable detection performance (over 90% F1 score), thus confirming the quality of the synthetic datasets generated with both proprietary and open source LLMs. The dataset we generated using the open source model is made publicly available for download.

Keywords
Generative Artificial Intelligence, Sensitive data detection, NER, BERT, LLM

1. Introduction

In today's digital era, safeguarding personal data has become a priority, especially with the advent of the GDPR [1]. For digital archives, it is essential to identify documents containing sensitive data, ensuring compliance and effective information management. The GDPR details two main categories of personal data: the first one includes information that can directly lead to the identification of an individual, while the second one includes a broader range of expressions that disclose sensitive aspects of a person's life. This second category is the focus of the present work and will be referred to as sensitive data. In particular, we deal with six key categories of sensitive data: (i) Health: physical and mental well-being of individuals, with details regarding existing diagnoses, medical conditions, and disabilities; (ii) Political: the individual's political beliefs, political orientation, and specific party affiliation, as well as membership in
trade unions or similar organizations; (iii) Sexuality: the individual's sexual orientation, habits, and gender identity; (iv) Judicial: legal matters, such as offenses, crimes, charges, pending criminal proceedings, accusations, and trial proceedings involving an individual; (v) Philosophy: the individual's philosophical and religious beliefs and affiliations; (vi) Ethnic: the individual's ethnic origin and heritage.

The present article describes an original approach to implement a system based on machine learning classifiers that automatically detects sensitive data in text documents. The proposed method relies on Large Language Models (LLMs) to generate synthetic documents with "artificial" sensitive data, which can then be used to train Transformer-based text classifiers [2]. Our empirical investigations show that a neural model based on the Bidirectional Encoder Representations from Transformers (BERT) [3] architecture, adapted for Named Entity Recognition (NER), achieves the best detection performance, both when trained on data generated by proprietary LLMs like GPT-4 [4] and when the synthetic data is generated using open source LLMs such as OpenLLaMa [5]. The dataset generated using the open source LLM is made publicly available for download to promote further research in this domain.

The paper is structured as follows: Section 2 presents the current state of research on sensitive data detection; Section 3 details the process of automated generation and labeling of synthetic corpora, as well as our method based on BERT; Section 4 reports the experimental results; and Section 5 discusses some limitations of our method and possible directions for future research.

2. Related Work

While the problem of detecting Personally Identifiable Information (PII) has been extensively studied in both academic and industrial settings [6], the task of identifying sensitive data has been much less explored [7].

2.1. Training corpora with sensitive data

The nature of this topic makes it difficult to find real-world documents containing sensitive data, since organizations are generally unwilling to grant access to private documents due to concerns regarding proper data handling protocols [8, 9, 10]. This is especially true in the Italian scenario, which is the specific focus of our inquiry, where research on sensitive data detection is primarily based on manually curated datasets that are not released for public use [11]. Some publicly available datasets involve classifying emails from the Enron corpus [12], detecting privacy leaks in tweets [13], and identifying health-related information [14]. One approach, employed by Petrolini et al. [15], involves extracting conversations from specific subsections of the Reddit forum that deal with sensitive topics. Although collecting datasets from scraped tweets or Reddit messages is a cost-effective way to obtain sensitive data, their lack of diversity may hinder their effectiveness in training models for various types of documents. Gambarelli et al. [16] manually curated two datasets containing various categories of sensitive data. Such corpora are undoubtedly of higher quality, but are also more expensive to build due to the need for manual labeling, often requiring the involvement of domain experts.
2.2. Machine learning models

A variety of machine learning models and deep learning architectures have been employed to perform Natural Language Processing (NLP) tasks such as text classification and NER. Several architectures have been applied to PII and sensitive data detection, from Convolutional Neural Networks (CNN) [17] to Transformer-based models like BERT [14]. Karl and Scherp [18] carried out a comparative investigation to evaluate the performance of various methods for short text classification, highlighting Transformer-based models as the best performing in terms of accuracy and speed. The BERT model is also used by Petrolini et al. [15] and Gambarelli et al. [16]. The first work proposes a method that relies on identifying a "sensitive topic" and a PII that can be linked to it. However, personal data is often mentioned separately from the related sensitive topic, or may not actually be related to it. Our approach aims to make detection more robust by feeding the entire document to the classifier: this enables the model to consider the complete context and develop an understanding of the relationship between a person and the disclosure of their sensitive data. The second work instead introduces a multi-step inference pipeline, in which a first prediction distinguishes between sensitive and non-sensitive sentences, and a finer inference then classifies the category of each sensitive sentence. Our approach uses a single BERT model that discriminates between the six sensitive categories and a non-sensitive one, thus speeding up inference and decreasing the memory load.

3. Methods

Our proposal leverages the generation capabilities of recent LLM architectures to generate documents and perform automatic labeling, reducing data acquisition costs.

3.1. Document generation and data labeling

The procedure we propose for creating synthetic training data involves two distinct phases: document generation, which consists in the creation of documents of specific types, and span labeling, which requires explicitly detecting and categorizing the sensitive data spans within the generated documents. We use the term span to denote a segment of text, varying in size, that is of particular interest, specifically one that reveals sensitive information. In our experiments, we used two families of LLMs: BingAI, a chat interface integrated into the Microsoft browser and powered by GPT-4 [4], and OpenLLaMa [5], an open and commercially permissive reimplementation of LLaMa (https://github.com/facebookresearch/llama/blob/main/LICENSE, last visited: June 2023).

For document generation, we defined a list of document types (e.g., clinical records, medical prescriptions, criminal records, etc.) that might contain sensitive data belonging to one of the six categories mentioned in Section 1. An automated script was devised to prompt the LLMs to generate such documents containing sensitive data. The template of the prompt looks like this: "Puoi generare un documento di finzione ma realistico riguardante NAME del tipo "DOCUMENT_TITLE", che includa informazioni riguardo SENSITIVE_INFO di NAME?" (roughly: "Can you generate a fictional but realistic document about NAME of the type 'DOCUMENT_TITLE', including information about SENSITIVE_INFO of NAME?"). This procedure was similar for BingAI and OpenLLaMa, but for the latter model a custom system prompt was instantiated, asking it to act as a document generator: this trick belongs to a family of prompt engineering techniques, based on carefully crafted prompts, that have been shown to improve the quality of text generation [19].
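As an illustration, the generation step could be scripted along the following lines. This is a minimal sketch: the document-type mapping, the system prompt text, and the OpenLLaMa checkpoint id are illustrative assumptions, not necessarily those used in our experiments.

```python
# Minimal sketch of the document-generation step described above.
# The document-type mapping, SYSTEM_PROMPT text and model checkpoint
# are illustrative placeholders.
from transformers import pipeline

GENERATION_TEMPLATE = (
    'Puoi generare un documento di finzione ma realistico riguardante {name} '
    'del tipo "{document_title}", che includa informazioni riguardo '
    '{sensitive_info} di {name}?'
)

# Hypothetical system prompt asking the model to act as a document generator.
SYSTEM_PROMPT = "Sei un generatore di documenti realistici ma di finzione.\n\n"

# Hypothetical mapping from document types to the sensitive information they disclose.
DOCUMENT_TYPES = {
    "cartella clinica": "lo stato di salute",
    "fedina penale": "i procedimenti giudiziari",
}

def build_generation_prompt(name: str, document_title: str, sensitive_info: str,
                            use_system_prompt: bool = False) -> str:
    """Fill the generation template; optionally prepend the system prompt (OpenLLaMa case)."""
    prompt = GENERATION_TEMPLATE.format(
        name=name, document_title=document_title, sensitive_info=sensitive_info
    )
    return (SYSTEM_PROMPT + prompt) if use_system_prompt else prompt

if __name__ == "__main__":
    # Assumed checkpoint id; any locally runnable causal LM would do.
    generator = pipeline("text-generation", model="openlm-research/open_llama_13b")
    prompt = build_generation_prompt("Mario Rossi", "cartella clinica", "lo stato di salute",
                                     use_system_prompt=True)
    document = generator(prompt, max_new_tokens=512, do_sample=True)[0]["generated_text"]
    print(document)
```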
The span labeling phase was approached in two distinct ways. For the BingAI model, a prompt was built based on the type of sensitive data the document is supposed to contain: the prompt asks to return the document provided as input, but with the sensitive information spans "censored", i.e., concealed with a specific tag. To guide the model in detecting specific types of sensitive information, the prompt is automatically constructed based on the sensitive category known to be associated with the given document. This approach was found to be more effective than simply asking the model to return the sensitive spans themselves. Similar to the prompt used for document generation, the labeling prompt also follows a structured format, with specific variable words that are filled based on the document type and the associated sensitive information: "Puoi censurare tutte e sole le porzioni di frasi che contengono informazioni o possono ricondursi a SENSITIVE_INFO di NAME? Fornisci il documento con sole frasi che non hanno niente a che fare con SENSITIVE_INFO di NAME. Leggendo il documento non devo essere in grado di ricostruire alcun'informazione relativa a SENSITIVE_INFO di NAME. Usa l'etichetta [LABEL] per sostituire le porzioni di frase che contengono informazioni relative a SENSITIVE_INFO di NAME." (roughly: "Can you censor all and only the portions of sentences that contain information about, or can be traced back to, SENSITIVE_INFO of NAME? Provide the document with only sentences that have nothing to do with SENSITIVE_INFO of NAME. By reading the document, I must not be able to reconstruct any information related to SENSITIVE_INFO of NAME. Use the label [LABEL] to replace the portions of sentences that contain information related to SENSITIVE_INFO of NAME.")

The OpenLLaMa model, being a much smaller (13 billion parameters) and less capable model, required a few-shot learning approach to obtain the best results. A predefined set of sentences, each with corresponding labels, is incorporated into a prompt tailored to the type of sensitive data to be labeled. Subsequently, the document to be labeled is tokenized into sentences, maintaining a format consistent with the provided examples (a sketch of this procedure is given at the end of this subsection). This approach proves effective in guiding the model both to comprehend the nuances of sensitive data and to adhere to a programmatically exploitable format for document labeling.

Supplementary documents, consisting of paragraphs extracted from Wikipedia and covering specific categories related to sensitive data, were also included in the dataset. The addition of text addressing sensitive topics without disclosing sensitive information about individuals (e.g., general articles about politics, illnesses, etc.) was aimed at enhancing the robustness of the models. In particular, this strategy helps prevent models from incorrectly associating the vocabulary of sensitive topics with the actual disclosure of sensitive information.

As a comparison, our dataset generated with OpenLLaMa comprises 26,821 data points when split at the sentence level, largely exceeding the dataset proposed by Gambarelli et al. [16], which contains 5,562 sentences in its fine-grained version. In particular, our open dataset features 370 documents related to the categories health and sexuality, 191 judicial, 96 political, 132 philosophical, 134 ethnic, 638 non-sensitive, and 490 of mixed categories, for a total of 2,051 documents. The dataset is freely available for download, along with a detailed description of its structure, at https://github.com/SimoDR/sensitive-data-detection.
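The few-shot labeling procedure for OpenLLaMa can be sketched as follows. The example sentences, category tags, sentence splitter, and output format are illustrative placeholders; the exact prompts used in our experiments are not reproduced here.

```python
# Sketch of the few-shot labeling prompt for OpenLLaMa, following the description
# above. Example sentences, category names and output format are illustrative assumptions.
import re

# Hypothetical few-shot examples for the "health" category: each pair maps an
# input sentence to the same sentence with sensitive spans replaced by [SALUTE].
FEW_SHOT_EXAMPLES = {
    "salute": [
        ("Mario Rossi è stato ricoverato per una polmonite grave.",
         "Mario Rossi è stato ricoverato per [SALUTE]."),
        ("Il paziente non presenta allergie note.",
         "Il paziente non presenta allergie note."),
    ],
}

def split_sentences(document: str) -> list[str]:
    """Naive sentence splitter; a proper Italian tokenizer could be used instead."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def build_labeling_prompt(sentence: str, category: str) -> str:
    """Prepend the labeled examples, then ask the model to label a new sentence
    in exactly the same format."""
    lines = []
    for original, labeled in FEW_SHOT_EXAMPLES[category]:
        lines.append(f"Frase: {original}\nEtichettata: {labeled}")
    lines.append(f"Frase: {sentence}\nEtichettata:")
    return "\n\n".join(lines)

document = "Luigi Bianchi soffre di diabete di tipo 2. Vive a Padova da dieci anni."
for sent in split_sentences(document):
    prompt = build_labeling_prompt(sent, "salute")
    print(prompt)
    # The prompt would then be passed to the same text-generation pipeline shown
    # earlier; the returned completion is parsed to recover the labeled spans.
```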
3.2. Sensitive data detection

We tested three different classification models, each based on a different variation of the basic BERT architecture. Due to the imbalanced distribution in the training data, where over 70% of tokens correspond to non-sensitive spans, we employed a weighted softmax loss for all classification models. This approach assigns higher weights to the sensitive data classes, mitigating the inherent bias in favor of the majority class, as discussed in [20].

To evaluate the models, a test set composed of 50 documents generated with BingAI was created. Notably, the test dataset was built to also include document types that were not present in the training set, to further test the robustness of the sensitive data detection models. The results were evaluated in terms of precision, recall and F1 scores on the categories of sensitive data in the task of span detection. The evaluation metrics are based on the NER evaluation methodology of Segura-Bedmar et al. [21], where individual tokens serve as the unit for counting True Positives, False Negatives, and False Positives. We also tested the BingAI model as a zero-shot detection model, i.e., we prompted it to perform NER on a document without providing any examples.

3.2.1. Sentence Classification (SC)

We used BERT as a text classifier, where each sentence is classified into one of the six sensitive categories plus a non-sensitive one. This corresponds to a multi-class text classification task, where each sentence serves as a distinct data point in the dataset. As discussed in Section 2, however, determining whether a sentence is sensitive or not also depends on the context in which the sentence is embedded.

3.2.2. Sentence Classification with Context (SCC)

To address the limitations of the SC model, in this version we included contextual information from the surrounding text along with each sentence to improve the classification task. Therefore, as input to this model we used two chunks of text: the one to be classified and the surrounding text, forming the context. They are separated by the special token [SEP], here used to help BERT distinguish between the two chunks. Notably, the chunks are of fixed length, thereby obviating the need for sentence tokenization. The context also adheres to a predetermined length, ensuring consistency across the training examples. To generate the training examples for the SCC model, a sliding window approach is used. By using a stride, the training examples are partially overlapping, effectively introducing a form of data augmentation. Although this approach resolves the issue encountered in the SC model by incorporating contextual information within each chunk, the sliding window requires the model to perform inference on a significantly larger number of inputs, limiting its computational efficiency. This limitation led us to treat the task as a token classification problem instead of a sequence classification problem.

3.2.3. Named Entity Recognition (NER)

This approach involves the identification and categorization of significant information, known as named entities, within a given text. By classifying each token and identifying consecutive tokens with the same label, we can concatenate them to form spans that represent specific categories. In this case, we used the BERT model with a linear layer that performs classification for each token, using a softmax function to determine the most probable label for each token. For labeling, we adopted a variation of the BIO format, in which tokens are tagged as either B (beginning), I (inside), or O (outside) of an entity [22]. In our implementation, we do not use the B tag, as the frequency of chunk beginnings is relatively low compared to tokens inside and outside of chunks. We also split the documents into fixed-length chunks with a specified stride. This augments the data and allows the model to focus on shorter paragraphs within the text, as opposed to processing the entire document. In the final dataset, each token is thus assigned a label in the format "I-" followed by one of the six sensitive categories, or "O" for the non-sensitive one.
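The token-level classifier, combined with the class-weighted softmax loss introduced at the beginning of this section, could be set up along the following lines. This is a minimal sketch assuming an Italian BERT checkpoint and illustrative label names and class weights; the exact checkpoint, weights, and hyperparameters used in our experiments are not reported here.

```python
# Minimal sketch of the token-classification setup with a class-weighted softmax loss.
# The checkpoint id, label set and class weights are illustrative assumptions.
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "I-HEALTH", "I-POLITICAL", "I-SEXUALITY",
          "I-JUDICIAL", "I-PHILOSOPHY", "I-ETHNIC"]
label2id = {l: i for i, l in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-italian-cased",
    num_labels=len(LABELS),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

# Down-weight the dominant "O" class (over 70% of tokens) relative to the sensitive
# classes; in practice the weights would be derived from the label frequencies.
class_weights = torch.tensor([0.3] + [1.0] * (len(LABELS) - 1))
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

def training_step(batch):
    """One training step: forward pass, then weighted loss over the token logits."""
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"])
    logits = outputs.logits                      # (batch, seq_len, num_labels)
    loss = loss_fn(logits.view(-1, len(LABELS)), batch["labels"].view(-1))
    return loss
```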
4. Results

The first two rows of Table 1 present the performance metrics of the three classification models when trained either on the proprietary dataset or on the open synthetic dataset. All classifiers significantly outperformed the "Zero-shot BingAI" model, and the NER model achieved superior performance compared to the other classifiers on both datasets. This might be attributed to the fact that the training dataset for the NER model incorporates documents that were not specifically generated and labeled by BingAI. In this context, a single generic prompt was used for BingAI, in contrast with the labeling stage, where each category of sensitive data had a distinct, personalized prompt. It is also worth mentioning that, if reference examples had been provided to BingAI as part of the prompt, the results might have been considerably better. However, in this experimental setting our objective was to evaluate the zero-shot capabilities of the model as an out-of-the-box tool.

Table 1
Performance comparison between the detection models trained on the BingAI and OpenLLaMa generated datasets. Precision, recall and F1 are weighted averages. The first two rows refer to the span-level task, the last two to the document-level task.

                          SC                      SCC                     NER                     BingAI (Zero-shot)
Training dataset      Prec  Recall F1         Prec  Recall F1         Prec  Recall F1         Prec  Recall F1
BingAI (span)         0.611 0.651  0.631      0.663 0.673  0.668      0.815 0.690  0.735      0.642 0.332  0.437
OpenLLaMa (span)      0.649 0.700  0.663      0.620 0.710  0.654      0.734 0.728  0.731      0.642 0.332  0.437
BingAI (document)     0.717 0.837  0.770      0.769 0.911  0.820      0.929 0.914  0.921      0.667 1.000  0.789
OpenLLaMa (document)  0.621 0.937  0.744      0.669 1.000  0.791      0.915 0.906  0.910      0.667 1.000  0.789

The lower quality of the OpenLLaMa dataset results in a slightly lower detection accuracy, although the difference is almost negligible. The graph in Figure 1 further investigates this issue by comparing how the performance of the NER model changes with different sizes of the artificial training datasets. The lower quality of the OpenLLaMa dataset requires generating a significantly larger amount of artificial samples to achieve a similar classification accuracy (about 2,000 documents vs 860).

Figure 1: Span-level scores (weighted F1) obtained by training the NER model on various size variations of the two datasets, generated with BingAI and with OpenLLaMa (x axis: dataset size, from 0 to 2,500 documents; y axis: weighted F1 score, from 0.40 to 0.75).

The last two rows of Table 1 show the same comparison, but with the metrics computed at the document level. Since our primary goal is to detect whether a document contains sensitive data or not, this task evaluates the models' capability to classify documents into one of the six sensitive classes or the non-sensitive class. In this assessment, each document is assigned one or more labels based on the presence of at least one span corresponding to each sensitive class in its text. Results show that the NER model still significantly outperforms all the other approaches, reaching a weighted F1 score above 90% when trained on either of the artificial datasets.
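For clarity, the document-level labeling rule just described can be sketched as follows, assuming per-token NER predictions as input; class and function names are illustrative.

```python
# Sketch of the document-level labeling rule: a document receives every sensitive
# class for which at least one token (hence at least one span) of that class is
# predicted in its text. Names are illustrative.
from typing import Iterable

def document_labels(token_tags: Iterable[str]) -> set[str]:
    """Collapse per-token NER tags (e.g. 'I-HEALTH', 'O') into document-level labels."""
    found = {tag.split("-", 1)[1] for tag in token_tags if tag != "O"}
    return found if found else {"NON_SENSITIVE"}

# Example: predictions for a short document.
tags = ["O", "O", "I-HEALTH", "I-HEALTH", "O", "I-JUDICIAL"]
print(document_labels(tags))   # {'HEALTH', 'JUDICIAL'}
```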
As a final analysis, we compared the execution time and the throughput of the three classifiers by collecting data from 10 distinct runs. The SC model exhibited the lowest latency in terms of average time per document (2.09 ± 0.15 s) and the highest throughput (0.46 ± 0.03 q/s). The NER model lagged slightly behind, both in terms of average time per document (2.30 ± 0.19 s) and throughput (0.42 ± 0.04 q/s). The slowest model was SCC, both in terms of average time (13.35 ± 1.01 s) and throughput (0.08 ± 0.01 q/s). This evaluation does not include the BingAI solution, since its inference speed is influenced by external factors such as network connection quality and current traffic conditions.

5. Conclusions

This paper introduced a novel approach to identify sensitive data in text documents, in line with the legal foundation of the GDPR. The proposed method relies on LLMs to generate artificial datasets containing sensitive data: our results show that smaller, open source LLMs running in local environments (OpenLLaMa) can produce text of sufficient quality to train classification models that perform nearly as well as those trained on higher-quality data generated by proprietary LLMs (BingAI). Among the considered models, the NER-based model achieved a remarkable performance, with a weighted F1 score above 70% on the span detection task and above 90% on the document classification task.

Future research should explore the performance of more recent open source LLM architectures, which could yield superior performance in text generation and labeling accuracy. In addition, it might be interesting to test the zero-shot learning capabilities of open source LLMs: such an investigation would also allow an assessment of resource utilization, including considerations of speed, model size, and overall performance.

References

[1] European Commission, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance), 2016. URL: https://eur-lex.europa.eu/eli/reg/2016/679/oj.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[3] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding (2019) 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[4] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/ARXIV.2303.08774. arXiv:2303.08774.
[5] X. Geng, H. Liu, OpenLLaMA: An open reproduction of LLaMA, 2023. URL: https://github.com/openlm-research/open_llama, online, last accessed 2023-06-19.
[6] T. Paccosi, A. P. Aprosio, REDIT: A tool and dataset for extraction of personal data in documents of the public administration domain, in: E. Fersini, M. Passarotti, V. Patti (Eds.), Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021, Milan, Italy, January 26-28, 2022, volume 3033 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-3033/paper58.pdf.
[7] Z. Yang, Z. Liang, Automated identification of sensitive data from implicit user specification, Cybersecurity 1 (2018) 13. URL: https://doi.org/10.1186/s42400-018-0011-x. doi:10.1186/S42400-018-0011-X.
[8] G. Wilms, Guide on good data protection practice in research, European University Institute (2019). URL: https://www.eui.eu/documents/servicesadmin/deanofstudies/researchethics/guide-data-protection-research.pdf, online, last accessed 2023-11-24.
[9] G. Williams, I. Pigeot, Consent and confidentiality in the light of recent demands for data sharing, Biometrical Journal 59 (2017) 240–250.
[10] C. Borgerud, E. Borglund, Open research data, an archival challenge?, Archival Science 20 (2020) 279–302.
[11] F. Lorè, P. Basile, A. Appice, M. de Gemmis, D. Malerba, G. Semeraro, An AI framework to support decisions on GDPR compliance, J. Intell. Inf. Syst. 61 (2023) 541–568. URL: https://doi.org/10.1007/s10844-023-00782-4. doi:10.1007/S10844-023-00782-4.
[12] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification research, in: J. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Machine Learning: ECML 2004, 15th European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, Springer, 2004, pp. 217–226. URL: https://doi.org/10.1007/978-3-540-30115-8_22. doi:10.1007/978-3-540-30115-8_22.
[13] H. Mao, X. Shuai, A. Kapadia, Loose tweets: an analysis of privacy leaks on twitter, in: Y. Chen, J. Vaidya (Eds.), Proceedings of the 10th annual ACM workshop on Privacy in the electronic society, WPES, ACM, 2011, pp. 1–12. URL: https://doi.org/10.1145/2046556.2046558. doi:10.1145/2046556.2046558.
[14] A. G. Pablos, N. Pérez, M. Cuadros, Sensitive data detection and classification in Spanish clinical text: Experiments with BERT, CoRR abs/2003.03106 (2020). URL: https://arxiv.org/abs/2003.03106.
[15] M. Petrolini, S. Cagnoni, M. Mordonini, Automatic detection of sensitive data using transformer-based classifiers, Future Internet 14 (2022) 228. URL: https://doi.org/10.3390/fi14080228. doi:10.3390/fi14080228.
[16] G. Gambarelli, A. Gangemi, R. Tripodi, Is your model sensitive? SPEDAC: A new resource for the automatic classification of sensitive personal data, IEEE Access 11 (2023) 10864–10880. URL: https://doi.org/10.1109/ACCESS.2023.3240089. doi:10.1109/ACCESS.2023.3240089.
[17] C. Pearson, N. Seliya, R. Dave, Named entity recognition in unstructured medical text documents, CoRR abs/2110.15732 (2021). URL: https://arxiv.org/abs/2110.15732. arXiv:2110.15732.
[18] F. Karl, A. Scherp, Transformers are short text classifiers: A study of inductive short text classifiers on benchmarks and real-world datasets, CoRR abs/2211.16878 (2022). URL: https://doi.org/10.48550/arXiv.2211.16878. doi:10.48550/ARXIV.2211.16878. arXiv:2211.16878.
[19] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with ChatGPT, CoRR abs/2302.11382 (2023). URL: https://doi.org/10.48550/arXiv.2302.11382. doi:10.48550/arXiv.2302.11382. arXiv:2302.11382.
[20] H. Zhu, Y. Yuan, G. Hu, X. Wu, N. Robertson, Imbalance robust softmax for deep embedding learning, in: Proceedings of the Asian Conference on Computer Vision, 2020.
[21] I. Segura-Bedmar, P. Martínez, M. Herrero-Zazo, SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013), in: M. T. Diab, T. Baldwin, M.
Baroni (Eds.), Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, The Association for Computer Linguistics, 2013, pp. 341–350. URL: https://aclanthology.org/S13-2056/.
[22] L. A. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: D. Yarowsky, K. Church (Eds.), Third Workshop on Very Large Corpora, VLC@ACL 1995, 1995. URL: https://aclanthology.org/W95-0107/.