<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy-Preserving Data Integration for Health: Adhering to OMOP-CDM Standard</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lisa Trigiante</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Beneventano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The digital transformation of health processes and the resulting availability of vast amounts of health-related data about patients offer significant promise to advance multiple medical research projects and enhance both the public and private healthcare systems. Exploiting the full potential of this vision requires a unified representation of different autonomous data sources to facilitate detailed data analysis. To this end, OMOP-CDM has emerged as the de facto standard for organizing healthcare data from diverse sources. However, collecting and processing sensitive data about individuals raises privacy requirements and confidentiality concerns. Privacy-Preserving Data Integration (PPDI) is the process of establishing a unified view of personal data across multiple data sources while protecting the privacy of individuals represented in the underlying data. This discussion paper offers a concise overview of the research field related to PPDI, highlighting associated challenges and opportunities within the healthcare domain. In particular, it delves into the specific research challenges encountered by the PPDI process alongside the utilization of OMOP-CDM, with particular attention directed towards the Schema Alignment phase and the classification of data based on identifiability and privacy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The digitization of legal, administrative, and healthcare processes, among many others, has
generated vast amounts of data describing people and their behavior. The resulting person-related
Big Data presents substantial intrinsic worth and holds considerable potential to feed multiple
research areas with the aim of enhancing the human condition. Achieving this vision requires an
efficient Data Integration (DI) process, enabling users to access a unified and consistent view of
diverse data sources. However, the integration of personal data is limited by ethical and privacy
concerns. The European General Data Protection Regulation (GDPR) bases the classification of
data content on the concepts of identifiability and privacy:
• Non-Sensitive Data: denotes attributes that contain neither identifying information nor
information which deserves protection (e.g. metadata).</p>
      <p>
        Privacy-Preserving Data Integration (PPDI) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is the process aimed at providing a unified and
accurate representation of personal information across multiple heterogeneous data sources
while preventing privacy disclosure of individuals represented in the underlying data.
      </p>
      <p>
        The GDPR leads toward the adoption of specific techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to prevent internal parties
involved in the PPDI process, as well as external adversaries, from identifying a
specific individual, an attack known as Re-identification.
      </p>
      <p>
        Our research in the field of PPDI has encompassed concrete application projects, such as the
design and development of a Proof of Concept (PoC) for the Criminal Data Warehouse project,
establishing a PPDI process across Italian legal data sources to assess the recidivism phenomena
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, from the design stage we intended to accommodate different application scenarios
and not to tailor solutions to the justice domain. To this end, our collaboration with the Health
Departments of the Emilia Romagna region has underscored the challenges inherent in
privacy-preserving processing of health-related data [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], emphasizing the necessity for a standard and
comprehensive PPDI framework. Pursuing this line, the European Health Data Evidence Network
(EHDEN) aims to promote the adoption of the Observational Medical Outcomes Partnership
Common Data Model (OMOP-CDM) in Europe. In furtherance of this objective, our research
group is participating in the European ARISTOTELES project, which includes a research strand
focused on the PPDI process adherent to OMOP-CDM standards.
      </p>
      <p>
        This discussion paper aims to delineate our current and future research efforts directed toward
the creation of a novel and comprehensive PPDI framework within the EHDEN ecosystem.
• Section 2 provides a concise overview of the PPDI process devised for our framework (a
more in-depth discussion was addressed in [
        <xref ref-type="bibr" rid="ref5 ref6">6, 5</xref>
        ]).
• In Section 3 we present the OMOP-CDM standard and discuss the major advantages
and drawbacks, particularly concerning privacy issues that have not been adequately
addressed in the literature.
• Section 4 proposes a primary contribution to overcome some of these challenges,
exploiting a semantic-based tool to classify schema elements into QID and SPI, facilitating the
schema alignment between local sources and OMOP-CDM and making it possible to optimize the
trade-off between privacy and utility.
• Finally, in Section 5, we conclude and provide insights for future directions and
developments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Comprehensive PPDI Framework</title>
      <p>
        In this section, we outline the methodology and architectural approach devised to support the
creation of a novel and comprehensive PPDI framework. The idea behind the PPDI framework is
an incremental extension of the MOMIS (Mediator envirOnment for Multiple Information Sources)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] Data Integration system toward a Trusted Third-Party (TTP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] microservice architecture,
including specific software modules to realize PPDI in compliance with the GDPR. As shown in
Fig. 1, the TTP will serve as the PPDI Domain to provide the Consumer Domain with a unified
and privacy-preserving representation of the different autonomous data sources within the
Source Domain.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Privacy-Preserving Data Integration process</title>
        <p>The PPDI process usually involves three steps:
• Schema Alignment resolves inconsistencies at the schema level by finding the semantic
correspondences among the schemas of the Local Sources and producing an integrated
Global Schema. Furthermore, within the privacy context, this step involves the
classification of data based on identifiability and privacy. The sets of PII and SPI are typically
considered disjoint in a PPDI process and undergo distinct procedures.
• Privacy-Preserving Record Linkage (PPRL) resolves inconsistencies at the tuple level
by identifying records about the same individuals from different sources. PPRL can be
viewed as a classification problem that labels pairs of records across different sources
as a match (i.e. two records refer to the same individual) or a non-match. To this end,
PII undergoes specific pseudonymization techniques to facilitate record linkage while
preventing re-identification.</p>
        <p>
          Pseudonymization [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is the process of replacing PII with a pseudonym (or encrypted code),
to allow further processing.
• Data Fusion resolves inconsistencies at the value level by fusing duplicate entries from
different sources and creating a unique record for each individual. Data Fusion is aimed
at increasing the conciseness and consistency of data that are made available to users
and applications to facilitate data analysis. To this end, the outcome of PPDI is in plain
format and therefore includes only the SPI, since PII and QID could enable
re-identification.
        </p>
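        <p>The PPRL and Data Fusion steps above can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization technique and exact matching on pseudonyms as the linkage rule; both are illustrative choices rather than the techniques adopted in the framework, and the key, field names, and records are hypothetical.</p>
        <preformat>
```python
import hashlib
import hmac

# Hypothetical secret; in this sketch it is assumed to be shared by the data sources only.
SECRET_KEY = b"demo-key-for-illustration-only"


def pseudonymize(pii_values, key=SECRET_KEY):
    """Replace a sequence of PII values with one keyed-hash pseudonym."""
    normalized = "|".join(value.strip().lower() for value in pii_values)
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_and_fuse(source_a, source_b, pii_fields, spi_fields):
    """Link records that share a pseudonym and keep only SPI in the fused output."""
    fused = {}
    for record in list(source_a) + list(source_b):
        pseudonym = pseudonymize([record[field] for field in pii_fields])
        merged = fused.setdefault(pseudonym, {})
        for field in spi_fields:
            if field in record:
                # Keep the first value seen; real fusion needs conflict resolution.
                merged.setdefault(field, record[field])
    return list(fused.values())


hospital = [{"name": "Ada Rossi", "birth_date": "1980-02-01", "diagnosis": "asthma"}]
registry = [{"name": " ada rossi ", "birth_date": "1980-02-01", "therapy": "inhaler"}]
result = link_and_fuse(hospital, registry, ["name", "birth_date"], ["diagnosis", "therapy"])
print(result)  # [{'diagnosis': 'asthma', 'therapy': 'inhaler'}]
```
        </preformat>
        <p>Exact matching on a keyed hash tolerates only the normalization applied before hashing (here, trimming and lowercasing), so typographical variants of the same name would not be linked.</p>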
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Privacy and usability trade-off</title>
        <p>
          In real-world privacy scenarios, with any information disclosure, there is always some privacy
loss, and with any masking (or pseudonymization) technique, there is always some information
loss. A central issue for privacy-preserving approaches is to ensure the optimal
trade-off between maximizing the utility of the data to be disclosed (which is equivalent to
minimizing information loss) and maximizing privacy protection. For instance, one of the key
dimensions for assessing the usefulness of data sharing is de-duplication (also known as record linkage).
On the other hand, the evaluation of privacy is one of the biggest impediments in a PPDI
process as it represents the resistance to re-identification attacks and depends on aspects that
are complex to quantify, such as the nature of the data involved and the publicly available
information [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Some methods have been proposed to pursue this objective [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]; nevertheless,
they tend to concentrate only on PPRL. For these reasons, the definition of a set of standard
measures for the empirical evaluation of the trade-off between privacy and usability of data
is still a developing area of the literature that requires careful consideration. From our
perspective, PPDI needs to be approached as a whole, especially in addressing privacy.
        </p>
        <p>This discussion paper delves into the specific research challenges encountered by the PPDI
process alongside the utilization of OMOP-CDM, with particular attention directed toward the
privacy and usability trade-off.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. OMOP Common Data Model</title>
      <p>The advent of Big Health Data (BHD) has led to an upsurge in the need for methods to effectively
manage their information content and offer a unified view to enable efficient analysis. The
intrinsic aspects of BHD require careful consideration and impose strict demands on the data
resulting from the PPDI process concerning completeness, consistency, interoperability, and
scalability over time.</p>
      <p>The Observational Health Data Sciences and Informatics (OHDSI) program proposed the
OMOP-CDM to standardize the structure and content of health data and to enable efficient
analyses that can produce reliable evidence. A central component of the OMOP-CDM is the
OMOP standardized vocabularies which allow organization and standardization of medical
terms. OMOP-CDM plays a crucial role in addressing the challenges of data heterogeneity and
interoperability among disparate healthcare systems by facilitating consistency, compatibility,
and efficiency of the integration process. Moreover, OMOP-CDM addresses scalability challenges
by accommodating large datasets and allowing for the independent addition of new sources,
thereby empowering the management of vast amounts of health data with high performance
and reliability. For these reasons, EHDEN launched a program aimed at promoting the adoption
of OMOP/OHDSI in Europe, addressing the challenges in generating insights and evidence
from real-world clinical data on a large scale. The project’s goal is to assist patients, clinicians,
regulators, governments, and the industry in understanding well-being, disease, treatments, and
outcomes, as well as new therapeutics and novel devices. Due to this initiative, the OMOP-CDM
has been widely adopted across various healthcare systems, research institutions, and data
repositories worldwide, and now constitutes a vast repository of health data for observational
studies and evidence-based research. The literature concerning the procedure to harmonize
data with respect to OMOP-CDM encompasses a diverse range of data types, including but not
limited to electronic health records (EHRs) [12], claims datasets [13], registries [14], and clinical
trials [15]. Within such literature, the mapping process to ensure standardized representation
and compatibility with OMOP-CDM can be summarized in the following three phases:
1. Vocabulary mapping: the process of mapping elements from a local data source
(especially medical terms) to an appropriate standard concept defined within the OMOP
vocabularies.
2. Data tables mapping: the process of aligning the structure and semantics of the local
data source with the standardized tables and fields defined in the OMOP-CDM.
3. Extract-Transform-Load (ETL): the process of extracting local data
and transforming them according to the mapping rules defined in the previous
steps. Finally, local data are loaded into the OMOP-CDM-compliant database.</p>
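      <p>The three phases above can be sketched on a toy example. The local column names, the lookup tables, and the concept identifiers below are illustrative assumptions, and an in-memory dictionary stands in for the OMOP-CDM-compliant database; a real conversion relies on the full OMOP vocabularies and on many more tables.</p>
      <preformat>
```python
# Phase 1: vocabulary mapping (local term -> standard concept id, illustrative values).
VOCABULARY_MAP = {"M": 8507, "F": 8532}

# Phase 2: data tables mapping (local column -> target table, target field, transform).
TABLE_MAP = {
    "patient_code": ("person", "person_source_value", str),
    "sex": ("person", "gender_concept_id", lambda v: VOCABULARY_MAP.get(v, 0)),
    "birth_year": ("person", "year_of_birth", int),
}


def etl(local_rows):
    """Phase 3: extract local rows, transform them with the rules above, load the result."""
    target = {}
    for row in local_rows:
        transformed = {}
        for column, value in row.items():
            if column not in TABLE_MAP:
                continue  # unmapped columns are not loaded in this sketch
            table, field, transform = TABLE_MAP[column]
            transformed.setdefault(table, {})[field] = transform(value)
        for table, record in transformed.items():
            target.setdefault(table, []).append(record)
    return target


local = [{"patient_code": 17, "sex": "F", "birth_year": "1980", "phone": "555-0100"}]
omop = etl(local)
print(omop)
# {'person': [{'person_source_value': '17', 'gender_concept_id': 8532, 'year_of_birth': 1980}]}
```
      </preformat>
      <p>Keeping the mapping rules in a declarative table, rather than coding them only inside the ETL step, makes them inspectable before any data are moved; in this sketch the unmapped phone column is simply never loaded.</p>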
      <p>One of the main drawbacks, nevertheless, is that in the numerous works that undertake
health database mapping to OMOP [12, 13, 14], the data tables mapping phase is often performed
manually and/or coded only in the ETL stage. This negatively affects the trade-off between
privacy and usability of the overall process.</p>
      <sec id="sec-3-1">
        <title>3.1. Privacy and usability trade-off in OMOP/OHDSI ecosystem</title>
        <p>The vocabulary mapping process is extremely difficult, time-consuming, and mostly conducted
manually by domain experts. To facilitate this human-in-the-loop process, OHDSI provides some
tools. The most important one is Usagi, a vocabulary mapping tool that uses
probabilistic algorithms to suggest mappings between local source terminologies and standard
vocabularies to domain experts. One of the major drawbacks of Usagi is its exclusive reliance
on a probabilistic algorithm based on syntactic matching. This results in limited accuracy,
particularly with ambiguous terms and complex relationships, along with linguistic dependence,
challenges in adapting to domain-specific vocabularies, and scalability issues. This aspect strongly
affects the utility of the resulting data. Different research programs have been established to
improve Usagi’s performance. Deep learning-based methods have been shown to outperform both
Usagi and previous simple word-level matching algorithms. However, their main limitation lies
in the need for a large and accurate training set, as the presence of negative training
samples significantly affects the outcomes. Other researchers extended the vocabulary mapping
to different languages [16] through automatic translation methods. However, challenges remain
in translating nonstandard expressions and resolving abbreviations into full names. To overcome
the limitations of these methods, we are developing techniques that will be briefly presented in
Section 4.2. For instance, [17] discusses how to translate multilingual nonstandard expressions
and resolve abbreviations.</p>
        <p>The focus of this article is indeed on the privacy issues that arise when transforming data into
OMOP-CDM. One of the main privacy challenges encountered by the PPDI process, alongside
the utilization of OMOP-CDM, concerns the fact that mapping large amounts of data to the
OMOP-CDM raises significant concerns about protecting QID; as clinical terminologies expand
to include new terms that may capture QID, institutions may inadvertently start using them
in clinical data ETL processes. This can potentially put institutions and patients at risk if not
addressed. The OHDSI consortium strongly cautions against this during the ETL process, as
certain vocabularies may contain terms that represent phone numbers, emails, and other QID
information, rather than clinical observations [18]. This highlights the importance of carefully
considering the potential risks and implementing appropriate safeguards when mapping health
data to the OMOP-CDM. To address this challenge, we envisaged a method to semi-automatically
classify OMOP-CDM attributes into QID and SPI, presented in Section 4.1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. OMOP-CDM Privacy-Preserving Framework</title>
      <p>The objective of this section is to present the methodology devised to develop additional
privacy-preserving services to be implemented within our PPDI framework, which is based on the
re-implementation of MOMIS (see Section 2), to enhance the process of harmonizing data
with OMOP-CDM standards (see Section 3). An open-source version of MOMIS is currently
maintained by DataRiver<sup>3</sup>. DataRiver participated in various European projects aligning
different health-related sources across Europe to OMOP-CDM and was certified as an SME by the
IMI EHDEN consortium for its support to healthcare facilities in standardizing health data
according to the OMOP-CDM standard and providing additional services in the EHDEN and
OHDSI ecosystem. This experience has proven important in identifying the main issues of
the OMOP/OHDSI approach in concrete applications. In light of these issues, discussed in Section
4.1, we are investigating techniques to semi-automatically classify QID and SPI and facilitate
the schema alignment according to OMOP-CDM.</p>
      <sec id="sec-4-1">
        <title>4.1. Classification of OMOP-CDM data</title>
        <p>Classifying data according to identifiability in a real-world scenario is a complex task that
strongly impacts the overall trade-off between privacy and data usability. SPI and QID can overlap,
and the combination of attributes identifying an individual may vary from person to person
depending on the rarity of attribute values.</p>
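        <p>The dependence of identifiability on the rarity of attribute values can be made concrete with a small sketch. It counts how many records share each combination of candidate QID values, in the spirit of a k-anonymity group-size check; this measure is an illustrative choice that the paper does not prescribe, and the records and field names are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def qid_group_sizes(records, qid_fields):
    """Count how many records share each combination of QID values."""
    return Counter(tuple(record[f] for f in qid_fields) for record in records)


def records_at_risk(records, qid_fields, k=2):
    """Return the records whose QID combination is shared by fewer than k records."""
    sizes = qid_group_sizes(records, qid_fields)
    return [r for r in records if k > sizes[tuple(r[f] for f in qid_fields)]]


patients = [
    {"year_of_birth": 1980, "zip": "41121", "condition": "asthma"},
    {"year_of_birth": 1980, "zip": "41121", "condition": "diabetes"},
    {"year_of_birth": 1931, "zip": "41121", "condition": "asthma"},
]
at_risk = records_at_risk(patients, ["year_of_birth", "zip"])
print(at_risk)  # [{'year_of_birth': 1931, 'zip': '41121', 'condition': 'asthma'}]
```
        </preformat>
        <p>In this toy data the same two attributes single out only the patient with the rare birth year, while the zip code alone singles out nobody.</p>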
        <p>To address this challenge, we intend to develop a tool adherent to OHDSI principles to
semi-automatically classify OMOP-CDM attributes into QID and SPI. Applied to the
OMOP-CDM attributes and to the terminologies in the main vocabularies, this tool will allow us to
determine whether they are QID or SPI.</p>
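        <p>A minimal sketch of what semi-automatic classification could look like at the schema level is given below. It is a simple keyword rule over attribute names, not the semantic-based tool under development; the hint lists and the REVIEW label for attributes deferred to a domain expert are hypothetical.</p>
        <preformat>
```python
# Hypothetical keyword hints; a real tool would rely on semantic annotation instead.
QID_HINTS = {"birth", "zip", "gender", "race", "ethnicity", "location"}
SPI_HINTS = {"condition", "drug", "procedure", "measurement", "observation"}


def classify_attribute(name):
    """Label an attribute name as QID, SPI, or REVIEW (left to a domain expert)."""
    tokens = set(name.lower().split("_"))
    if tokens.intersection(QID_HINTS):
        return "QID"
    if tokens.intersection(SPI_HINTS):
        return "SPI"
    return "REVIEW"


attributes = [
    "year_of_birth",
    "zip",
    "condition_concept_id",
    "drug_exposure_start_date",
    "provider_id",
]
labels = {name: classify_attribute(name) for name in attributes}
for name, label in labels.items():
    print(name, label)
```
        </preformat>
        <p>Only attribute names are inspected, so the sketch works without access to plain-text data; anything the rules cannot decide is routed to manual review rather than silently labeled.</p>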
        <p>This outcome can be leveraged not only in the mapping of new local sources to OMOP-CDM
but also across all the previously mentioned systems and projects where the transformation into
OMOP-CDM is already underway (see section 2), enabling validation of the ETL process from
a privacy perspective. For instance, the local attributes mapped to OMOP concepts classified
as identifiers will necessitate the consideration of appropriate controls and privacy protection
techniques.
(Footnote 3: DataRiver was originally founded as a spin-off of DBGroup, dbgroup.unimore.it/)</p>
        <p>To investigate the feasibility of our idea and highlight potential issues within the process, we
took as a starting point the already available manually annotated terminologies of PII related
to medical data. Because codes vary widely, it is likely impossible to capture all those that can
potentially contain PII; an initial resource is presented in [18]<sup>4</sup>. Another one is the PPI
(Participant Provided Information) terminology, a standard vocabulary in OMOP related to the
’All of Us’ program, which focuses on collecting health data from a diverse group of participants
[19]. However, the PPI terminology is not linked by any relationship to any other vocabulary.
Furthermore, these resources are not directly linked to OMOP-CDM vocabularies; therefore,
we explored various procedures to match these resources with OMOP-CDM terminologies and
leverage this knowledge to classify the mapped attributes as PII.</p>
        <p>• Initially, we investigated different mapping techniques based on syntactic probabilistic
algorithms. As expected, these yielded poor results, although with the inclusion of specific
data pre-processing approaches they do tend to identify some potentially identifiable
attributes.
• In addition, we conducted experiments with Large Language Models (LLMs). We
tested some of the most renowned open-source LLMs, instructing them with specific
examples to perform the mapping. Some models initially showed acceptable performance
but later exhibited hallucinations regarding column names.
• Subsequently, we focused on techniques specifically developed for classification and
annotation tasks. For instance, we adapted the methodology used in [20] by defining
identifiable and sensitive data classes as labels for annotation. However, the performance
of this approach was limited by the impossibility of accessing plain-text column data.</p>
        <p>Based on these experiments, we consider that a potential approach to achieving
accurate classification may lie in training an LLM instance specifically for this task.
However, these models do not provide high reliability, as they are also based on probabilistic
syntactic principles and fail to capture the semantics of concepts [21]. Therefore, we believe
that employing symbolic, explainable, and semantic-based methods will yield more promising
results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data mapping to OMOP CDM</title>
        <p>When a local source has not already been harmonized to OMOP-CDM,
it is necessary to perform not only the data classification procedure described above, but also
the Vocabulary mapping, Data tables mapping, and ETL process described in Section 3. This
approach can be performed locally by a single source participating in the PPDI framework
outlined in Fig. 1. However, the harmonization process is extremely challenging and has to be
carried out almost entirely manually by domain experts.</p>
        <p>
          From a practical standpoint, the first steps of the harmonization process can be overlaid on
the Schema Matching phase of data integration processes (Section 2). In the majority of data
integration projects, the schema matching phase is implemented following a bottom-up approach,
finding the correspondences between the different schemas of the Local Sources and producing a
unique integrated Global Schema.
(Footnote 4, referenced in Section 4.1: github.com/data2health/next-gen-data-sharing/blob/master/CodesWithPPIPotential.csv)
Within the OMOP/OHDSI ecosystem, the global schema is
represented by OMOP-CDM. Therefore, this phase is carried out using a top-down approach,
aligning each local schema to OMOP-CDM and producing mapping rules to harmonize the
original data. This allows parallelization across multiple local sources and the addition of new
ones, dealing with the scalability and interoperability issues (see Section 2) of the traditional
bottom-up approach. However, within a privacy-preserving context, to prevent privacy disclosure
it is not possible to access the original data in plain format, but only metadata, attribute names,
and their associated descriptions; therefore, only schema-level matching methods can be applied.
The schema-level matching method SMAT [22] is considered a baseline within the OMOP-CDM
context and is based on a deep learning model incorporating NLP techniques. Instead, within the PoC
for the Criminal Data Warehouse project [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] we employed a schema-level matching method
based on annotation.
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Semantic Privacy-Preserving Schema Annotator</title>
          <p>
To date, the re-implementation of MOMIS has consisted of the development of different
microservices to perform semantic-based schema annotation of property names and values, exploiting
the semantic knowledge of multiple thesauri in multiple languages. The system supports
scalability across multiple simultaneous pipelines and configurable thesauri. Among these are the OMOP
standard for the medical field, including all the different standardized vocabularies (such as
SNOMED, ICD10, and several others), and the OMW (Open Multilingual WordNet), offering
good linguistic coverage. Furthermore, 35 translation services are available, supporting all major
languages, together with specific functions for the identification of non-canonical terms [17] that
allow better performance. For example, the term “consenso” is not found in OMOP; however, the
automatic translation “consent” can be correctly annotated. We will provide a more detailed
description of this system in other publications, as it is still in the development phase, but
the initial results suggest that our method achieves better results than SMAT [22].
          </p>
        </sec>
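        <p>The translation fallback described above ("consenso" is not found, but its translation "consent" is) can be sketched as follows. The two dictionaries are toy stand-ins for a thesaurus and a translation service, and the concept labels are hypothetical; the sketch does not reproduce the actual microservices.</p>
        <preformat>
```python
# Toy stand-ins: a real system would query the configured thesauri and translation services.
THESAURUS = {"consent": "concept:consent", "diagnosis": "concept:diagnosis"}
TRANSLATIONS = {"consenso": "consent", "diagnosi": "diagnosis"}


def annotate(term):
    """Annotate a schema term directly, or through its translation when the direct lookup fails."""
    key = term.strip().lower()
    if key in THESAURUS:
        return THESAURUS[key], "direct"
    translated = TRANSLATIONS.get(key)
    if translated in THESAURUS:
        return THESAURUS[translated], "translated"
    return None, "unmatched"


print(annotate("consent"))   # ('concept:consent', 'direct')
print(annotate("Consenso"))  # ('concept:consent', 'translated')
print(annotate("xyz"))       # (None, 'unmatched')
```
        </preformat>
        <p>Returning how the annotation was obtained lets a reviewer treat translated matches with more caution than direct ones.</p>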
        <p>On the other hand, it is also advisable to contemplate scenarios where accessing the local
schema is unfeasible and hence explore the concept of Privacy-Preserving Schema Matching
(PPSM) [23]. It establishes that at the time when the schema matching is conducted, no concrete
information regarding the local data and schema has been released to the PPDI framework. For
this reason, future work will concern the application of PPSM methods [23] to the OMOP-CDM
context, based on the classification and privacy assessment presented in Section 4.1.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Works</title>
      <p>This paper provides an overview of the Privacy-Preserving Data Integration process, which
encompasses numerous challenges, especially in the context of Big Health Data. Section 3 presents
the OMOP-CDM standardized data structure and the process of harmonizing healthcare datasets
to the OMOP-CDM format, and discusses the privacy and usability trade-off of
this process. In light of this, Section 4 presents a solution to address the main privacy issues in
the context of OMOP-CDM, namely the design and development of specific methodologies
and tools, adherent to the principles of OHDSI, to classify data according to identifiability
and privacy and to perform a Privacy-Preserving Schema Matching [23] process according to
OMOP-CDM. This solution is still in the development phase but promises good results in terms
of performance and privacy protection compared to other state-of-the-art systems. From our
perspective, the broad spectrum of tasks and issues in the process of harmonizing data
according to OMOP-CDM, which to the best of our knowledge have not received extensive coverage
in the existing literature, allows for many future developments. One of the main shortcomings
of this process that limits its usefulness is the absence of de-duplication [24]. To this end, a
possible future development is the adaptation of Privacy-Preserving Record Linkage techniques
and Data fusion approaches to perform de-duplication within the OMOP-CDM context.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>We wish to thank all the members of DBGroup. Lisa Trigiante wishes to mention that her PhD
project is funded by MIUR under D.M.351 with the Emilia Romagna region as partner.</p>
      <p>[12] A. Matcho, P. Ryan, D. Fife, C. Reich, Fidelity assessment of a clinical practice research datalink conversion to the omop common data model, Drug safety 37 (2014) 945–959.</p>
      <p>[13] A. Haberson, C. Rinner, A. Schöberl, W. Gall, Feasibility of mapping austrian health claims data to the OMOP common data model, J. Medical Syst. 43 (2019) 314:1–314:5. URL: https://doi.org/10.1007/s10916-019-1436-9. doi:10.1007/S10916-019-1436-9.</p>
      <p>[14] M. Y. Garza, G. D. Fiol, J. D. Tenenbaum, A. Walden, M. Nahm, Evaluating common data models for use with a longitudinal community registry, J. Biomed. Informatics 64 (2016) 333–341. URL: https://doi.org/10.1016/j.jbi.2016.10.016. doi:10.1016/J.JBI.2016.10.016.</p>
      <p>[15] H. Liu, S. Carini, Z. Chen, S. P. Hey, I. Sim, C. Weng, Ontology-based categorization of clinical studies by their conditions, J. Biomed. Informatics 135 (2022) 104235. URL: https://doi.org/10.1016/j.jbi.2022.104235. doi:10.1016/J.JBI.2022.104235.</p>
      <p>[16] A. Chechulina, J. Carus, P. Breitfeld, C. Gundler, H. Hees, R. Twerenbold, S. Blankenberg, F. Ückert, S. Nürnberg, Semi-automated mapping of german study data concepts to an english common data model, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/14/8159. doi:10.3390/app13148159.</p>
      <p>[17] D. Beneventano, S. Bergamaschi, S. Sorrentino, Extending wordnet with compound nouns for semi-automatic annotation in data integration systems, in: Proceedings of the 5th International Conference on Natural Language Processing and Knowledge Engineering, NLPKE 2009, Dalian, China, September 24-27, 2009, IEEE, 2009, pp. 1–8. URL: https://doi.org/10.1109/NLPKE.2009.5313842. doi:10.1109/NLPKE.2009.5313842.</p>
      <p>[18] E. R. Pfaff, M. A. Haendel, K. Kostka, A. Lee, E. Niehaus, M. B. Palchuk, K. Walters, C. G. Chute, Ensuring a safe (r) harbor: Excising personally identifiable information from structured electronic health record data, Journal of Clinical and Translational Science 6 (2022) e10.</p>
      <p>[19] All of Us Research Program Investigators, The “all of us” research program, New England Journal of Medicine 381 (2019) 668–676.</p>
      <p>[20] K. Korini, C. Bizer, Column type annotation using chatgpt, arXiv preprint arXiv:2306.00745 (2023).</p>
      <p>[21] W. Saba, Stochastic LLMs do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMs, 2023, pp. 3–19. doi:10.1007/978-3-031-47262-6_1.</p>
      <p>[22] J. Zhang, B. Shin, J. D. Choi, J. C. Ho, SMAT: an attention-based deep learning solution to the automation of schema matching, in: L. Bellatreche, M. Dumas, P. Karras, R. Matulevicius (Eds.), Advances in Databases and Information Systems - 25th European Conference, ADBIS 2021, Tartu, Estonia, August 24-26, 2021, Proceedings, volume 12843 of Lecture Notes in Computer Science, Springer, 2021, pp. 260–274. URL: https://doi.org/10.1007/978-3-030-82472-3_19. doi:10.1007/978-3-030-82472-3_19.</p>
      <p>[23] T. Amagasa, F. Zhang, J. Sakuma, H. Kitagawa, A scheme for privacy-preserving ontology mapping, in: Proceedings of the 18th International Database Engineering &amp; Applications Symposium, IDEAS ’14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 87–95. URL: https://doi.org/10.1145/2628194.2628232. doi:10.1145/2628194.2628232.</p>
      <p>[24] F. N. Wirth, T. Meurers, M. Johns, F. Prasser, Privacy-preserving data sharing infrastructures for medical research: systematization and comparison, BMC Medical Informatics Decis. Mak. 21 (2021) 242. URL: https://doi.org/10.1186/s12911-021-01602-x. doi:10.1186/S12911-021-01602-X.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Clifton</surname>
          </string-name>
          et al.,
          <article-title>Privacy-preserving data integration and sharing</article-title>
          , in: DMKD, ACM,
          <year>2004</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Luca</given-names>
            <surname>Bolognini</surname>
          </string-name>
          et al.,
          <article-title>Pseudonymization and impacts of big (personal/anonymous) data processing in the transition from the directive 95/46/ec to the new EU general data protection regulation</article-title>
          ,
          <source>Comput. Law Secur. Rev.</source>
          <volume>33</volume>
          (
          <year>2017</year>
          )
          <fpage>171</fpage>
          -
          <lpage>181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Trigiante</surname>
          </string-name>
          et al.,
          <article-title>Privacy-preserving data integration for digital justice</article-title>
          , in: International Conference on Conceptual Modeling, Springer,
          <year>2023</year>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>177</lpage>
          . URL: https://link.springer.com/chapter/10.1007/978-3-031-47112-4_16.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Trigiante</surname>
          </string-name>
          et al.,
          <article-title>Privacy-preserving data integration for health</article-title>
          ,
          <source>31st Symposium on Advanced Database Systems</source>
          (
          <year>2023</year>
          ). URL: https://ceur-ws.org/Vol-3478/paper39.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Trigiante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beneventano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <article-title>[Vision paper] Privacy-preserving data integration</article-title>
          , in:
          <source>2023 IEEE International Conference on Big Data (BigData)</source>
          ,
          <publisher-name>IEEE Computer Society</publisher-name>
          , Los Alamitos, CA, USA,
          <year>2023</year>
          , pp.
          <fpage>5614</fpage>
          -
          <lpage>5618</lpage>
          . URL: https://doi.ieeecomputersociety.org/10.1109/BigData59044.2023.10386703. doi:10.1109/BigData59044.2023.10386703.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Trigiante</surname>
          </string-name>
          ,
          <article-title>Analysis and experimentation of State-of-the-Art Privacy-Preserving Record Linkage techniques in Data Integration environments</article-title>
          ,
          <source>Master's thesis</source>
          , Unimore,
          <year>2022</year>
          . URL: https://dbgroup.ing.unimore.it/publication/TrigianteL_Master_Thesis.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sonia</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          et al.,
          <article-title>Data integration</article-title>
          , in: Handbook of Conceptual Modeling, Springer,
          <year>2011</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schnell</surname>
          </string-name>
          ,
          <article-title>Privacy-preserving record linkage</article-title>
          , in:
          <source>Methodological Developments in Data Linkage</source>
          , John Wiley &amp; Sons,
          <year>2015</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Daochen</given-names>
            <surname>Zha</surname>
          </string-name>
          et al.,
          <article-title>Data-centric AI: perspectives and challenges</article-title>
          ,
          <source>CoRR</source>
          abs/2301.04819 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Anushka</given-names>
            <surname>Vidanage</surname>
          </string-name>
          et al.,
          <article-title>Taxonomy of attacks on privacy-preserving record linkage</article-title>
          ,
          <source>J. Priv. Confidentiality</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Anushka</given-names>
            <surname>Vidanage</surname>
          </string-name>
          et al.,
          <article-title>A vulnerability assessment framework for privacy-preserving record linkage</article-title>
          ,
          <source>ACM Transactions on Privacy and Security</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>