<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text De-Identification?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Veysel Kocaman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammed Santas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yigit Gul</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehmet Butgul</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Talby</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John Snow Labs Inc.</institution>
          <addr-line>16192 Coastal Highway, Lewes, DE 19958</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <volume>11</volume>
      <issue>59</issue>
      <abstract>
        <p>We systematically assess the performance of three leading API-based de-identification systems (Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o) against our own de-identification system on a ground truth dataset of 48 clinical documents annotated by medical experts. Our analysis, conducted at both entity-level and token-level, demonstrates that our solution, Healthcare NLP, achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%). Beyond accuracy, Healthcare NLP is also the most cost-effective solution, reducing processing costs by over 80% compared to Azure and GPT-4o. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice. Our results underscore a critical limitation: zero-shot commercial APIs fail to meet the accuracy, adaptability, and cost-efficiency required for regulatory-grade clinical de-identification. Healthcare NLP's superior performance, customization capabilities, and economic advantages position it as the more viable solution for healthcare organizations seeking compliance and scalability in clinical NLP workflows.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Electronic Health Records (EHRs) are now widespread across the United States healthcare
system, with adoption rates surpassing 96% in acute care hospitals and 86% among office-based
physicians [
        <xref ref-type="bibr" rid="ref1 ref19">1</xref>
        ]. Although structured data, such as billing and claims information, constitutes
a substantial component of EHRs, a significant proportion of clinical information remains in
unstructured formats, including progress notes, discharge summaries, radiology reports, and
pathology reports. This unstructured data contains valuable contextual details essential for
comprehensive patient care. Its secondary use in research has gained increasing importance,
with potential benefits in areas such as population health management, real-world evidence
generation, patient safety enhancements, and drug discovery. However, processing unstructured
data poses substantial ethical and technical challenges. The inherent variability of free-text
documentation complicates efforts to preserve privacy, as sensitive patient information is
frequently embedded within clinical narratives.
      </p>
      <p>
        Given the highly sensitive nature of this data, it must undergo a de-identification process
before use. De-identification involves removing or obscuring protected health information (PHI)
from medical records to protect patient privacy. De-identified data refers to health information
that has been stripped of all “direct identifiers”, that is, elements that could uniquely identify an
individual. The Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor
guidelines define 18 such direct identifiers (U.S. Department of Health &amp; Human Services, 2023) [
        <xref ref-type="bibr" rid="ref2 ref20">2</xref>
        ], though any additional data points capable of uniquely identifying a patient must also be
considered. The federally regulated HIPAA Privacy Rule outlines two primary methods for
de-identifying PHI: Expert Determination and Safe Harbor.
      </p>
      <p>
        Recent studies suggest that deep learning-based automated de-identification models can
surpass human annotators in identifying PHI, with hybrid approaches demonstrating the
greatest potential [
        <xref ref-type="bibr" rid="ref21 ref3">3</xref>
        ]. Once the de-identification criteria for a specific dataset have been
established, advanced technologies can be employed to automate the detection of protected
health information (PHI) in both structured and unstructured data. The combination of machine
learning techniques and sophisticated Natural Language Processing (NLP) algorithms has
markedly enhanced the capacity to identify and flag PHI across various data formats. To
streamline the de-identification process, researchers can utilize Large Language Models (LLMs),
specialized NLP models, and cloud provider APIs for processing extensive clinical datasets.
However, the task of handling ambiguous or novel instances of identifiable information remains
challenging, necessitating continuous improvement of these automated tools to strike a balance
between efficiency and the nuanced interpretation required in healthcare settings. It is worth
noting that while LLMs offer powerful capabilities, their application in de-identifying sensitive
data (PHI) may be considered excessive or potentially unreliable for certain use cases, particularly
when a high degree of customization is required. The choice of technology should be carefully
evaluated based on the specific requirements of the de-identification task and the desired level
of precision.
      </p>
      <p>This study compares the performance of de-identification services, namely the Healthcare
NLP library developed by us, AWS Comprehend Medical, and Azure Health Data
Services, with a focus on their accuracy when applied to a dataset annotated by healthcare
experts. The comparison provides insight into their respective
strengths and limitations, enabling informed decision-making for researchers, developers, and
organizations seeking appropriate de-identification tools. It also
equips stakeholders with the information needed to select the most suitable tool based
on accuracy, compliance, cost-effectiveness, and scalability for processing sensitive healthcare
data.</p>
      <p>
        For researchers, this analysis helps identify the most accurate, reliable, and cost-effective
service for processing sensitive data, which is crucial for maintaining data integrity in clinical
studies. Developers benefit from understanding the ease of integration and API flexibility of
each service, essential factors for building scalable solutions that can handle large volumes of
clinical data [
        <xref ref-type="bibr" rid="ref22 ref4">4</xref>
        ]. Organizations, especially in the healthcare and finance sectors, gain valuable
insights into the compliance capabilities and performance of these tools, ensuring that the
chosen solution aligns with regulatory requirements while enhancing operational efficiency.
      </p>
      <p>The comparison highlights variations in performance among the evaluated services. Our
Healthcare NLP library achieved the highest accuracy, with macro and weighted average
F1-scores of 96% and 99%, respectively, followed by Azure Health Data Services with 85% macro
and 99% weighted average F1-scores, and AWS Comprehend Medical with 80% macro and
98% weighted average F1-scores. However, performance may vary based on specific use cases
and dataset characteristics. Additionally, a cost analysis for processing one million clinical
notes (each containing 5,250 characters) revealed that the Healthcare NLP library is the most
cost-effective option, followed by Azure Health Data Services and AWS Comprehend Medical.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        The de-identification of unstructured data has been extensively studied, with various Natural
Language Processing (NLP) approaches proposed over the years [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. This process can be
divided into two main subtasks: first, identifying Protected Health Information (PHI) within the
text, and second, replacing those identifiers through either masking (substituting them with
placeholder values) or obfuscation (replacing them with randomly generated values based on
their type). Among these, the task of PHI identification has been the primary focus of research
[
        <xref ref-type="bibr" rid="ref22 ref4">4</xref>
        ].
      </p>
      <p>
        Early de-identification systems in the clinical domain were predominantly rule-based, as seen
in the work of Sweeney [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Gupta et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These systems relied on regular expressions,
syntactic rules, and specialized dictionaries to detect PHI in text. While rule-based approaches
are effective in identifying structured PHI elements such as phone numbers, email addresses,
and license numbers, they struggle with more complex entities, including personal names,
professions, and hospital names [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Although effective in specific contexts, rule-based systems
often exhibit limited generalizability across diverse datasets. They typically require
substantial modifications to their underlying dictionaries and rule sets when applied to new
environments, hindering their adaptability and scalability in varied clinical settings.
      </p>
      <p>
        The field of automated PHI detection and de-identification has seen significant advancements
in recent years, with several major cloud providers and specialized services offering solutions
to address the growing need for secure handling of sensitive healthcare data. The concept of
automatic de-identification gained prominence in 2014 through the Informatics for Integrating
Biology and the Bedside (i2b2) project, which introduced a pioneering academic NLP challenge
focused on automatically detecting PHI identifiers from medical records [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This initiative
accelerated research and development of Machine Learning and Deep Learning algorithms for
robust PHI identification, laying the groundwork for more sophisticated approaches that are
now being implemented by major cloud service providers.
      </p>
      <p>
        Recent research suggests that deep learning-based automated de-identification models can
surpass human annotators in PHI identification, with hybrid approaches demonstrating the
greatest potential [
        <xref ref-type="bibr" rid="ref21 ref3">3</xref>
        ]. In the current landscape, several key players have emerged with offerings
designed to streamline the process of PHI detection and de-identification. Several studies have
conducted performance comparisons of PHI detection systems, providing valuable insights
into the effectiveness of various de-identification approaches. These comparisons are
crucial for researchers and healthcare organizations seeking to implement efficient and accurate
de-identification processes while maintaining data utility for secondary use in research and
analytics.
      </p>
      <p>
        A notable study by Steinkamp et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] evaluated five publicly available de-identification
tools on a large corpus of narrative-text radiology reports. The research assessed token-level
recall, precision, and F1 scores for each tool across various PHI subcategories. The study found
that machine learning systems outperformed rule-based systems, with the best-performing
system (NeuroNER) achieving a token-level F1 score of 93.6%. However, this performance was
still below the acceptable level for clinical use (95% recall) on sensitive categories of PHI.
      </p>
      <p>
        Recent advancements in Large Language Models (LLMs) have prompted researchers to
investigate their potential for de-identifying clinical notes. A study by Altalla et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
compared the de-identification performance of GPT-3.5 and GPT-4, revealing GPT-4’s superior
capabilities in this domain. The study, published on January 31, 2025, reported that GPT-4
achieved remarkable results with a precision of 0.9925, recall of 0.8318, F1 score of 0.8973, and
accuracy of 0.9911, significantly outperforming its predecessor, GPT-3.5.
      </p>
      <p>
        Despite these promising results, the application of LLMs for de-identification presents several
challenges. The nascent stage of LLM utilization in this field raises concerns regarding the
privacy and security of health data, particularly when employing API-based models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Moreover, LLMs may encounter difficulties in striking a balance between effective de-identification
and preserving the clinical utility of notes, potentially altering non-sensitive information crucial
for research and analysis [
        <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The variation in performance across different datasets highlights
the need for continued development to achieve consistent and reliable results across diverse
clinical settings.
      </p>
      <p>This study aims to contribute to previous performance comparisons in PHI entity recognition
and assist researchers and decision-makers in selecting the most suitable tool for processing
large-scale datasets with high accuracy and cost-effectiveness. To achieve this, we compare
four widely used and advanced de-identification tools that incorporate state-of-the-art models
while ensuring consistency: our Healthcare NLP library, Azure Health Data Services, AWS
Comprehend Medical, and GPT-4o, a state-of-the-art commercial multi-modal LLM.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1. The Deidentification Solutions</title>
        <p>In this section, we briefly describe each de-identification solution. Each solution supports
a different set of PHI entities; the list of PHI entities supported by each model is shared in Table
A2.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Healthcare NLP &amp; LLM Library</title>
          <p>
            The Healthcare NLP library is a powerful component of the Spark NLP platform [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], specifically
designed to facilitate NLP tasks within the healthcare domain [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. This library offers over
2,500 pre-trained models and pipelines tailored for medical data, enabling accurate information
extraction, named entity recognition (NER) for clinical and medical concepts, and robust text
analysis capabilities. Regularly updated with advanced algorithms, it helps healthcare
professionals derive meaningful insights from unstructured medical data sources such as electronic
health records, clinical notes, and biomedical literature.
          </p>
          <p>Additionally, the library features custom large language models (LLMs) in various sizes and
quantization levels for tasks like medical note summarization, question answering,
retrieval-augmented generation (RAG), and healthcare-related conversational interactions. It also
provides a robust solution for de-identifying medical records using advanced NER models to
automatically detect and remove PHI from clinical notes. This ensures compliance with privacy
regulations while preserving data utility for research, enabling secure data sharing, enhancing
patient privacy, and promoting innovation in medical research.</p>
          <p>
            The Healthcare NLP library allows users to create custom de-identification pipelines
targeting specific labels or to utilize pre-trained pipelines with two lines of code to de-identify a
broad range of entities. These entities include AGE, CONTACT, DATE, ID, LOCATION, NAME,
PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD,
ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN,
DLN, PLATE, IPADDR, EMAIL, and more. In “Beyond Accuracy: Automated De-Identification of
Large Real-World Clinical Text Datasets” [
            <xref ref-type="bibr" rid="ref22 ref4">4</xref>
            ], the de-identification process is explained in detail,
describing the implementation of a hybrid context-based model architecture for automated
clinical note processing.
          </p>
          <p>In this study, a pre-trained de-identification pipeline was utilized, specifically designed to
extract and de-identify entities such as NAME, IDNUM, CONTACT, LOCATION, AGE, and DATE.
Notably, this pipeline operates independently of any large language model (LLM) components.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Azure Health Data Services</title>
          <p>Azure Health Data Services’ de-identification service is designed to safeguard sensitive health
information while maintaining data utility. This API employs advanced natural language
processing techniques to identify, label, redact, or surrogate PHI in unstructured medical
texts. The service provides three essential operations: Tag, Redact, and Surrogate, which allow
healthcare organizations to process various types of clinical documents securely and efficiently.
By utilizing machine learning algorithms, the service can detect HIPAA’s 18 identifiers and
other PHI entities, ensuring compliance with various regional privacy regulations such as GDPR
and CCPA.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Amazon Comprehend Medical</title>
          <p>Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that
leverages machine learning to extract valuable health data from unstructured medical text. This
tool quickly and accurately identifies medical entities such as conditions, medications, dosages,
tests, treatments, and Protected Health Information (PHI) from various clinical documents
including physician’s notes, discharge summaries, and test results. With its ability to understand
context and relationships between extracted information, Amazon Comprehend Medical offers a
robust solution for healthcare professionals and researchers looking to automate data extraction,
improve patient care, and streamline clinical workflows.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. OpenAI GPT-4o for Deidentification</title>
          <p>GPT-4o is a multi-modal model that offers improvements in response times and classification
accuracy compared to GPT-4, which could in principle improve the precision of identifying and
redacting sensitive information via prompting. While GPT-3.5 and GPT-4 have been extensively
studied for their de-identification capabilities, particularly in processing medical text, GPT-4o
presents an intriguing option due to its enhanced performance over GPT-4 in various tasks.
However, no formal study has yet evaluated GPT-4o’s de-identification capabilities, so its
effectiveness in this task remains speculative. Given the
importance of PHI redaction in healthcare AI applications, understanding the model’s strengths
and limitations in this area remains crucial.
While there are cost-effective alternatives for de-identification, we opted for GPT-4o due to its
widespread adoption, strong presence in research, and its demonstrated advancements over
previous models.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>
          The annotation of patient identifiers within clinical data is a critical process in healthcare
research and data management. This study employed a comprehensive annotation methodology
utilizing the John Snow Labs’ Annotation Lab software, which facilitated a multi-stage approach
to entity recognition and labeling. The process began with a pre-annotation step using deep
learning models to extract initial entities, followed by human refinement guided by a dynamic
annotation guide. This iterative approach, involving multiple rounds of review and correction,
ensured high accuracy and adaptability throughout the fine-tuning and evaluation phases [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>The dataset employed in this study comprised 48 clinical notes meticulously annotated by
our domain experts. The dataset was specifically curated to facilitate the evaluation of
de-identification systems in a healthcare context. Expert annotations focused on six key entity
types: IDNUM, LOCATION, DATE, AGE, NAME, and CONTACT. These entities represent critical
categories of Protected Health Information (PHI) that are commonly subject to de-identification
under regulatory frameworks such as the Health Insurance Portability and Accountability Act
(HIPAA) and the General Data Protection Regulation (GDPR).</p>
        <p>The selection of these entity types was motivated by their frequent occurrence in clinical
narratives and their significance in ensuring patient privacy. Identifiers such as patient names,
contact details, and unique ID numbers pose a high risk of re-identification if not properly
anonymized. Similarly, location information, age, and date-related details can contribute to
indirect re-identification, necessitating robust de-identification strategies. By centering the
benchmark on these entities, this study ensures that the performance evaluation remains directly
aligned with real-world de-identification challenges in healthcare settings.</p>
        <p>
          To enhance reproducibility, the benchmark dataset utilized in this study has been made
publicly available in a dedicated repository [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This ensures transparency and facilitates
further research in the field of healthcare de-identification.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Comparison of the Solutions</title>
        <p>The most significant difference between these tools lies in their adaptability. Azure Health Data
Services, Amazon Comprehend Medical, and GPT-4o are API-based, black-box cloud solutions,
which makes it impossible to modify or adapt their results to specific needs. In contrast, the
Healthcare NLP library’s de-identification pipeline can be loaded and used with just two
lines of code. Its outputs can be customized by adjusting the pipeline stages to meet specific
needs, and it can also be run locally with no internet connection.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Evaluation Criteria</title>
          <p>In this benchmark study, we employed two distinct approaches to compare accuracy: an entity-level evaluation and a token-level evaluation.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Entity-Level Evaluation</title>
          <p>Since de-identifying PHI data is a critical task, we evaluated how well de-identification tools
detected entities present in the annotated dataset, regardless of their specific labels in the ground
truth. The detection outcomes were categorized as:</p>
          <list list-type="bullet">
            <list-item><p>full_match: The entire entity was correctly detected.</p></list-item>
            <list-item><p>partial_match: Only a portion of the entity was detected.</p></list-item>
            <list-item><p>not_matched: The entity was not detected at all.</p></list-item>
          </list>
          <p>For example, for the text: “Patient John Doe was admitted to Boston General Hospital on
01/12/2023.”, the ground truth entity “John Doe (NAME)” could have the following predicted
entities:</p>
          <list list-type="bullet">
            <list-item><p>Predicted Entity: “John Doe (NAME)” ==&gt; full_match</p></list-item>
            <list-item><p>Predicted Entity: “John” ==&gt; partial_match</p></list-item>
            <list-item><p>Predicted Entity: “Patient” ==&gt; not_matched</p></list-item>
          </list>
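          <p>The three outcomes can be illustrated with a minimal sketch. It assumes that entities are compared as character-offset spans and that a ground truth entity counts as fully matched when a single predicted span covers it; the span representation and the function name classify_match are illustrative assumptions, not the evaluation code used in this study.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def classify_match(truth: Span, predictions: List[Span]) -> str:
    """Label one ground-truth span against all predicted spans, ignoring entity labels."""
    t_start, t_end = truth
    overlaps = False
    for p_start, p_end in predictions:
        if p_start <= t_start and p_end >= t_end:
            return "full_match"  # one prediction covers the whole entity
        if p_start < t_end and p_end > t_start:
            overlaps = True  # a prediction covers only part of the entity
    return "partial_match" if overlaps else "not_matched"


text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."


def span_of(fragment: str) -> Span:
    """Locate the first occurrence of a fragment in the example sentence."""
    begin = text.index(fragment)
    return (begin, begin + len(fragment))


truth = span_of("John Doe")
full = classify_match(truth, [span_of("John Doe")])
partial = classify_match(truth, [span_of("John")])
missed = classify_match(truth, [span_of("Patient")])
print(full, partial, missed)
```
]]></preformat>
          <p>Under these assumptions, the three predictions from the example sentence fall into the full_match, partial_match, and not_matched categories, respectively.</p>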
          <p>For evaluation results, please refer to Figure A1 in the Appendix section.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Token-Level Accuracy</title>
          <p>The text in the annotated dataset was tokenized, and the ground truth labels assigned to
each token were compared with predictions made by the Healthcare NLP library, Amazon
Comprehend Medical, Azure Health Data Services, and the GPT-4o model. Classification reports
were generated for each tool, comparing their precision, recall, and F1 scores. Token-level
evaluation results are presented in Figure A2 in the Appendix section.</p>
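          <p>As a rough illustration of how such a classification report can be derived, the sketch below computes per-label precision, recall, and F1 from two aligned token label sequences, together with macro and weighted averages. The label O for non-PHI tokens, the inclusion of O in the averages, and the toy sequences are assumptions made for the example; they are not taken from the benchmark.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, List


def token_report(truth: List[str], pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-label precision, recall, and F1 from aligned token label sequences."""
    assert len(truth) == len(pred), "sequences must be aligned token by token"
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted where it does not belong
            fn[t] += 1  # true label t was missed
    report = {}
    for label in sorted(set(truth) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual}
    return report


def macro_f1(report: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-label F1 scores."""
    return sum(row["f1"] for row in report.values()) / len(report)


def weighted_f1(report: Dict[str, Dict[str, float]]) -> float:
    """Mean of per-label F1 scores weighted by the number of true tokens."""
    total = sum(row["support"] for row in report.values())
    return sum(row["f1"] * row["support"] for row in report.values()) / total


# Toy example: "Patient John Doe was admitted on 01/12/2023 ."
truth = ["O", "NAME", "NAME", "O", "O", "O", "DATE", "O"]
pred = ["O", "NAME", "O", "O", "O", "O", "DATE", "O"]  # "Doe" is missed

report = token_report(truth, pred)
macro = macro_f1(report)
weighted = weighted_f1(report)
print(report["NAME"], macro, weighted)
```
]]></preformat>
          <p>In this toy case the weighted average is higher than the macro average because the frequent O label dominates the weighting, which is one reason weighted scores can sit close together while macro scores differ more between tools.</p>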
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Methodology</title>
        <p>In this study, differences were observed between the predictions generated by the
de-identification services and the ground truth annotations. The ground truth dataset utilized
generic entity labels; for instance, all names were annotated as NAME, rather than
distinguishing between PATIENT_NAME and DOCTOR_NAME. To ensure consistency in evaluation,
the predicted labels from the de-identification tools were mapped to their corresponding ground
truth labels.</p>
        <p>To maintain a fair comparison, entities that did not have a direct mapping to the ground truth
labels (such as PROFESSION, ORGANIZATION, and other non-essential entity types) were
excluded from the predictions before conducting the performance evaluation. This preprocessing
step ensured that the assessment focused solely on the six critical entity types relevant to
healthcare de-identification. The entity mapping across the different
providers is shown in Table A3. After obtaining the model predictions and applying the
preprocessing steps, the entity distribution was summarized in Table A5. While evaluating
GPT-4o, we used a one-shot prompt that provided the model with sample PHI entity extraction
tasks (the prompt is shared in the Appendix). The model was configured with a temperature
of 1 and executed as a single run, while all other parameters were maintained at their default
settings to ensure consistency in evaluation.</p>
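        <p>The mapping and filtering step described above can be sketched as follows. The mapping dictionary is a hypothetical excerpt written for illustration only; the mapping actually used is the one reported in Table A3, and the function name normalize is likewise an assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

# The six ground-truth entity types used in the benchmark.
GROUND_TRUTH_LABELS = {"IDNUM", "LOCATION", "DATE", "AGE", "NAME", "CONTACT"}

# Hypothetical excerpt of a tool-to-ground-truth mapping (not the study's Table A3).
LABEL_MAP: Dict[str, str] = {
    "PATIENT": "NAME",
    "DOCTOR": "NAME",
    "PHONE": "CONTACT",
    "EMAIL": "CONTACT",
    "CITY": "LOCATION",
    "MEDICALRECORD": "IDNUM",
}

Entity = Tuple[str, str]  # (entity text, label)


def normalize(predictions: List[Entity]) -> List[Entity]:
    """Map tool-specific labels to ground-truth labels and drop unmapped entities."""
    normalized = []
    for text, label in predictions:
        mapped = LABEL_MAP.get(label, label)  # labels already generic pass through
        if mapped in GROUND_TRUTH_LABELS:
            normalized.append((text, mapped))
    return normalized


raw = [
    ("John Doe", "PATIENT"),
    ("Dr. Smith", "DOCTOR"),
    ("nurse", "PROFESSION"),  # no ground-truth counterpart, so it is excluded
    ("01/12/2023", "DATE"),
]
clean = normalize(raw)
print(clean)
```
]]></preformat>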
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Performance Evaluation</title>
        <p>The final results can be found in Table 1. The entity-level and token-level evaluations, including
comparative analyses and benchmark scores, can be found in the Appendix.</p>
        <p>The primary objective of de-identification is to accurately detect PHI entities. In this regard,
we also wanted to evaluate binary classification performance in which entities were classified
as either PHI or non-PHI, disregarding specific subcategories. The PHI entity detection results
are also summarized in Table 1 and Figure 2.</p>
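        <p>A minimal sketch of this binary view is given below. It assumes token-level labels in which O marks non-PHI tokens, collapses every other label into a single PHI class, and scores only that class; the helper names and toy labels are illustrative assumptions rather than the study's implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List


def to_binary(labels: List[str]) -> List[bool]:
    """True for any PHI label, False for the assumed non-PHI label 'O'."""
    return [label != "O" for label in labels]


def phi_scores(truth: List[str], pred: List[str]) -> Dict[str, float]:
    """Precision, recall, and F1 for the PHI class, ignoring subcategories."""
    t, p = to_binary(truth), to_binary(pred)
    tp = sum(1 for a, b in zip(t, p) if a and b)
    fp = sum(1 for a, b in zip(t, p) if not a and b)
    fn = sum(1 for a, b in zip(t, p) if a and not b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


truth = ["O", "NAME", "NAME", "O", "DATE"]
# The second token has the wrong subcategory but still counts as detected PHI;
# the DATE token is missed entirely.
pred = ["O", "LOCATION", "NAME", "O", "O"]

scores = phi_scores(truth, pred)
print(scores)
```
]]></preformat>
        <p>The example shows why binary scores can exceed per-label scores: a token assigned the wrong PHI subcategory is still redacted, so it counts as a correct detection in this view.</p>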
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Cost Estimation for De-identifying Clinical Data</title>
        <p>Cost is a critical factor when processing large-scale clinical datasets. To estimate expenses, we
simulated the cost of de-identifying 1 million unstructured clinical notes, each averaging 5,250
characters.</p>
        <p>The pricing estimates are as follows:</p>
        <list list-type="bullet">
          <list-item><p>Amazon Comprehend Medical: Processing 1M documents costs approximately $14,525.</p></list-item>
          <list-item><p>Azure Health Data Services: Processing 1M documents costs approximately $13,125.</p></list-item>
          <list-item><p>OpenAI GPT-4o: Processing 1M documents costs approximately $21,400.</p></list-item>
          <list-item>
            <p>Healthcare NLP: Using John Snow Labs’ Healthcare NLP Prepaid on an EC2 c6a.8xlarge
instance ($1.2/hour), de-identifying PHI from 48 documents took 39.4 seconds.
Extrapolating, processing 1M documents would take approximately 228 hours (9.5 days), but
with proper scaling, it could be completed in a single day. The total estimated cost:</p>
            <list list-type="simple">
              <list-item><p>Infrastructure: $273</p></list-item>
              <list-item><p>License: $2,145 (if the one-month license cost is set to $7,000)</p></list-item>
              <list-item><p>Total: $2,418</p></list-item>
            </list>
          </list-item>
        </list>
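        <p>The Healthcare NLP estimate can be reproduced with a short calculation. The sketch below extrapolates linearly from the measured 48-document run and prorates the monthly license over the processing time; the 31-day month is an assumption chosen because it reproduces the reported license figure, and the variable names are illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Figures reported in the cost estimate above.
SAMPLE_DOCS = 48
SAMPLE_SECONDS = 39.4
TARGET_DOCS = 1_000_000
EC2_HOURLY_USD = 1.2
MONTHLY_LICENSE_USD = 7_000
API_COSTS_USD = {
    "Amazon Comprehend Medical": 14_525,
    "Azure Health Data Services": 13_125,
    "OpenAI GPT-4o": 21_400,
}

DAYS_PER_MONTH = 31  # assumption: proration basis that matches the reported $2,145

# Linear extrapolation from the 48-document run on a single instance.
hours = SAMPLE_SECONDS / SAMPLE_DOCS * TARGET_DOCS / 3600
days = hours / 24

infrastructure_usd = hours * EC2_HOURLY_USD
license_usd = MONTHLY_LICENSE_USD * days / DAYS_PER_MONTH
total_usd = infrastructure_usd + license_usd

# Relative savings of the local deployment against each API-based service.
savings = {name: 1 - total_usd / cost for name, cost in API_COSTS_USD.items()}

print(f"{hours:.0f} hours, {days:.1f} days")
print(f"infrastructure ${infrastructure_usd:,.0f}, license ${license_usd:,.0f}")
for name, fraction in savings.items():
    print(f"{name}: {fraction:.0%} lower")
```
]]></preformat>
        <p>Under these assumptions the calculation yields roughly 228 hours of single-instance processing and a total close to the reported $2,418, which is more than 80% below each of the three API-based estimates.</p>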
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we conducted a comparative analysis of the performance of Healthcare NLP,
Amazon Comprehend Medical, Azure Health Data Services, and the OpenAI GPT-4o model on a
ground truth dataset annotated by medical experts. The evaluation was performed at two levels:
entity-level and token-level.</p>
      <p>The entity-level analysis demonstrated that Healthcare NLP outperformed its counterparts in
accurately capturing entities while minimizing missed detections. Azure Health Data Services
exhibited the second-best performance, followed by Amazon Comprehend Medical. The GPT-4o
model ranked fourth in this comparative assessment.</p>
      <p>The token-level evaluation further reinforced these findings, with Healthcare NLP achieving
the highest precision, recall, and F1-score. Azure Health Data Services, Amazon
Comprehend Medical and GPT-4o followed in that order, indicating a consistent pattern of superior
performance for Healthcare NLP across both evaluation metrics.</p>
      <p>A key differentiator among these tools is their adaptability. While Azure Health Data Services,
Amazon Comprehend Medical, and GPT-4o function as API-based, black-box cloud solutions with
no customization capabilities, Healthcare NLP provides a flexible and transparent framework.
Its de-identification pipeline can be implemented with minimal coding effort, and users can
modify pipeline stages to tailor the output to their specific requirements.</p>
      <p>From a cost-effectiveness perspective, Healthcare NLP emerges as the most viable solution
for large-scale clinical data processing. Unlike cloud-based services, which impose per-request
pricing that escalates with increasing data volumes, Healthcare NLP allows for fixed-cost, local
deployment. Even when processing substantial datasets, such as one billion clinical notes, its
pricing remains stable over the same time period, providing a significant economic advantage
over API-based alternatives.</p>
      <p>In summary, Healthcare NLP consistently outperformed Azure Health Data Services, Amazon
Comprehend Medical, and GPT-4o across all evaluation metrics by 5-10%, achieving the highest
accuracy while minimizing missed detections. Beyond its superior performance, its adaptability
offers a crucial advantage over the black-box nature of cloud solutions, enabling users to
customize de-identification pipelines to meet specific needs. Furthermore, its cost-effective
deployment model presents substantial savings, making it a compelling alternative to API-based
solutions.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix</title>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Type</th><th>Description</th><th>Example Text</th></tr>
          </thead>
          <tbody>
            <tr><td>Original</td><td>The original text with identifiable information</td><td>He is a 60-year-old male.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-6-4">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Tool</th><th>Entities De-identified</th><th>Key Features</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>Healthcare NLP Library</td>
                <td>AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR, EMAIL</td>
                <td>Highly flexible; the de-identification pipeline can be easily loaded with two lines of code and customized to meet specific requirements. Additionally, it can be used locally.</td>
              </tr>
              <tr>
                <td>Azure Health Data Services</td>
                <td>DATE, DOCTOR, HOSPITAL, IDNUM, PATIENT, MEDICALRECORD, PHONE, AGE, STREET, STATE, CITY, HEALTHPLAN, PROFESSION, ZIP, EMAIL, ORGANIZATION, USERNAME, FAX, URL, LOCATIONOTHER, ACCOUNT, COUNTRYORREGION, SOCIALSECURITY</td>
                <td>API-based, black-box solution; no direct control over results; suitable for integrated, cloud-based environments but lacks flexibility for task-specific adjustments.</td>
              </tr>
              <tr>
                <td>AWS Comprehend Medical</td>
                <td>DATE, NAME, ADDRESS, ID, AGE, PHONE_OR_FAX, PROFESSION, URL, EMAIL</td>
                <td>API-based, black-box solution; de-identification is limited to specific pre-configured models; lacks customization and flexibility for adapting results to specific needs.</td>
              </tr>
              <tr>
                <td>GPT-4o</td>
                <td>No pre-built set of entities</td>
                <td>API-based, black-box solution; de-identification is run via prompting.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-6-4-1">
          <title>Evaluation Results</title>
          <p>The results obtained by comparing the predictions made by Healthcare NLP, AWS Comprehend
Medical, and Azure Health Data Services with the ground truth entities are presented below.</p>
          <p>Figure A1: Entity Level Evaluation</p>
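          <p>One plausible way to compare predicted spans with ground-truth entities is to label each ground-truth entity as a full match, a partial match, or not matched, the categories reported in Table A4. The sketch below is an assumption about how such a comparison could be implemented, not the evaluation script used in this study: the function names are hypothetical, entity labels are ignored, and spans are treated as character offsets with an inclusive end.</p>
          <preformat>
```python
def classify_match(gold, predictions):
    """Classify one gold span against a list of predicted spans.

    Spans are (begin, end) character offsets with an inclusive end.
    Returns "full" if a prediction has identical boundaries,
    "partial" if a prediction overlaps the gold span, else "none".
    """
    g_begin, g_end = gold
    overlap = False
    for p_begin, p_end in predictions:
        if (p_begin, p_end) == (g_begin, g_end):
            return "full"
        # Two inclusive ranges overlap when neither ends before the other starts.
        if g_end >= p_begin and p_end >= g_begin:
            overlap = True
    return "partial" if overlap else "none"


def match_statistics(gold_spans, predictions):
    """Count match types over all gold spans and attach percentages."""
    counts = {"full": 0, "partial": 0, "none": 0}
    for gold in gold_spans:
        counts[classify_match(gold, predictions)] += 1
    total = len(gold_spans) or 1  # avoid division by zero on empty input
    return {k: (v, round(100.0 * v / total, 1)) for k, v in counts.items()}


# Hypothetical spans, for illustration only.
gold = [(24, 30), (42, 45), (50, 67)]
pred = [(24, 30), (42, 48)]
print(match_statistics(gold, pred))
```
          </preformat>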
          <p>To further analyze the performance of each de-identification tool, a token-level evaluation
was conducted. This involved tokenizing the ground truth text and associating each token with
the corresponding predicted labels from Healthcare NLP, Amazon Comprehend Medical, Azure
Health Data Services and GPT-4o.</p>
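          <p>A token-level comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: it uses a whitespace tokenizer, assigns each token the label of the first span it overlaps, and computes micro-averaged precision, recall, and F1 over PHI tokens. The sentence, the spans, and all function names are hypothetical.</p>
          <preformat>
```python
import re


def tokenize(text):
    """Whitespace tokenizer that keeps character offsets (inclusive end)."""
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r"\S+", text)]


def token_labels(text, entities, outside="O"):
    """Give each token the label of the first entity span it overlaps."""
    labels = []
    for _, t_begin, t_end in tokenize(text):
        label = outside
        for begin, end, entity_type in entities:
            if t_end >= begin and end >= t_begin:
                label = entity_type
                break
        labels.append(label)
    return labels


def token_prf(gold_labels, pred_labels, outside="O"):
    """Micro-averaged precision, recall and F1 over PHI tokens."""
    tp = fp = fn = 0
    for g, p in zip(gold_labels, pred_labels):
        if p != outside and p == g:
            tp += 1
        else:
            if p != outside:
                fp += 1  # predicted PHI with a wrong or spurious label
            if g != outside:
                fn += 1  # gold PHI token that was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical sentence and annotations, for illustration only.
text = "Seen on 2/16/69 by Dr. Hale at Queen Creek."
gold = [(8, 14, "DATE"), (23, 26, "NAME"), (31, 41, "LOCATION")]
pred = [(8, 14, "DATE"), (23, 26, "NAME"), (31, 35, "LOCATION")]

precision, recall, f1 = token_prf(token_labels(text, gold), token_labels(text, pred))
print(precision, recall, f1)
```
          </preformat>
          <p>In this toy example the prediction misses the token "Creek.", so precision is 1.0 and recall is 0.75.</p>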
          <p>Figure A2: Token Level Evaluation</p>
          <table-wrap>
            <label>Table A4</label>
            <caption>
              <p>Match Statistics for Healthcare NLP, Azure, AWS, and GPT-4o Predictions. The table shows the number
of matches and their corresponding percentages for the different prediction models.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Match Type</th><th>Healthcare NLP</th><th>Azure</th><th>AWS</th><th>GPT-4o</th></tr>
              </thead>
              <tbody>
                <tr><td>Full Match</td><td>1342 (90.7%)</td><td>1258 (85.0%)</td><td>1108 (74.9%)</td><td>983 (66.5%)</td></tr>
                <tr><td>Partial Match</td><td>124 (8.4%)</td><td>164 (11.1%)</td><td>219 (14.8%)</td><td>280 (18.9%)</td></tr>
                <tr><td>Not Matched</td><td>13 (0.9%)</td><td>57 (3.8%)</td><td>152 (10.3%)</td><td>216 (14.6%)</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-6-7">
        <p>You are an expert medical annotator with extensive experience in labeling medical entities within clinical
texts. Your role is to accurately identify and annotate Protected Health Information (PHI) entities in the
provided text, following the specified entity types.</p>
        <preformat>– IDNUM, LOCATION, DATE, AGE, NAME, CONTACT
[
{{'begin': 24, 'end': 30, 'entity_type': 'DATE', 'chunk': '2/16/69'}}
{{'begin': 42, 'end': 45, 'entity_type': 'NAME', 'chunk': 'Hale'}}
{{'begin': 50, 'end': 67, 'entity_type': 'LOCATION', 'chunk': 'Senior Care Clinic'}}
{{'begin': 69, 'end': 79, 'entity_type': 'LOCATION', 'chunk': 'Queen Creek'}}
{{'begin': 83, 'end': 84, 'entity_type': 'LOCATION', 'chunk': 'SD'}}
{{'begin': 96, 'end': 105, 'entity_type': 'NAME', 'chunk': 'Terri Bird'}}
]
—
Task:
Extract all PHI entities from the text below. The entity types to identify are: IDNUM, LOCATION, DATE, AGE, NAME, CONTACT.

Expected Output Format:
{ entities:[
{'begin': &lt;start_index&gt;, 'end': &lt;end_index&gt;, 'entity_type': '&lt;entity_type&gt;', 'chunk': '&lt;extracted_text&gt;'}
] }
—
Text to Annotate:
{text}
—

Your Response:</preformat>
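        <p>Because the prompt asks the model for character offsets, a post-processing check is a natural companion, since offsets returned by a generative model may not line up with the text. The sketch below is an assumption about how such output could be validated, not part of the authors' pipeline. It follows the convention visible in the prompt examples, where the end offset is inclusive (for instance, begin 24 and end 30 for the seven-character chunk 2/16/69). The sentence, the model output, and the function names are hypothetical.</p>
        <preformat>
```python
def validate_entities(text, entities):
    """Split entities into those whose offsets reproduce the chunk and those that do not.

    The end offset is inclusive, so the chunk is text[begin:end + 1].
    """
    valid, invalid = [], []
    for ent in entities:
        span = text[ent["begin"]:ent["end"] + 1]
        (valid if span == ent["chunk"] else invalid).append(ent)
    return valid, invalid


def realign(text, ent):
    """Fallback: relocate a chunk by exact string search (first occurrence only)."""
    start = text.find(ent["chunk"])
    if start == -1:
        return None  # the chunk does not occur in the text at all
    return dict(ent, begin=start, end=start + len(ent["chunk"]) - 1)


# Hypothetical sentence and model output, for illustration only.
text = "Seen on 2/16/69 by Dr. Hale."
entities = [
    {"begin": 8, "end": 14, "entity_type": "DATE", "chunk": "2/16/69"},
    {"begin": 22, "end": 25, "entity_type": "NAME", "chunk": "Hale"},  # off by one
]

valid, invalid = validate_entities(text, entities)
fixed = [realign(text, ent) for ent in invalid]
print(valid)
print(fixed)
```
        </preformat>
        <p>String search only recovers the first occurrence of a chunk, so repeated mentions of the same name or date would need a more careful alignment strategy.</p>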
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Myrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ogburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <article-title>Percentage of office-based physicians using any electronic health record (ehr)/electronic medical record (emr) system and physicians that have a certified ehr/emr system, by us state: National electronic health records survey</article-title>
          ,
          <year>2017</year>
          , National Center for Health Statistics (
          <year>2019</year>
          )
          <fpage>2021</fpage>
          -
          <lpage>04</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>U.S. Department of Health &amp; Human Services</collab>
          ,
          <article-title>Guidance regarding methods for deidentification of protected health information in accordance with the health insurance portability and accountability act (hipaa) privacy rule</article-title>
          ,
          <source>Web Page</source>
          ,
          <year>2023</year>
          . URL: https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html,
          accessed on October 24
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Negash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Neilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nesca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Enns</surname>
          </string-name>
          ,
          <article-title>De-identification of free text data containing personal health information: a scoping review of reviews</article-title>
          ,
          <source>International Journal of Population Data Science</source>
          <volume>8</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kocaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Talby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. U.</given-names>
            <surname>Hak</surname>
          </string-name>
          ,
          <article-title>Rwd143 beyond accuracy: Automated de-identification of large real-world clinical text datasets</article-title>
          ,
          <source>Value in Health</source>
          <volume>26</volume>
          (
          <year>2023</year>
          )
          <fpage>S532</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Nadkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <article-title>Natural language processing: an introduction</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>18</volume>
          (
          <year>2011</year>
          )
          <fpage>544</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Khin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Burckhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Padman</surname>
          </string-name>
          ,
          <article-title>A deep learning architecture for de-identification of patient notes: Implementation and evaluation</article-title>
          , arXiv preprint arXiv:1810.01570 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>Replacing personally-identifying information in medical records, the scrub system</article-title>
          , in:
          <source>Proceedings of the AMIA annual fall symposium</source>
          , American Medical Informatics Association,
          <year>1996</year>
          , p.
          <fpage>333</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilbertson</surname>
          </string-name>
          ,
          <article-title>Evaluation of a deidentification (de-id) software engine to share pathology reports and clinical documents for research</article-title>
          ,
          <source>American journal of clinical pathology</source>
          <volume>121</volume>
          (
          <year>2004</year>
          )
          <fpage>176</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>De-identification of clinical notes via recurrent neural network and conditional random field</article-title>
          ,
          <source>Journal of biomedical informatics</source>
          <volume>75</volume>
          (
          <year>2017</year>
          )
          <fpage>S34</fpage>
          -
          <lpage>S42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <article-title>Evaluating the state-of-the-art in automatic de-identification</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>14</volume>
          (
          <year>2007</year>
          )
          <fpage>550</fpage>
          -
          <lpage>563</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Steinkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pomeranz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adleberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Kahn</surname>
            <suffix>Jr</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <article-title>Evaluation of automated public de-identification tools on a corpus of radiology reports</article-title>
          ,
          <source>Radiology: Artificial Intelligence</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <elocation-id>e190137</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Altalla'</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abdalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Altamimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bitar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al Omari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sultan</surname>
          </string-name>
          ,
          <article-title>Evaluating gpt models for clinical note de-identification</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>3852</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shu</surname>
          </string-name>
          , et al.,
          <article-title>Deidgpt: Zero-shot medical text de-identification by gpt-4</article-title>
          , arXiv preprint arXiv:2303.11032 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mohammed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>De-identification is not enough: a comparison between de-identified and synthetic clinical notes</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>29669</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kocaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Talby</surname>
          </string-name>
          ,
          <article-title>Spark nlp: natural language understanding at scale</article-title>
          ,
          <source>Software Impacts</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>100058</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kocaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Talby</surname>
          </string-name>
          ,
          <article-title>Accurate clinical and biomedical named entity recognition at scale</article-title>
          ,
          <source>Software Impacts</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>100373</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Loftsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kulynych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kaabachi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Raisaro</surname>
          </string-name>
          ,
          <article-title>Accelerating clinical text annotation in underrepresented languages: A case study on text de-identification</article-title>
          , in:
          <source>Digital Health and Informatics Innovations for Sustainable Health Care Systems</source>
          , IOS Press,
          <year>2024</year>
          , pp.
          <fpage>853</fpage>
          -
          <lpage>857</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] JohnSnowLabs,
          <article-title>De-identification benchmark ground truth dataset</article-title>
          ,
          <year>2024</year>
          . URL: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/academic/DeIdentification_Benchmarks_Text2Story2025/deidentification_benchmark_ground_truth_48_doc.csv, accessed: March 9,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>1. Review the Text: Carefully read the text to understand its medical context.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>2. Identify PHI Entities: Locate any terms or phrases that represent PHI, based on the following entity types:</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>3. Annotate Entities: For each identified PHI, provide the start and end character indices, the entity type, and the exact text (chunk) of the entity.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>4. Response Format: Return the annotations in a structured JSON format, as demonstrated in the examples below.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>