<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Tunisian-Algerian Conference on Applied Computing, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Uterine Ultrasound Image Captioning Using Deep Learning Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdennour Boulesnane</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boutheina Mokhtari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oumnia Rana Segueni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Slimane Segueni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BIOSTIM Laboratory, Faculty of Medicine, Salah Boubnider University</institution>
          ,
          <addr-line>Constantine</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of IFA, Faculty of NTIC, Abdelhamid Mehri University</institution>
          ,
          <addr-line>Constantine</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Obstetrics and Gynecology Clinic</institution>
          ,
          <addr-line>23 Khelifi Abderrahmane Street, Chelghoum Laid, Mila</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Medical imaging has revolutionized medical diagnostics and treatment planning, progressing from early X-ray usage to sophisticated methods like MRIs, CT scans, and ultrasounds. This paper investigates the use of deep learning for medical image captioning, with a particular focus on uterine ultrasound images. These images are crucial in obstetrics and gynecology for diagnosing and monitoring various disorders across diverse age demographics. Nonetheless, their interpretation frequently proves difficult because of their intricacy. In this paper, a deep learning-based medical image interpretation system is developed, which integrates convolutional neural networks with bidirectional recurrent unit networks. This hybrid methodology examines both textual and visual components to produce relevant captions for ultrasound images of the uterus. The experimental findings demonstrate the efficacy of this strategy relative to baseline methods, as indicated by superior BLEU and ROUGE scores. The suggested approach demonstrates superior performance in generating precise and informative captions. Our research enhances the interpretation of uterine ultrasound images, enabling physicians to make prompt and precise diagnoses, thereby elevating patient care.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Image Captioning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Uterine Ultrasound Images</kwd>
        <kwd>Image Interpretation</kwd>
        <kwd>Medical AI</kwd>
        <kwd>Diagnostic Precision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Throughout history, technological advances in medical imaging have dramatically changed the way
we diagnose and treat disease [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This change began with the invention of X-rays over a century ago, which
allowed imaging of the human body without the need for surgery. The field has since continued to
evolve with new technologies such as magnetic resonance imaging (MRI), computed tomography (CT),
positron emission tomography (PET), and ultrasound, which have helped in accurately diagnosing
many medical conditions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Today, the integration of computer vision, natural language processing
(NLP), and artificial intelligence (AI) has revolutionized the field, opening the door to unprecedented
advances [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        AI technologies, especially those based on deep learning, have shown remarkable potential in diagnosing
various medical conditions quickly and accurately [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These machine learning algorithms significantly
reduce the workload of medical professionals, leading to significant impacts on healthcare and patient
care [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the most exciting developments in this field is medical image captioning (MIC) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
By leveraging deep learning, medical image captioning systems can automatically generate
image captions, combining expert annotations with images from comprehensive datasets to provide
accurate and detailed analyses. These capabilities enhance medical documentation, speed up diagnosis,
and facilitate remote consultations, thereby improving healthcare delivery overall [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Despite these advances, the interpretation of medical images remains a formidable challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Variations in physicians’ levels of expertise can lead to inconsistent diagnoses, and misinterpretations
can result in medical errors that negatively impact patients’ health. Furthermore, reading and analyzing
these images can be time-consuming, especially in emergency settings where rapid decision-making
is critical to the patient’s life. Uterine ultrasound images, in particular, present unique challenges in
obstetrics and gynecology. Their generally low quality compared to other medical images complicates
the interpretation process, which can lead to delayed or incorrect diagnoses and impact patient care [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
The complexity and diversity of these images underscore the need for an effective MIC system, which
is the primary motivation for our research.
      </p>
      <p>Our study aims to address the challenges in interpreting uterine ultrasound images by developing
a specialized MIC system to enhance diagnostic accuracy and efficiency. To this end, we collected a
comprehensive dataset of uterine ultrasound images, prioritizing patient privacy and confidentiality.
This dataset was then carefully annotated using expert-provided descriptions, ensuring high-quality
data for training and evaluation. We then performed extensive data preprocessing, isolating regions of
interest within the images using a cropping algorithm and standardizing text captions using natural
language processing techniques.</p>
      <p>In the feature extraction phase, we used pre-trained convolutional neural network (CNN) models such
as Inception V3 and DenseNet201 to obtain more detailed feature vectors from the images. Meanwhile,
we converted the text data into numerical representations to match the image features. Our deep
learning model combines these processed inputs through a bidirectional gated recurrent unit (BiGRU)
network, generating descriptive captions for ultrasound images. Evaluated using metrics such as BLEU
and ROUGE scores, the CNN-BiGRU model showed promising results in accurately describing uterine
ultrasound images. These results demonstrate the effectiveness of our approach and its potential to
enhance diagnostic accuracy in gynecology, ultimately contributing to improved patient care.</p>
      <p>The remainder of the paper is structured as follows: Section 2 presents a review of related works.
Section 3 details the proposed approaches. In Section 4, we analyze and discuss the experimental results.
Finally, Section 5 offers conclusions and outlines directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Ultrasound imaging is invaluable for visualizing complex anatomical structures, offering advantages
such as portability, real-time imaging, cost-effectiveness, and the absence of radiation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However,
interpreting these images can be challenging due to their often low quality, with common issues such
as fuzzy borders and numerous artifacts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. While numerous studies have focused on medical image
captioning (MIC) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the majority target medical reports for chest X-ray images [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], leaving MIC
for ultrasound images relatively underexplored. This section will delve into MIC research specifically
pertaining to ultrasound images.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a coarse-to-fine ensemble model for ultrasound image captioning is presented. The model
first detects organs using a coarse classification model, then encodes the images with a fine-grained
classification model, and finally generates annotation text describing disease information using a
language generation model. The model, trained using transfer learning from a pre-trained VGG16
model, achieves high accuracy in ultrasound image recognition.
      </p>
      <p>Building on the concept of combining diferent models, [ 12] introduces an NLP-based method to caption
fetal ultrasound videos using vocabulary typical of sonographers. This approach combines a CNN
(based on VGGNet16, fine-tuned on fetal ultrasound images) and an RNN for textual feature extraction.
The CNN extracts image features, while the RNN encodes text features, merging them to generate
captions for anatomical structures. The model is evaluated with BLEU and ROUGE-L metrics and
produces relevant and descriptive captions for educating sonography trainees and patients.</p>
      <p>In [13], a new method for ultrasound image captioning based on region detection is introduced to
improve disease content analysis. The model detects and encodes focus areas in ultrasound images and
then uses LSTM to generate descriptive text. This method increases accuracy in focus area detection and
achieves higher BLEU-1 and BLEU-2 scores with fewer parameters and faster runtimes than traditional
models.</p>
      <p>Expanding on incorporating additional data types, [14] introduces a Semantic Fusion Network to
improve the accuracy of medical image diagnostic reports by integrating pathological information.
This network comprises a lesion area detection model that extracts visual and pathological data and a
diagnostic generation model that combines this information to produce reports. This method enhances
the accuracy of generated reports, showing a 1.2% improvement on the ultrasound image dataset compared
to models relying solely on visual features.</p>
      <p>In a similar vein of enhancing multimodal integration, [15] introduces an Adaptive Multimodal Attention
network to generate high-quality medical image reports. The model employs a multilabel classification
network to predict local properties of ultrasound images, using their word embeddings as semantic
features. It integrates semantic and adaptive attention mechanisms with a sentinel gate to balance focus
between visual features and language model memories. This approach enhances report accuracy and
robustness, outperforming baseline models in capturing key local properties.</p>
      <p>Addressing the challenge of small datasets, [16] presents a weakly-supervised method to enhance image
captioning models using a large anatomically-labeled image classification dataset. This encoder-decoder
model generates pseudo-captions for unlabeled images, creating an augmented dataset that significantly
improves fetal ultrasound image captioning. This approach nearly doubles BLEU-1 and ROUGE-L scores,
saving time on manual annotations and improving model performance in communicating information
to laypersons.</p>
      <p>In [17], a transformer-based model is proposed to generate descriptive ultrasound images of lymphoma,
providing auxiliary guidance for sonographers. The model integrates deep stable learning to eliminate
feature dependencies and includes a memory module for enhanced semantic modeling. Using a nonlinear
feature decorrelation method, this approach visualizes cross-attention for interpretability and focuses
on lymphoma features over the background. The result is a more accurate and detailed depiction of
lymphoma in ultrasound images.</p>
      <p>To further improve automatic report generation, [18] introduces a framework utilizing both unsupervised
and supervised learning to align visual and textual features. Unsupervised learning extracts knowledge
from text reports, guiding the model, while a global semantic comparison mechanism ensures accurate,
comprehensive reports. Tested on three large datasets (breast, thyroid, liver), the method outperforms
other approaches without needing manual disease labels, enhancing efficiency and accessibility.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Proposed Approach</title>
      <p>This study presents a novel uterine ultrasound image captioning system. To achieve this, we first
gathered a diverse dataset of uterine ultrasound images and carefully annotated them with precise
medical terminology, covering women of various ages and pregnancy stages. Our approach involved data
preprocessing for images and text, followed by feature extraction using pre-trained CNN-based models.
Finally, we implemented our proposed deep learning model, CNN-BiGRU. Detailed descriptions of
each module follow in the subsequent sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Collection and Annotation</title>
        <p>Our dataset focuses specifically on gynecology, the branch of medicine that deals with women’s health.
We created a dataset that delves deeper into the details of gynecological imaging to address the specific
challenges doctors face when diagnosing gynecological problems. This section details collecting and
annotating medical images to train the medical image captioning model.</p>
        <p>Our research utilized a dataset of 505 uterine ultrasound images. Data
collection involved acquiring the images from three main sources (see Figure 1). Internally,
we gathered 214 images obtained directly from the Sonoscape SS1-8000 machine. Each image has
a dimension of 1024x768 pixels (width: 1024 pixels, height: 768 pixels) and is stored in JPG format.
Externally, we incorporated data from publicly available datasets to enrich this collection
and capture a wider range of variations. Figure 1 depicts the source distribution: 42% our collected
uterine ultrasound images, 38% Uterine Fibroid Ultrasound Images, and 20% Fetal_Planes_DB.</p>
        <p>From the Mendeley repository, which offered a rich collection
of nearly 1,500 ultrasound images (uterine fibroid ultrasound images [19]), we collaborated with
experts to meticulously review and select a subset of 191 images that best aligned with our research
goals. This selection process involved eliminating images with repetitive features, poor capture of the
region of interest, or other factors that could negatively impact model training. Additionally, the Zenodo
[20] dataset (Fetal Planes DB) provided 450 images, meticulously organized to include four images per
patient, each representing the standard fetal planes of the abdomen, brain, femur, and thorax. From
this collection, we selected a subset of 100 images that best aligned with our research goals, ensuring
comprehensive coverage of fetal anatomy across multiple datasets.</p>
        <p>In the form of captions, annotations were then added to each image in our dataset. These captions
captured key features and findings within the ultrasound images, including identifying anatomical
structures such as the stomach, umbilical vein, femur bones, and brain ventricles and noting potential
abnormalities such as dilated organs or fluid pockets in the brain. We communicated closely with
experts during the annotation process to ensure accuracy and quality. This collaboration helped us
resolve image-related issues and made our dataset more valuable for analysis and research.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Pre-processing</title>
        <p>Data preprocessing is crucial for ensuring the quality and usability of data for subsequent analysis and
modeling [21]. Our study encompasses rigorous processing of images and text to enhance data integrity
and relevance.</p>
        <p>Image processing plays a crucial role in refining collected data. Initially, we analyze and filter the
images to align with project requirements. The first step involves cropping the images to focus on
the Region of Interest (ROI). Upon reading each ultrasound image, we convert it to grayscale if it is
in color. Subsequently, we determine the cropping points by identifying significant changes in pixel
intensity from the image center toward its edges. This process begins by calculating the mean intensity
column-wise for both the right and left halves of the image, as depicted in Figure 2a. Peaks in these
intensity profiles highlight areas of interest, and points where intensity drops below a predefined
threshold (5% of peak value) denote edges of the ROI (see Figure 2b). Using these change points, we
derive precise cropping coordinates to isolate the ROI (Figure 2c). Post-cropping, all images are resized
uniformly to 224x224 pixels, a standard size compatible with many pre-trained neural networks. Each
resized image instance is then converted into a Numpy array and normalized. Normalization involves
scaling pixel values from 0 to 255 to a normalized range of 0 to 1 by dividing each pixel value by 255.
</p>
        <p>Figure 2 illustrates this procedure: (a) the mean intensity variation across the image width for the
left and right halves, with the two detected change points; (b) the original image with crop lines; and
(c) the ROI-cropped image.</p>
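<p>The cropping procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name is hypothetical, grayscale conversion is a simple channel mean, and the nearest-neighbor resize stands in for a proper image-library call.</p>

```python
import numpy as np

def crop_and_normalize(image, threshold_ratio=0.05, size=224):
    # Hypothetical sketch of the ROI-cropping step: scan outward from the
    # image center until the column-wise mean intensity drops below 5% of
    # its peak, crop, resize to 224x224, and normalize pixel values to [0, 1].
    if image.ndim == 3:
        image = image.mean(axis=2)              # crude grayscale conversion
    col_means = image.mean(axis=0)              # mean intensity per column
    center = col_means.size // 2
    cutoff = threshold_ratio * col_means.max()  # 5% of the peak value
    left = center
    while left > 0 and col_means[left] > cutoff:
        left -= 1
    right = center
    while right < col_means.size - 1 and col_means[right] > cutoff:
        right += 1
    roi = image[:, left:right + 1]
    # nearest-neighbor resize (a real pipeline would use OpenCV or PIL)
    rows = (np.arange(size) * roi.shape[0] / size).astype(int)
    cols = (np.arange(size) * roi.shape[1] / size).astype(int)
    return roi[rows][:, cols].astype("float32") / 255.0
```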
        <p>After completing the image processing and preparing the images, we focused on processing the
text captions associated with each image. These captions were derived from expert-provided medical
descriptions and were systematically linked to their corresponding image file names within an Excel
file. To enhance the text data for subsequent analysis, we applied NLP techniques [22]:
• Convert to Lowercase: All sentences were converted to lowercase to maintain consistency and
reduce variability across the dataset.
• Remove Punctuation: Punctuation marks were systematically removed to simplify the text and
emphasize the words.
• Remove Single Letters: Single letters such as ’l’, ’s’, ’a’, and ’à’ were removed, as they typically do
not contribute significant meaning in medical contexts.
• Remove Extra Spaces: Any extraneous spaces within the text were eliminated to ensure uniform
spacing and improve text clarity.
• Add Start and End Tags: Special tags &lt;START&gt; and &lt;END&gt; were appended to the beginning
and end of each sentence. These tags serve as markers during subsequent text processing and
modeling to delineate sentence boundaries effectively.</p>
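<p>The listed cleaning steps amount to a short function; the following sketch (with a hypothetical name) illustrates them in order:</p>

```python
import string

def clean_caption(text):
    # Lowercase, strip punctuation, drop single letters, collapse extra
    # spaces, then add the <START>/<END> boundary tags.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if len(w) > 1]  # also removes 'l', 's', 'a'
    return "<START> " + " ".join(words) + " <END>"
```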
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature Extraction</title>
        <p>Feature extraction is crucial in identifying and describing pertinent information within patterns [23].
This process facilitates pattern classification by establishing a structured and systematic approach. This
phase focuses on deriving meaningful numerical representations from text descriptions and ultrasound
images.</p>
        <p>Text feature extraction involves converting textual data, such as medical reports and captions, into a
format suitable for machine learning models. Initially, we employ a tokenizer to create a dictionary of
word indices from our text data. This step allows us to determine the vocabulary size, represent the
total number of unique words in the dataset, and identify the longest caption’s length. Subsequently, we
construct a vocabulary of unique words to map each word to its corresponding index. Shorter sequences
are padded with zeros to ensure uniform input sequence lengths (captions), as neural networks require
consistent input dimensions.</p>
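<p>In our pipeline this is done with Keras' Tokenizer and pad_sequences; the pure-Python sketch below (hypothetical helper names) mirrors the same logic of building a word-index vocabulary and zero-padding shorter sequences:</p>

```python
def build_vocab(captions):
    # Map each unique word to a positive index; 0 is reserved for padding.
    words = sorted({w for c in captions for w in c.split()})
    return {w: i + 1 for i, w in enumerate(words)}

def texts_to_padded(captions, word_index, max_len):
    # Convert captions to index sequences and pad shorter ones with zeros.
    padded = []
    for c in captions:
        seq = [word_index[w] for w in c.split()]
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded
```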
        <p>In addition to text, image feature extraction in this study utilizes advanced convolutional neural
network architectures, namely Inception V3 and DenseNet201. These models have been pre-trained
on the vast ImageNet dataset [24], which consists of millions of annotated images across thousands of
categories. The key advantage of using these pre-trained models is their ability to capture intricate
patterns and hierarchical representations within images.</p>
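<p>As a sketch, the backbone is used without its classification head and followed by global average pooling, which for DenseNet201 yields the 1920-dimensional feature vectors consumed by our captioning model. Note the weights=None below only keeps the example lightweight; the actual pipeline loads the ImageNet weights.</p>

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.applications.densenet import preprocess_input

# Headless pre-trained backbone plus global average pooling as the extractor.
base = DenseNet201(include_top=False, weights=None, input_shape=(224, 224, 3))
extractor = Model(base.input, layers.GlobalAveragePooling2D()(base.output))

batch = preprocess_input(np.random.rand(1, 224, 224, 3) * 255.0)
features = extractor.predict(batch)  # one 1920-dimensional vector per image
```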
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Proposed Uterine Ultrasound Image Captioning Model</title>
        <p>The proposed uterine ultrasound image captioning model aims to generate meaningful and accurate
captions for medical ultrasound images of the uterus. To achieve high accuracy, the system architecture
incorporates several advanced components. Our dataset consists of 505 images specifically selected to
represent various uterine ultrasound scans commonly encountered in clinical practice.</p>
        <p>As shown in Figure 3, the model’s architecture begins with three input layers. The first input layer
receives features extracted from a DenseNet201 model, shaped as (None, 1920). The second input layer
obtains features from an InceptionV3 model, shaped as (None, 4800). The third input layer receives
tokenized text sequences with a (None, 54) shape, where 54 represents the maximum caption length.
The image features from the DenseNet201 and InceptionV3 models pass through dense layers that
reduce their dimensionality to (None, 256). These outputs are then reshaped into (None, 1, 256) using
reshape layers. Meanwhile, the tokenized text sequences are embedded, resulting in fixed-size tensor
vectors (None, 54, 256).</p>
        <p>The embeddings are concatenated with the reshaped image features, and the combined data is fed into a
bidirectional GRU layer, which processes sequential data bidirectionally and produces an output shape
of (None, 256). A dropout layer with a dropout rate of 0.5 is applied to prevent overfitting. The output
is then passed through an intermediate dense layer that reduces the dimensionality to (None, 128),
followed by another dropout layer with the same rate.</p>
        <p>Finally, a dense layer with a softmax activation function generates the final output, which has a shape
of (None, 626). This represents the predicted caption probabilities for each word in the vocabulary. By
integrating image and text features, this architecture produces accurate and informative captions for
uterine ultrasound images, thereby enhancing medical diagnosis and treatment planning.</p>
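<p>The architecture just described can be sketched with the Keras functional API. Activation functions and the exact concatenation order are assumptions where the text does not specify them:</p>

```python
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB = 54, 626  # maximum caption length and vocabulary size

densenet_in = layers.Input(shape=(1920,))   # DenseNet201 features
inception_in = layers.Input(shape=(4800,))  # InceptionV3 features
text_in = layers.Input(shape=(MAX_LEN,))    # tokenized caption sequence

# Project each image-feature vector to 256 dims and reshape to (1, 256).
d1 = layers.Reshape((1, 256))(layers.Dense(256, activation="relu")(densenet_in))
d2 = layers.Reshape((1, 256))(layers.Dense(256, activation="relu")(inception_in))
emb = layers.Embedding(VOCAB, 256)(text_in)  # (None, 54, 256)

# Concatenate image and text representations along the sequence axis, then
# process bidirectionally with a GRU (2 x 128 = 256 output units).
merged = layers.Concatenate(axis=1)([d1, d2, emb])
x = layers.Bidirectional(layers.GRU(128))(merged)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(VOCAB, activation="softmax")(x)  # next-word probabilities

model = Model([densenet_in, inception_in, text_in], out)
```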
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this section, we detail the experimental setup and analyze the results of our image captioning
model. Regarding configuration, we divided the dataset, allocating 85% for training and 15% for testing
(validation). Furthermore, the model parameters were configured with the Adam optimizer, a batch size
of 16, and an early stopping patience of 10 epochs.</p>
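<p>For illustration, the 85/15 split can be sketched as follows; the shuffling strategy and fixed seed are assumptions rather than reported details:</p>

```python
import random

def split_dataset(items, test_ratio=0.15, seed=42):
    # Shuffle deterministically, then hold out the last 15% for testing.
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_ratio))
    return items[:cut], items[cut:]
```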
      <p>In this analysis, we evaluate the performance of the proposed model and compare it with other models
to ensure the accuracy of the evaluation. To test the captions generated by the model, we used several
metrics such as BLEU and ROUGE scores. BLEU scores (BLEU1, BLEU2, BLEU3, and BLEU4) are
commonly used in machine translation to measure the n-gram similarity between generated captions and
ground-truth references. ROUGE scores (ROUGE1, ROUGE2, and ROUGEL) are used to evaluate
text summarization, where ROUGE1 and ROUGE2 measure the recall of unigrams and bigrams, while
ROUGEL evaluates the recall of the longest common subsequence between generated captions
and references. These metrics help evaluate the model’s ability to generate accurate and relevant
captions for uterine ultrasound images.</p>
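<p>As a concrete example of one of these metrics, ROUGE-L recall reduces to a longest-common-subsequence computation over token sequences; a minimal sketch:</p>

```python
def rouge_l_recall(reference, candidate):
    # ROUGE-L recall: LCS length of the token sequences / reference length.
    ref, cand = reference.split(), candidate.split()
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / len(ref)
```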
      <sec id="sec-4-1">
        <title>4.1. Performance Analysis of the Proposed CNN-BiGRU Model</title>
        <p>This analysis evaluates the performance of a CNN-BiGRU model that leverages powerful feature
extraction capabilities using pre-trained Inception V3 and DenseNet201 architectures. Additionally, the
model builds on BiGRU’s strength in temporal sequence modeling, which helps it generate accurate
and context-appropriate captions for uterine ultrasound images.</p>
        <p>Figure 4 displays the learning curves for the loss of the proposed CNN-BiGRU model during both the
training and validation phases. These curves indicate that the model was trained appropriately, with no
signs of overfitting. The training and validation losses were closely aligned throughout the training
process, a positive indicator of the model’s generalization capability.</p>
        <p>Specifically, at epoch 29, the model achieved a training loss of 1.64 and a validation loss of 1.86. These loss
values suggest that the model effectively learned the underlying patterns in the data while maintaining
a balance between fitting the training data and generalizing it to unseen validation data. The training
was terminated at epoch 39 due to the early stopping criterion, set with a patience of 10 epochs. This
means that the model stopped training when there was no significant improvement in the validation
loss for 10 consecutive epochs, thereby preventing overfitting and ensuring that the model maintained
its performance on the validation set.</p>
        <p>Achieving a low loss in image captioning is challenging because it requires understanding visual content,
recognizing objects and relationships, and translating this into coherent text. Variability in descriptions
and sequential dependency in caption generation add complexity. Additionally, aligning visual features
with textual representations involves bridging the gap between two different data modalities (i.e., images
and text).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparison with Baseline Models</title>
        <p>Our study explored various architectures integrated with different processing layers to generate captions
for uterine ultrasound images. We primarily focused on using DenseNet201 and InceptionV3 models
for feature extraction, followed by BiGRU, as well as baseline models such as Unidirectional GRU
(UniGRU), Bidirectional Long Short-Term Memory (BiLSTM), and Unidirectional Long Short-Term
Memory (UniLSTM) networks.</p>
        <p>To provide a comprehensive comparison, we evaluated the performance of these models using several
metrics, including BLEU and ROUGE scores. Higher scores indicate better performance. As depicted in
Figure 5a, the BLEU scores for our models showed that BiGRU and BiLSTM outperformed the baseline
models UniGRU and UniLSTM, with BiGRU achieving the highest BLEU-4 score of 0.55. At the same
time, the ROUGE scores highlighted BiGRU as the best performer, with a ROUGE-L score of 0.78, as
shown in Figure 5b.</p>
        <p>Figure 5 presents (a) the BLEU scores, (b) the ROUGE scores, and (c) the training and validation
losses ("Loss" and "Val_Loss") for the UniLSTM, UniGRU, BiLSTM, and BiGRU models. Table 1 contrasts
reference and generated captions; the pairs differ only slightly, e.g., "a white straight line at the top
center that represents the femur bone, it is possible to calculate the femur length, the knee is straight"
versus "a white line at the top center that represents the femur bone, it is possible to calculate the femur
length, the knee is straight", and "a large slightly oval circle that represents the cranial contour of the
fetus, inside it is possible to see the cavum of the septum pellucidum on the right only, but it is possible
to calculate the biparietal diameter" versus "a large slightly oval circle that represents the cranial contour
of the fetus, the cavum of the septum pellucidum can be seen on the right, it is possible to calculate the
biparietal diameter".</p>
        <p>We also analyzed the training loss and validation loss for the selected models (see Figure 5c). The values
of "Loss", the training loss calculated on the training dataset, and "Val_Loss", the validation loss
calculated on the validation dataset, are important indicators of model performance. Our
results showed that the BiGRU model achieved the lowest loss values (as shown in Figure 4), with a
training loss of 1.64 and a validation loss of 1.86. This indicates the robustness and effectiveness of the
model in generating accurate and context-appropriate annotations of uterine ultrasound images.</p>
        <p>BiGRU’s superiority lies in its ability to capture dependencies in both directions within sequences,
a key feature for understanding context and generating accurate and consistent annotations. Unlike
unidirectional models, BiGRU can process data in both forward and backward directions, providing
a more comprehensive understanding of temporal context. This feature makes BiGRU particularly
suitable for complex tasks such as image caption generation, where different parts of an image need to be
accurately linked to their corresponding text while maintaining contextual consistency between them.</p>
        <p>The CNN-BiGRU model outperformed the other models in terms of BLEU and ROUGE scores and
showed lower loss values, proving its effectiveness in this application. In addition, Table 1 provides
further evidence by comparing the reference captions with the captions generated by the model.
This comparison clearly shows the model’s ability to generate high-quality captions thanks to its
bidirectional processing capabilities.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this study, we successfully developed a deep learning-based medical image interpretation system
specifically designed for uterine ultrasound images using the CNN-BiGRU architecture. Our model
effectively combined the image feature extraction capabilities of pre-trained CNNs (InceptionV3 and
DenseNet201) with the sequential processing power of a bidirectional recurrent unit network. Through
experimental study, this hybrid approach demonstrated superior performance over baseline models,
achieving higher BLEU and ROUGE scores and maintaining low training and validation losses. The
resulting captions were accurate and informative, improving the interpretability of complex uterine
ultrasound images.</p>
      <p>Our research findings demonstrate the potential of deep learning techniques to enhance diagnostic
accuracy and efficiency in obstetrics and gynecology. By automating the interpretation process, our model
helps medical professionals make accurate and timely diagnoses and provide helpful second opinions,
potentially improving patient outcomes.</p>
      <p>The CNN-BiGRU model has shown good results, and several directions remain for future work.
One priority is to expand the dataset with uterine ultrasound images from different sources, which
would make the model more robust and accurate. New techniques, such as attention mechanisms or
transformer-based models, could also be explored to improve the quality of the generated captions. In
addition, a system that annotates images in real time could be developed for clinical use. Creating
user-friendly interfaces with feedback and integrating medical data from different sources could help
provide more comprehensive diagnostic tools.</p>
    </sec>
    <sec id="sec-6">
      <title>Data Availability</title>
      <p>The data supporting the study&#8217;s conclusions are available from the corresponding authors upon request.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly to rephrase text and
perform grammar and spelling checks. After using these tools, the authors reviewed and edited the content as
needed. The authors take full responsibility for the publication&#8217;s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. A. Haidekker, Medical Imaging Technology, Springer New York, 2013. doi:10.1007/978-1-4614-7073-1.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] W. G. Bradley, History of medical imaging, Proceedings of the American Philosophical Society 152 (2008) 349&#8211;361.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. Obuchowicz, M. Strzelecki, A. Pi&#243;rkowski, Artificial Intelligence in Medical Imaging and Image Processing, MDPI, 2024. doi:10.3390/books978-3-7258-1260-8.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] K. Suzuki, Overview of deep learning in medical imaging, Radiological Physics and Technology 10 (2017) 257&#8211;273. doi:10.1007/s12194-017-0406-5.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] D. Kaul, H. Raju, B. K. Tripathy, Deep Learning in Healthcare, Springer International Publishing, 2021, pp. 97&#8211;115. doi:10.1007/978-3-030-75855-4_6.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D.-R. Beddiar, M. Oussalah, T. Sepp&#228;nen, Automatic captioning for medical imaging (MIC): a rapid review of literature, Artificial Intelligence Review 56 (2022) 4019&#8211;4076. doi:10.1007/s10462-022-10270-w.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] L. Xu, Q. Tang, J. Lv, B. Zheng, X. Zeng, W. Li, Deep image captioning: A review of methods, trends and future challenges, Neurocomputing 546 (2023) 126287. doi:10.1016/j.neucom.2023.126287.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] B. Levienaise-Obadia, A. Gee, Adaptive segmentation of ultrasound images, Image and Vision Computing 17 (1999) 583&#8211;588. doi:10.1016/s0262-8856(98)00177-2.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Chen, Y. Zheng, J.-H. Park, P.-A. Heng, S. K. Zhou, Iterative Multi-domain Regularized Deep Learning for Anatomical Structure Detection and Segmentation from Ultrasound Images, Springer International Publishing, 2016, pp. 487&#8211;495. doi:10.1007/978-3-319-46723-8_56.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] X.-H. Zeng, B.-G. Liu, M. Zhou, Understanding and generating ultrasound image description, Journal of Computer Science and Technology 33 (2018) 1086&#8211;1100. doi:10.1007/s11390-018-1874-8.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] X. Wang, G. Figueredo, R. Li, W. E. Zhang, W. Chen, X. Chen, A survey of deep learning-based radiology report generation using multimodal data, 2024. URL: https://arxiv.org/abs/2405.12833. doi:10.48550/ARXIV.2405.12833.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Alsharid, H. Sharma, L. Drukker, P. Chatelain, A. T. Papageorghiou, J. A. Noble, Captioning Ultrasound Images Automatically, Springer International Publishing, 2019, pp. 338&#8211;346. doi:10.1007/978-3-030-32251-9_37.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Zeng, L. Wen, B. Liu, X. Qi, Deep learning for ultrasound image caption generation based on object detection, Neurocomputing 392 (2020) 132&#8211;141. doi:10.1016/j.neucom.2018.11.114.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] X. Zeng, L. Wen, Y. Xu, C. Ji, Generating diagnostic report for medical image by high-middle-level visual information incorporation on double deep learning models, Computer Methods and Programs in Biomedicine 197 (2020) 105700. doi:10.1016/j.cmpb.2020.105700.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Yang, J. Niu, J. Wu, Y. Wang, X. Liu, Q. Li, Automatic ultrasound image report generation with adaptive multimodal attention mechanism, Neurocomputing 427 (2021) 40&#8211;49. doi:10.1016/j.neucom.2020.09.084.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Alsharid, H. Sharma, L. Drukker, A. T. Papageorghiou, J. A. Noble, Weakly Supervised Captioning of Ultrasound Images, Springer International Publishing, 2022, pp. 187&#8211;198. doi:10.1007/978-3-031-12053-4_14.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Deng, D. Chen, C. Zhang, Y. Dong, Generating lymphoma ultrasound image description with transformer model, Computers in Biology and Medicine 174 (2024) 108409. doi:10.1016/j.compbiomed.2024.108409.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Li, T. Su, B. Zhao, F. Lv, Q. Wang, N. Navab, Y. Hu, Z. Jiang, Ultrasound report generation with cross-modality feature alignment via unsupervised guidance, 2024. doi:10.48550/ARXIV.2406.00644.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. Yang, Uterine fibroid ultrasound images, 2023. doi:10.17632/n2zcmcypgb.2.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] X. P. Burgos-Artizzu, D. Coronado-Gutierrez, B. Valenzuela-Alcaraz, E. Bonet-Carne, E. Eixarch, F. Crispi, E. Gratac&#243;s, FETAL_PLANES_DB: Common maternal-fetal ultrasound images, 2020. doi:10.5281/zenodo.3904280.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] A. Boulesnane, S. Meshoul, K. Aouissi, Influenza-like illness detection from Arabic Facebook posts based on sentiment analysis and 1D convolutional neural network, Mathematics 10 (2022) 4089. doi:10.3390/math10214089.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Boulesnane, Y. Saidi, O. Kamel, M. M. Bouhamed, R. Mennour, DZchatbot: A medical assistant chatbot in the Algerian Arabic dialect using Seq2Seq model, in: 2022 4th International Conference on Pattern Analysis and Intelligent Systems (PAIS), IEEE, 2022. doi:10.1109/pais56586.2022.9946867.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] A. O. Salau, S. Jain, Feature extraction: A survey of the types, techniques, applications, in: 2019 International Conference on Signal Processing and Communication (ICSC), IEEE, 2019. doi:10.1109/icsc45622.2019.8938371.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60 (2017) 84&#8211;90. doi:10.1145/3065386.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311&#8211;318.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74&#8211;81.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>