TU-Net: Transformer based U-Net for left ventricle MRI segmentation Amit Pandey1, Akansha Singh1, Ajith Abraham2 and Krishna Kant Singh3 1 School of CSET, Bennett University, Gautam Buddha Nagar, India 2 School of AI, Bennett University, Gautam Buddha Nagar, India 3 Delhi Technical Campus, Greater Noida, India

Abstract
Accurate segmentation of the left ventricle in cardiac MRI images is crucial for evaluating cardiac function and diagnosing cardiovascular conditions. Traditional approaches, including the commonly used U-Net architecture, struggle to capture the global contextual information required for precise segmentation. This study introduces TU-Net (also referred to as U-Net MHSA), an enhanced version of U-Net that incorporates Multi-Head Self-Attention (MHSA) in the bottleneck layer to overcome these limitations. By combining the strengths of convolution layers and attention mechanisms, our model effectively captures long-range dependencies while preserving spatial coherence. On the MICCAI 2009 Left Ventricle Segmentation Challenge dataset, U-Net MHSA outperforms the baseline U-Net, achieving higher precision (0.799531) and accuracy (0.797943), with a minor trade-off in the form of slightly reduced recall and Intersection over Union (IoU). The overall results show that integrating MHSA into the U-Net architecture improves medical image segmentation.

Keywords
MRI, Cardiac function, U-Net, Multi-Head Self-Attention, medical image segmentation, Self-Attention

1. Introduction
Medical image segmentation [1] plays a crucial role in modern healthcare, where accurate and precise diagnostic tools such as Magnetic Resonance Imaging (MRI), X-ray, and CT scans [2] are central to clinical decision-making. Traditional approaches such as manual and semi-automatic segmentation rely heavily on human input and are not only limited in accuracy and precision but also time-consuming.
In the last few years, machine learning [3], deep learning [4], and convolutional neural networks [5] have revolutionized the medical imaging field. U-Net [6], a convolutional neural network introduced in 2015, transformed medical imaging with its distinctive U-shaped architecture and skip connections. Through skip connections, U-Net concatenates low-level features with high-level features for more accurate and precise segmentation of medical images. Despite its many advantages and successes, U-Net also has limitations. The initial layers of the encoder path produce poor feature-map representations, and these feature maps are nevertheless passed through skip connections, adding little value while increasing time and space complexity. U-Net is also unable to handle long-range dependencies and parallel computation. To address these limitations, we propose TU-Net, a hybrid model that integrates MHSA [7] into the bottleneck of the U-Net architecture. TU-Net aims to combine the strengths of both architectures, delivering better performance by capturing global image context while retaining fine-grained spatial features, which is essential for accurate and precise segmentation. In the following sections we explain self-attention, the MHSA block, and the U-Net architecture in detail.

ProfIT AI 2024: 4th International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2024), September 25–27, 2024, Cambridge, MA, USA
e21soep0035@bennett.edu.in (A. Pandey); akansha1.singh@bennett.edu.in (A. Singh); ajith.abraham@bennett.edu.in (A. Abraham); Krishnaiitr2011@gmail.com (K.K. Singh)
ORCID: 0009-0000-1317-952X (A. Pandey); 0000-0002-5520-8066 (A. Singh); 0000-0002-0169-6738 (A. Abraham); 0000-0002-6510-6768 (K.K. Singh)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Methodology
In this section, we explain the methodology used in developing TU-Net, a novel architecture that improves on the U-Net baseline model with Transformer-based Multi-Head Self-Attention (MHSA) for left ventricle MRI segmentation. The steps of our model are shown in Figure 1. In the first step, the input image passes through the encoders; in the second step, the output of the last encoder passes into the MHSA block; finally, the output of the MHSA block passes through the decoders to produce the output segmentation map.

2.1. U-Net Architecture with Integration of MHSA Block
The U-Net architecture, shown in Figure 2, is known for its unique U-shaped encoder-decoder design, enabling precise localization and segmentation. In the encoder path, feature maps are extracted by two successive 3x3 convolutions, each followed by a ReLU activation function, after which a 2x2 max-pooling operation down-samples the feature maps. This process is repeated five times, as five encoders are used in the architecture. After the fifth encoder, in the bottleneck section, we integrate the MHSA module, which processes the feature maps received from the last encoder and enables the proposed architecture to capture global context and long-range dependencies within the image. Conversely, in the decoder path, feature maps are up-sampled using 2x2 up-convolutions and concatenated with the feature maps from the corresponding encoder stage. Two successive 3x3 convolutions followed by ReLU activations are then applied; this process is also repeated five times, and a final 1x1 convolution after the last decoder produces the segmentation map.
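The encoder-decoder resolution flow described above can be traced with a minimal NumPy sketch. This is an illustration only, not the actual model: convolutions, channel widths, ReLU activations, and the MHSA block itself are omitted, and single-channel feature maps are assumed, so only the spatial bookkeeping of the five pooling stages, the bottleneck, and the skip concatenations is shown.

```python
import numpy as np

def maxpool2x2(x):
    # 2x2 max pooling: halves each spatial dimension
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x2(x):
    # nearest-neighbour up-sampling: doubles each spatial dimension
    return x.repeat(2, axis=0).repeat(2, axis=1)

# trace a 256 x 256 single-channel input through five encoder stages
x = np.zeros((256, 256, 1))
skips = []
for _ in range(5):
    skips.append(x)      # feature map saved for the skip connection
    x = maxpool2x2(x)
print(x.shape)           # bottleneck resolution, where MHSA is applied

# decoder path: up-sample and concatenate the saved skip features
for skip in reversed(skips):
    x = upsample2x2(x)
    x = np.concatenate([x, skip], axis=-1)
print(x.shape[:2])       # original 256 x 256 resolution restored
```

The sketch prints `(8, 8, 1)` at the bottleneck and `(256, 256)` after decoding, matching the five halvings and five doublings described above.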
Figure 1: Steps of Proposed Model
Figure 2: U-Net with MHSA

2.2. Multi-Head Self-Attention (MHSA)
MHSA is an advanced technique used in transformer models to improve their ability to process information. Instead of relying on a single attention mechanism with queries, keys, and values all having dimensionality dmodel, MHSA divides this process into multiple parallel attention operations. Each of these operations, known as a head, maps the queries, keys, and values into smaller dimensions dk and dv using distinct learned linear projections. Attention is computed in parallel for each head, and the resulting dv-dimensional outputs are concatenated and re-projected to produce the final output. This approach allows the model to focus on various representation subspaces at different positions, whereas a single attention head would average these aspects together. To overcome U-Net's limitation in capturing long-range dependencies, we incorporate MHSA into the bottleneck of the U-Net architecture. MHSA, a concept derived from transformers, allows the model to attend to various parts of the input image simultaneously, thereby capturing global context more effectively, as shown in Figure 1. The TU-Net architecture retains the basic structure of U-Net but integrates MHSA in the bottleneck layer to enhance its ability to capture global information. The self-attention mechanism, shown in Figure 4, works by calculating attention scores between various positions within the input image. It consists of three main components: Query (Q), Key (K), and Value (V). The attention scores A are calculated by taking the scaled dot-product of Q and K and then applying a softmax function to obtain the attention weights, as shown in Equation 1.

A = softmax(QKᵀ / √dk)  (1)

where dk is the dimensionality of the key vectors. These weights are then applied to V to obtain the final output, as shown in Equation 2.
Attention(Q, K, V) = A · V  (2)

This process is performed multiple times in parallel to create MHSA, enabling the model to simultaneously focus on different regions of the image, as illustrated in Figure 3. The step-wise working of MHSA is shown in Figure 5, from left to right. In the first step, the input sequence is passed in. In the second step, each element is embedded (this embedding is needed only before encoder 0). In the third step, the input is split across eight heads and X (or R) is multiplied with the weight matrices. In the fourth step, the attention scores are calculated using the Q, K, and V matrices. In the final step, the Z matrices of all heads are concatenated, multiplied with the weight matrix W0, and the final output is produced.

Figure 3: Multi-head self-attention (MHSA)
Figure 4: Detailed Architecture of Transformer
Figure 5: Detailed Working process of MHSA Module

Table 1
Hyperparameters

Hyperparameter             Value
Image Size                 256 x 256
Batch Size                 64
Epochs                     50
Training Images            4900
Validation Images          500
Test Images                266
Total Parameters           48,195,073 (183.85 MB)
Trainable Parameters       48,183,297 (183.80 MB)
Non-trainable Parameters   11,776 (46.00 KB)

2.3. Training Procedure
The training procedure for the TU-Net architecture uses the MICCAI 2009 [8] Left Ventricle Segmentation Challenge dataset; the dataset details are given in Table 1. Before training, the MRI images were subjected to several preprocessing steps to ensure uniformity and enhance model performance: each image was resized to 256 x 256 pixels, and the pixel intensity values were normalized. To prevent overfitting, data augmentation techniques such as random rotations, shifts, flips, and zooms were applied to the training dataset. The Adam optimizer, known for its efficiency and its ability to handle sparse gradients, was used to train the TU-Net model.
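Equations (1) and (2), together with the head-splitting procedure of Section 2.2, can be sketched in plain NumPy as follows. The dimensions, the single projection matrix per role, and the absence of bias terms are illustrative assumptions, not details taken from our implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equations (1)-(2): A = softmax(Q K^T / sqrt(d_k)), output = A V
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V

def mhsa(X, Wq, Wk, Wv, Wo, num_heads):
    # project, split d_model into num_heads subspaces, attend in
    # parallel, concatenate the head outputs, then re-project with Wo
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = (np.split(Q, num_heads, -1),
             np.split(K, num_heads, -1),
             np.split(V, num_heads, -1))
    out = np.concatenate(
        [attention(q, k, v) for q, k, v in zip(*heads)], axis=-1)
    return out @ Wo

rng = np.random.default_rng(0)
n, d_model = 64, 128   # e.g. 64 bottleneck positions (8 x 8 flattened)
X = rng.standard_normal((n, d_model))
W = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
     for _ in range(4)]
print(mhsa(X, *W, num_heads=8).shape)  # (64, 128)
```

Each head here attends over all 64 positions at once, which is what lets the bottleneck capture long-range dependencies that stacked local convolutions miss.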
A hybrid loss function, combining binary cross-entropy and Dice loss, was employed to balance pixel-wise accuracy with the overlap between ground-truth and predicted masks. During training, the TU-Net model's parameters were iteratively adjusted to minimize the loss function through forward and backward propagation. In the forward pass, the input images were fed through the model to obtain predictions, which were compared to the ground-truth masks to compute the loss. In the backward pass, the computed loss was used to update the model parameters via the Adam optimizer. The model's performance was validated on the 500-image validation set after each epoch, providing insight into its generalization to unseen data; this validation process also guided hyperparameter tuning. After training concluded, the final model was evaluated on a test set of 266 images to gauge its performance in real-world scenarios. The final TU-Net model, incorporating MHSA in the bottleneck layer, comprised 48,195,073 parameters in total, of which 48,183,297 were trainable and 11,776 non-trainable, resulting in a model size of 183.85 MB, as listed in Table 1. The training procedure ensured that the model was well optimized for accurate and reliable segmentation of the left ventricle in MRI images.

2.4. Evaluation Metrics
The performance evaluation encompasses key metrics including precision, recall, specificity, intersection over union (IoU), and an overall accuracy score obtained from the evaluate generator function, offering a comprehensive assessment of segmentation quality.

3. Results
The performance of the TU-Net model with Multi-Head Self-Attention (MHSA) was evaluated against the standard U-Net model using several key metrics: Precision, Recall, Specificity, IoU, and Accuracy. The evaluation was conducted on the MICCAI 2009 Left Ventricle Segmentation Challenge dataset, focusing on the segmentation of the left ventricle in MRI images.
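All of these metrics can be derived from the per-pixel confusion counts of a binary mask. The following NumPy sketch shows the standard definitions on a toy example; the toy arrays are illustrative and are not taken from the paper's data.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    # y_true, y_pred: binary masks with values in {0, 1}
    tp = ((y_true == 1) & (y_pred == 1)).sum()  # true positives
    fp = ((y_true == 0) & (y_pred == 1)).sum()  # false positives
    fn = ((y_true == 1) & (y_pred == 0)).sum()  # false negatives
    tn = ((y_true == 0) & (y_pred == 0)).sum()  # true negatives
    return {
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "iou":         tp / (tp + fp + fn),
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }

# toy example: 8 pixels, 3 of which belong to the foreground
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])
m = segmentation_metrics(y_true, y_pred)
print(m["precision"], m["iou"])
```

These definitions make the precision/recall trade-off reported below concrete: fewer false positives raise precision, while additional false negatives lower both recall and IoU.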
Table 2 below summarizes the comparative results of the two models.

Table 2
Performance Comparison

Model        Precision   Recall     Specificity   IoU        Accuracy
U-Net        0.773880    0.653408   0.996921      0.548658   0.710639
U-Net MHSA   0.799531    0.576392   0.997670      0.503610   0.797943

Precision was higher for the U-Net MHSA model (0.799531) than for the standard U-Net model (0.773880), indicating that the incorporation of MHSA helped reduce false positives. Recall was higher for the U-Net model (0.653408) than for the U-Net MHSA model (0.576392), suggesting that while the U-Net MHSA model produced fewer false positives, it also produced slightly more false negatives. Specificity was slightly better for the U-Net MHSA model (0.997670) than for the standard U-Net model (0.996921); this improvement, albeit small, indicates better performance in correctly identifying negative samples. The IoU metric was slightly lower for the U-Net MHSA model (0.503610) than for the standard U-Net model (0.548658), suggesting that the standard U-Net had slightly better spatial overlap between the predicted and true segmentation masks. Accuracy, evaluated using the evaluate generator function, was significantly higher for the U-Net MHSA model (0.797943) than for the standard U-Net model (0.710639), indicating that the overall correctness of the U-Net MHSA model in segmenting the left ventricle was superior. In addition to the tabular results, Figure 6 presents a comparative graph that visually represents the performance disparities between the convolutional U-Net model and the U-Net MHSA model, highlighting the enhanced accuracy and precision of the U-Net MHSA model despite the trade-off in recall and IoU. Figure 7 presents a visual comparison between U-Net MHSA and U-Net.

Figure 6: Performance comparison of U-Net and U-Net MHSA
Figure 7: Visual Comparison of TU-Net and U-Net

4.
Discussion This study aimed to enhance the U-Net architecture for medical image segmentation by incorporating MHSA into its bottleneck layer. The results indicate that the enhanced model, U-Net MHSA, shows considerable improvements compared to the standard U-Net, especially regarding precision and overall accuracy. Integrating MHSA into the U-Net framework enables the model to more effectively capture long-range dependencies and contextual relationships within the image, which are crucial for precise segmentation. Our findings show that U-Net MHSA achieved a precision of 0.799531 and an accuracy of 0.797943, outperforming the standard U-Net, which had a precision of 0.773880 and an accuracy of 0.710639. These enhancements highlight the benefits of incorporating attention mechanisms to improve the TU-Net’s ability to focus on important features throughout the entire image. However, while U-Net MHSA showed notable gains in precision and accuracy, it did exhibit a slightly lower recall (0.576392) and IoU (0.503610) compared to the standard U-Net, which had a recall of 0.653408 and an IoU of 0.548658. This suggests that although U-Net MHSA is more precise in identifying the left ventricle, it may miss some true positives, leading to a lower recall. The decreased IoU indicates a reduced overlap between predicted and actual segmentations, pointing to a potential area for further optimization. The trade-off between precision and recall observed in our study is a common challenge in segmentation tasks. Precision measures how many of the identified segments are correct, while recall measures how many of the actual segments were identified. Achieving a balance between these metrics is crucial for practical applications, especially in medical imaging, where both false positives and false negatives can have significant consequences. 
One of the strengths of our approach is the ability of MHSA to capture global context, which is often overlooked by traditional convolution operations that primarily focus on local features. By attending to different parts of the image simultaneously, MHSA provides a more comprehensive understanding of spatial relationships, enhancing the model's ability to delineate complex anatomical structures. The overall higher accuracy of U-Net MHSA highlights its robustness and effectiveness for the task of left ventricle segmentation. The additional computational cost introduced by the MHSA module is justified by the performance gains, demonstrating the potential of self-attention mechanisms to improve convolutional neural network architectures.

5. Conclusions
We present U-Net MHSA for medical image segmentation, in particular segmentation of the left ventricle in cardiac MRI images. U-Net MHSA is an advanced architecture; incorporating MHSA into the bottleneck layer has shown significant improvements in precision and overall accuracy, and U-Net MHSA has outperformed the standard U-Net. Whereas the standard U-Net achieved a precision of 0.773880 and an accuracy of 0.710639, U-Net MHSA achieves a precision of 0.799531 and an accuracy of 0.797943. Alongside these benefits, there is some decrease in the recall and Intersection over Union (IoU) values with U-Net MHSA. U-Net MHSA demonstrates the potential of combining convolutional neural network architectures with self-attention mechanisms to improve segmentation performance. Future research should focus on optimizing the attention mechanism and validating the model on different segmentation tasks and datasets to ensure its generalizability and robustness in various clinical scenarios.

Acknowledgements
We would like to express our sincere gratitude to the Department of Computer Science at Bennett University for providing the necessary resources and support throughout this research.
Special thanks to our colleagues and mentors, whose insights and expertise were invaluable in the development and refinement of this study. This work was not funded. We also extend our appreciation to the MICCAI 2009 Left Ventricle Segmentation Challenge for providing the dataset.

References
[1] Singh, Krishna Kant, and Akansha Singh. "A study of image segmentation algorithms for different types of images." International Journal of Computer Science Issues (IJCSI) 7.5 (2010): 414.
[2] Acharya, Raj, et al. "Biomedical imaging modalities: a tutorial." Computerized Medical Imaging and Graphics 19.1 (1995): 3-25.
[3] Singh, Pushpa, et al. "Diagnosing of disease using machine learning." Machine Learning and the Internet of Medical Things in Healthcare. Academic Press, 2021. 89-111.
[4] Sharma, Poonam, and Akansha Singh. "Era of deep neural networks: A review." 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2017.
[5] O'Shea, K. "An introduction to convolutional neural networks." arXiv preprint arXiv:1511.08458 (2015).
[6] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer International Publishing, 2015.
[7] Vaswani, A., et al. "Attention is all you need." Advances in Neural Information Processing Systems (2017).
[8] Cardiac MR Left Ventricle Segmentation Challenge. URL: http://hdl.handle.net/10380/307