                         Multimodal Medical Data Learning Approaches for Digital
                         Healthcare
                         Oleh Basystiuka, Nataliia Melnykovaa
                         a
                                Department of Artificial Intelligence, Institute of Computer Science and Information Technologies, Lviv
                                Polytechnic National University, Lviv, 79000, Ukraine

                                          Abstract
                                          The integration of multimodal data has emerged as a game-changing strategy in advancing
                                          smart healthcare, allowing for a holistic comprehension of patient health and tailored treatment
                                          strategies. This exploration delves into the journey from raw data to insightful wisdom,
                                          emphasizing the fusion of various data modalities, notably CT scans and retinal photographs,
                                          to drive smart healthcare innovations.
                                          Within this review, we comprehensively examine the fusion of diverse medical data modalities,
                                          aiming to unlock a deeper understanding of patient health. Our focus spans various fusion
                                          methodologies—from feature selection to rule-based systems, machine learning, deep learning,
                                          and natural language processing. Furthermore, we explore the challenges inherent in fusing
                                          multimodal data in healthcare settings.
                                          The central focus revolves around determining the most efficient and accurate approach,
                                          crucial for future research endeavors in Ukrainian language audio-to-text conversion systems.
                                          The goal is to ascertain the most effective strategy that will serve as the foundation for further
                                          advancements in this domain.

                                          Keywords
                                          Speech-to-Text, speech recognition, video recognition, audio recognition, multimodal data,
                                          machine learning, deep neural network, hybrid approaches

                         1. Introduction

                             In the rapidly evolving landscape of smart healthcare, where innovation and data-centric
                         methodologies are reshaping the industry, the convergence of multimodal data stands out as a pivotal
                         force. This paper delves deep into the realm of multimodal medical data fusion, offering a thorough
                         investigation into how diverse data streams converge to generate meaningful insights. It navigates
                         through the intricate process—from initial data collection to translating it into actionable intelligence—
                         depicted through the detailed four-level pyramid showcased in Figure 1.
                             The Data Information Knowledge Wisdom (DIKW) model serves as a conceptual roadmap
                         illustrating how data evolves into profound wisdom [1]. It elucidates a transformative journey where
                         raw data undergoes a metamorphosis into meaningful information, knowledge, and eventually,
                         wisdom—empowering informed decision-making and adept problem-solving [2]. This model
                         recognizes the inadequacy of raw data in driving insights and actions; it underscores the necessity to
                         process, structure, and contextualize data to extract invaluable information. This synthesized
                         information converges with existing knowledge, fostering an understanding that begets knowledge
                         itself. This accrued knowledge becomes a practical tool for making informed decisions and navigating
                         intricate challenges, ultimately culminating in the attainment of wisdom.




                         IDDM’2023: 6th International Conference on Informatics & Data-Driven Medicine, November 17 - 19, 2023, Bratislava, Slovakia
                         EMAIL: oleh.a.basystiuk@lpnu.ua (A. 1); nataliia.i.melnykova@lpnu.ua (A. 2);
                         ORCID: 0000-0003-0064-6584 (A. 1), 0000-0002-3257-3677 (A. 2);
Figure 1: Multimodal data fusion conceptual model
   This study delves into diverse methodologies encompassing feature selection, machine learning,
deep learning, and natural language processing, aiming to fuse multimodal medical data effectively.
Additionally, it confronts a spectrum of challenges, spanning data quality, privacy, security,
processing, analysis, clinical integration, ethics, and result interpretation. Emphasizing the
transformative capacity of multimodal medical data fusion, this paper acts as a springboard for future
research and advancements in smart healthcare. It lays the foundation for enhancing patient care
outcomes and propelling personalized healthcare solutions to the forefront of medical innovation [3].

   The key contributions of this paper are:
      • Utilizing and adapting the established DIKW conceptual model to delineate the evolution
          from data to information to knowledge to wisdom, specifically in the domain of multimodal
          fusion for intelligent healthcare.
      • Reviewing current techniques for representing multimodal data in applications utilizing
           machine learning methodologies, based on a Ukrainian-language dataset.
      • Proposing a comprehensive workflow for handling multimodal medical data, such as video,
          audio, and textual sources, using the Sequence-to-Sequence model flow.
      • Analyzing the challenges and proposing solutions for the algorithmic time complexity
          associated with multimodal data handling, focusing on video, audio, and textual medical
          data, while aligning with the proposed approach, to steer future research paths.


   The remainder of this paper is structured as follows: Section 2 reviews existing techniques for
multimodal data representation in applications based on machine learning approaches. Section 3
proposes a multimodal handling flow encompassing video, audio, and textual medical data, using the
Sequence-to-Sequence model. Section 4 compares the time complexity of multimodal handling
algorithms on video, audio, and textual medical data. Sections 5 and 6 present the discussion and the
comparison and evaluation of the results.

2. Related works

    This research project will use a qualitative research approach. Data will be collected through a
review of relevant literature, case studies, and interviews with experts in the field. The data will be
analyzed using thematic analysis to identify key themes related to the use of multimodal and artificial
intelligence approaches in the distance education process, namely recognizing video streams of
completed course assignments [4].
    Current AI applications in medicine have typically focused on specific tasks using singular data
sources, like a CT scan or retinal photograph. However, clinicians rely on a multitude of data types and
modalities to make diagnoses, evaluate prognosis, and determine treatment plans [5]. Moreover,
existing AI assessments capture only a snapshot in time, failing to perceive health as an ongoing
continuum.
    The potential of AI models extends far beyond these limitations, envisioning their ability to leverage
diverse data sources, including those beyond most clinicians' scope [8]. Multimodal AI models that
integrate video, imaging, text, and audio clinical medical data hold considerable promise: they stand to
bridge this gap by enabling personalized medicine, real-time pandemic surveillance, digital clinical
trials, and virtual health coaching on an unprecedented scale [6, 7].
    This review examines the vast potential of multimodal datasets in healthcare, emphasizing the
transformative possibilities they offer. By incorporating audio and video data into this multimodal
landscape, leveraging machine learning techniques, medical research could attain a new level of depth
and accuracy [9]. For instance, integrating video data from patient interactions or audio data from
diagnostic interviews could enrich these models, leading to more holistic and precise healthcare
solutions.
    For video content processing, we propose applying and categorizing methods that include
sequential comparison, clustering-based global comparison, and event/object-based methods. The most
valuable comparison techniques include sequence finding, classification, frame decoding, and feature
detection and evaluation. The most effective methods are based on artificial intelligence and machine
learning, where deep learning methods prove more effective than conventional ones.
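    To make the frame decoding and feature detection steps above concrete, the following Python
sketch decodes a video with OpenCV and computes ORB descriptors on sampled frames. The sampling
interval and the choice of the ORB detector are illustrative assumptions on our part, not the exact
techniques evaluated in this work.

```python
# Minimal sketch: frame decoding + feature detection with OpenCV.
# The sampling interval and ORB settings are illustrative choices.
import cv2

def extract_frame_features(video_path: str, every_n_frames: int = 30):
    """Decode a video and compute ORB descriptors for sampled frames."""
    cap = cv2.VideoCapture(video_path)
    orb = cv2.ORB_create(nfeatures=500)
    features, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if index % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            _, descriptors = orb.detectAndCompute(gray, None)
            features.append((index, descriptors))
        index += 1
    cap.release()
    return features
```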
    However, this integration presents significant challenges. Nonetheless, with innovative strategies
and advancements in machine learning, overcoming these hurdles becomes increasingly feasible. The
potential benefits of multimodal data utilization in medical research are vast, heralding a paradigm shift
toward more comprehensive, personalized, and effective healthcare solutions [10].
    The research project is expected to provide a system for evaluating and predicting student success
based on feedback. It will identify the different approaches used, their effectiveness, and the challenges
and opportunities associated with their use. It is expected that the implementation of the proposed
approaches will significantly improve information systems for evaluating the quality of learning
outcomes, which will be used to analyze video content and textual content of answers. The project will
create an approach for providing relevant feedback, based on which suggestions for improving the
courses and evaluating their overall effectiveness. In addition, to influence the success of students
during the distance learning process and the quality of their assimilation of subject materials.
    Our roadmap includes devising an expert-level system tailored specifically for hybrid language
translation within the medical sphere. By incorporating multimodal data, including audio and video,
alongside machine learning techniques, we aim to pioneer a transformative approach that could redefine
diagnostic accuracy, treatment protocols, and overall healthcare practices. This pursuit represents an
innovative frontier poised to revolutionize medical research and application.

3. Methods

    One standout deep learning model, known as Seq2Seq (sequence-to-sequence), has demonstrated
remarkable proficiency in tasks like machine translation and text summarization. These models operate
on the principle that the decoder's attention layers can access only the preceding words in the output
sequence, while the encoder's attention layers can access the entirety of the original phrase, fostering
connections that enable an RNN (Recurrent Neural Network) to retain and replicate an entire sequence
of reactions to a stimulus [11].
    Initially, the sequence flows through the encoder, comprising RNNs, culminating in a final
embedding at its conclusion. Subsequently, the decoder utilizes this embedding to predict subsequent
sequences. This process involves using prior hidden states to forecast the succeeding instances within
the sequence, as illustrated by the Sequence-to-Sequence model flow showcased in Figure 2.
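    A compact sketch of this encoder-decoder flow in PyTorch (one of the libraries discussed below)
is given here for illustration; the GRU cells, vocabulary sizes, and hidden dimensions are assumptions
made for the example, and attention is omitted for brevity.

```python
# Sketch of an RNN encoder-decoder: the encoder compresses the input
# into a final hidden state, which conditions the decoder's predictions.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=8000, tgt_vocab=8000, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: the final hidden state summarizes the input phrase.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: conditioned on that embedding, predicts each next token
        # from the previously seen target tokens (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 8000, (4, 12))  # batch of 4 source sequences
tgt = torch.randint(0, 8000, (4, 10))  # shifted target sequences
logits = model(src, tgt)               # shape: (4, 10, 8000)
```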
    To optimize accuracy and time efficiency tailored to our specific context, an in-depth investigation
is essential, exploring the methods delineated earlier. In our prior research, the Sequence-to-Sequence
approach based on Recurrent Neural Networks emerged as the most effective method, particularly when
examining the libraries utilized in constructing machine learning methodologies. Notably, TensorFlow,
Keras, and PyTorch stand out as the most widely employed machine learning libraries in RNN-based
language translation methods. These models play a crucial role in enhancing the effectiveness of
sequence prediction and language translation tasks within our research domain [12, 13].




Figure 2: Sequence-to-Sequence model flow

4. Results

    The research project is expected to provide a system that interconnects different data sources, such
as audio, video, and text, within a multimodal framework, with the ability to reinforce and aggregate
these data. Advancements in speech-to-text technology are rapidly progressing,
opening avenues for scaling its application beyond its current horizons. This expansion is crucial not
just for accuracy and reliability but also for the future credibility and coherence of research endeavors
across diverse fields. The groundwork laid by this research serves as a cornerstone for subsequent
investigations, setting the stage for future platforms to tackle multifaceted challenges.
    The landscape of health data is diverse, posing multifaceted challenges in gathering, linking, and
annotating these multidimensional datasets. These medical datasets vary across several dimensions—
sample size, depth of phenotyping, follow-up intervals, participant interactions, heterogeneity,
standardization, and data linkage. While advancements in science and technology facilitate data
collection and phenotyping, striking a balance among these dataset features remains a challenge. For
instance, while larger sample sizes are ideal for training AI models, achieving deep phenotyping and
sustained longitudinal follow-up escalates costs significantly, making it financially impractical without
automated data collection methods.
    In the realm of medicine, current AI applications tend to focus on specific tasks using singular data
sources like CT scans or retinal photographs. This contrasts starkly with clinicians who rely on a diverse
array of data sources and modalities to diagnose, forecast outcomes, and devise treatment plans.
Moreover, existing AI assessments typically offer singular snapshots, capturing a moment in time rather
than perceiving health as an ongoing continuum.
    Biomedical data often grapples with a common issue: a notable prevalence of missing information.
Although excluding patients lacking data prior to training is feasible in certain scenarios, doing so might
introduce selection bias when external factors contribute to these gaps. Consequently, employing
statistical methods like multiple imputation becomes a preferable approach to tackle these voids.
Imputation stands as a crucial preprocessing stage in numerous biomedical disciplines, spanning from
genomics to clinical data analysis.
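    As an illustrative sketch of this preprocessing stage, the following Python fragment applies
scikit-learn's IterativeImputer, a chained-equations method related to multiple imputation; the toy
feature matrix and settings are our assumptions, not a dataset used in this study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy clinical-style feature matrix with missing entries (np.nan).
X = np.array([
    [63.0, 120.0, np.nan],
    [54.0, np.nan, 5.4],
    [np.nan, 135.0, 6.1],
    [71.0, 128.0, 5.9],
])

# Each feature with missing values is regressed on the others over
# several rounds of refinement.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

    Repeating the fit with sample_posterior=True and different random seeds would yield the multiple
draws used in multiple imputation proper.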
    However, in theory, AI models possess the potential to harness all available data sources, including
those beyond the reach of most clinicians, like genomic medicine. The evolution of multimodal AI
models that amalgamate data across various modalities—spanning biosensors, genomics, epigenomics,
proteomics, microbiomes, metabolomics, imaging, textual, clinical, social determinants, and
environmental data—holds the promise of narrowing this gap. These advanced models pave the way
for diverse applications, including handling multimodal medical data, such as video, audio, and textual
sources, using the Sequence-to-Sequence model flow; the handling times measured for this flow are
compared in Table 1.


Table 1
Comparison of multimodal handling algorithm time complexity across three cases (ten measurements per case)
            Case 1                           Case 2                                Case 3
            0.7020                          0.4150                                 0.5800
            0.8026                          0.4375                                 0.5975
            0.8163                          0.4208                                 0.5308
            0.8441                          0.3920                                 0.6200
            0.7511                          0.5644                                 0.5644
            0.8806                          0.6100                                 0.6050
            0.7507                          0.5835                                 0.5835
            0.6514                          0.5665                                 0.5665
            0.6672                          0.4520                                 0.5520
            0.7535                          0.5390                                 0.5290
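    For reproducibility, the following Python sketch shows how per-case handling times such as those
in Table 1 can be collected with a simple wall-clock harness; the pipeline argument is a hypothetical
stand-in, since the exact implementations behind the three cases are outside the scope of this listing.

```python
import time

def time_pipeline(pipeline, sample, repeats=10):
    """Measure wall-clock time (seconds) for each of `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pipeline(sample)  # hypothetical handling pipeline under test
        timings.append(time.perf_counter() - start)
    return timings
```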

5. Discussion

   As outlined in Section 4, the proposed system interconnects audio, video, and textual data sources
within a multimodal framework, and the groundwork laid by this research serves as a cornerstone for
subsequent investigations, setting the stage for future platforms to tackle multifaceted challenges [14].
   In our preliminary steps, we delineate specific fields and tasks, anticipating the demands future
platforms will confront. Additionally, we delve into analyzing neural network training techniques that
extend beyond machine translation applications. Our findings from Section 3 highlight the potential of
a hybrid approach employing recurrent networks within a sequence-to-sequence model. This approach
holds promise for yielding optimal outcomes, coupling high time efficiency with commendable
accuracy rates.
   The utilization of recurrent neural networks (RNNs) has garnered significant traction, particularly
in audio-to-text translation. Foreseeing advancements in this domain, we anticipate substantial
enhancements in the near future. Harnessing multimodal data—integrating not only audio but also video
data—alongside machine learning techniques stands as a promising frontier in medical research. This
synergy could revolutionize diagnostic accuracy, treatment methodologies, and overall healthcare
practices, paving the way for groundbreaking advancements in the field [15].
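   As one concrete example of the audio side of this pipeline, the sketch below converts a recording
into an MFCC feature sequence suitable as input to an RNN encoder; librosa and the specific feature
settings are assumptions made for illustration, not the exact front end of our system.

```python
import librosa
import numpy as np

def audio_to_sequence(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load an audio file and return a (time_steps, n_mfcc) feature matrix."""
    waveform, sample_rate = librosa.load(path, sr=16000)  # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T  # one feature vector per frame, ready for an RNN encoder
```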

6. Conclusion

   In conclusion, our exploration into multimodal Ukrainian medical data showcased the immense
potential of integrating audio and video data within machine learning-based systems, particularly
employing RNNs (Recurrent Neural Networks). This convergence offers a transformative pathway for
reinforcing medical research endeavors, elevating diagnostic accuracy, treatment methodologies, and
healthcare practices.
   By harnessing the power of RNNs alongside multimodal data, we've unveiled new horizons for
enhanced understanding and analysis within medical research. The fusion of audio and video data
within this framework promises a more comprehensive view of patient health, potentially
revolutionizing diagnostic precision and personalized treatment plans. It also increases the accuracy of
medical data handling and reduces the time spent on paperwork after examinations such as retinal
photography or computed tomography (CT) scans.
   Moving forward, leveraging these machine learning approaches in multimodal Ukrainian medical
data sets the stage for groundbreaking advancements, paving the way for innovative applications and
more effective healthcare solutions. The fusion of audio, video, and textual data within RNN
frameworks not only strengthens the foundations of medical research but also offers a promising avenue
for the future of healthcare practices.

   Funding Statement: This research is funded by the EURIZON Fellowship Program: “Remote
Research Grants for Ukrainian Researchers”, grant No. 138.

7. References

[1] Smit, A. et al. CheXbert: Combining automatic labelers and expert annotations for accurate
     radiology report labeling using BERT. In: Proceedings of the 2020 Conference on Empirical
     Methods in Natural Language Processing, pp. 1500–1519 (2020).
[2] Willemink, M. J. et al. Preparing medical imaging data for machine learning. Radiology 295, 4–
     15 (2020).
[3] M. Havryliuk, et al., "Interactive Information System for Automated Identification of Operator
     Personnel by Schulte Tables Based on Individual Time Series", Advances in Artificial Systems
     for Logistics Engineering III, 180, 372.
[4] Cirillo, D. et al. Sex and gender differences and biases in artificial intelligence for biomedicine and
     healthcare. NPJ Digit. Med. 3, 81 (2020).
[5] Damask, A. et al. Patients with high genome-wide polygenic risk scores for coronary artery disease
     may receive greater clinical benefit from alirocumab treatment in the ODYSSEY OUTCOMES
     trial. Circulation 141, pp. 624–636 (2020).
[6] Vyas, D. A., Eisenstein, L. G. & Jones, D. S. Hidden in plain sight: reconsidering the use of race
     correction in clinical algorithms. N. Engl. J. Med. 383, 874–882 (2020).
[7] O. Basystiuk, N. Melnykova, Multimodal Approaches for Natural Language Processing in Medical
     Data, IDDM 2022 Informatics & Data-Driven Medicine, pp. 246-252.
[8] Z. Rybchak, et al., Analysis of computer vision and image analysis technics, ECONTECHMOD:
     an international quarterly journal on economics of technology and modelling processes, Lublin,
     Poland, 2017, pp. 79-84.
[9] M. Havryliuk, I. Dumyn, O. Vovk. Extraction of Structural Elements of the Text Using Pragmatic
     Features for the Nomenclature of Cases Verification. In: Hu, Z., Wang, Y., He, M. (eds) Advances
     in Intelligent Systems, Computer Science and Digital Economics IV. CSDEIS 2022. Lecture Notes
     on Data Engineering and Communications Technologies, 2023, vol 158. Springer, Cham.
     https://doi.org/10.1007/978-3-031-24475-9_57.
[10] Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–
     38 (2022).
[11] Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit. Med. 4, 5 (2021).
[12] Acosta, J.N., Falcone, G.J., Rajpurkar, P. et al. Multimodal biomedical AI. Nat. Med. 28, 1773–
     1784 (2022). https://doi.org/10.1038/s41591-022-01981-2
[13] Nataliya Shakhovska, et al. "Big Data analysis in development of personalized medical system",
     The 10th International Conference on Emerging Ubiquitous Systems and Pervasive Networks
     (EUSPN), 160, 229-234. (2019)
[14] Yaroslav Tolstyak, Myroslav Havryliuk "An Assessment of the Transplant's Survival Level for
     Recipients after Kidney Transplantations using Cox Proportional-Hazards Model", Proceedings of
     the 5th International Conference on Informatics & Data-Driven Medicine, Lyon, France,
     November 18 - 20, CEUR-WS.org, 2022. pp. 260-265.
[15] Kang, M., Ko, E. & Mersha, T. B. A roadmap for multi-omics data integration using deep
     learning. Brief. Bioinform. 23 (2022).