=Paper=
{{Paper
|id=Vol-3651/HeDAI_paper3
|storemode=property
|title=Non-invasive AI-powered Diagnostics: The case of Voice-Disorder Detection
|pdfUrl=https://ceur-ws.org/Vol-3651/HeDAI-3.pdf
|volume=Vol-3651
|authors=Gabriele Ciravegna,Alkis Koudounas,Marco Fantini,Tania Cerquitelli,Elena Baralis,Erika Crosetti,Giovanni Succo
|dblpUrl=https://dblp.org/rec/conf/edbt/CiravegnaKFCBCS24
}}
==Non-invasive AI-powered Diagnostics: The case of Voice-Disorder Detection==
Non-invasive AI-powered Diagnostics: The case of Voice-Disorder Detection (Vision paper)

Gabriele Ciravegna¹, Alkis Koudounas¹, Marco Fantini², Tania Cerquitelli¹, Elena Baralis¹, Erika Crosetti³ and Giovanni Succo³

¹ Politecnico di Torino, Corso Duca degli Abruzzi, Turin, Italy
² ENT Unit, San Feliciano Hospital, Rome, Italy
³ ENT Clinic - Head and Neck Cancer Unit, San Giovanni Bosco Hospital, Turin, Italy
Abstract
This paper proposes a novel pipeline for non-invasive diagnosis and monitoring in healthcare, leveraging artificial intelligence
(AI). The pipeline allows individuals to record various health data using everyday devices and analyze it via AI algorithms
on a cloud-based platform. Experimental results on voice disorder detection demonstrate the effectiveness of the proposed
approach when compared to existing solutions. Additionally, we discuss the positive impact of the pipeline on diagnosis,
prognosis, and monitoring, emphasizing its non-invasive nature. Overall, we think the proposed pipeline might contribute to
advancing AI-driven healthcare solutions with implications for global healthcare delivery.
Keywords
Artificial Intelligence, Non-invasive diagnostics, Voice disorder recognition, Voice analysis
Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
Contact: gabriele.ciravegna@polito.it (G. Ciravegna)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Artificial Intelligence (AI) is increasingly integrated into healthcare, offering opportunities to improve diagnostics, treatment, and patient care. Through machine learning algorithms, AI systems can analyze medical data, providing clinical decision support and personalized treatment options [1]. Non-invasive diagnostics is a critical aspect of modern healthcare delivery, offering patients a less intrusive and more comfortable care experience. This holds for both the diagnostic and the monitoring processes, improving the overall quality of life for patients [2]. Beyond enhancing patient comfort, these methods also improve accessibility to healthcare services, particularly for underserved populations or those with limited access to specialized medical facilities.

We propose an AI-based framework for non-invasive diagnostics that integrates various data modalities, including voice recordings, self-pictures, typing patterns, ECG readings, and sleep analysis, among others. These diverse sources of data are collected through everyday devices such as computers, smartphones, and smartwatches, enabling convenient and continuous monitoring of individual health metrics. Subsequently, these data are fed into cloud-based applications where advanced AI algorithms analyze them, identifying patterns that may be indicative of underlying health issues. Moreover, the integration of multiple data sources allows for a holistic understanding of an individual's health, facilitating early detection and personalized intervention strategies.

The envisioned framework has the potential to improve healthcare delivery as well as have economic and technological impacts. First, by enabling large-scale screening, it may increase early detection, allowing for timely intervention and improved treatment outcomes. Second, by analyzing the evolution of patient data over time, healthcare providers may tailor interventions, optimizing treatment efficacy and patient satisfaction. From an economic perspective, the improved efficiency of diagnosis and treatment may reduce overall healthcare costs. Finally, the developed models may be employed or fine-tuned in related data-scarce contexts.

As part of our investigation, we conducted a preliminary study on voice disorder detection, a key application of non-invasive diagnostics. We trained a deep learning model for analyzing voice recordings that achieves very high precision in detecting the presence of pathology and accurately identifies the type of pathology. We employed a Transformer model [3], trained end-to-end (E2E) directly on the raw data, outperforming traditional methods such as convolutional neural networks (CNNs) trained on frequency-transformed data. These preliminary results demonstrate the potential of our approach in enabling accurate and efficient non-invasive diagnostics for voice disorders.

The paper begins with an introduction to AI and its medical applications (Section 2). It then outlines the proposed idea and addresses foreseen challenges (Section 3). We then present preliminary results on a voice disorder detection case in Section 4 and conclude with a discussion of the framework's impact (Section 5).
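To make the E2E-versus-CNN distinction drawn above concrete, the sketch below contrasts the two input representations: the raw waveform a Transformer can consume directly, versus the time-frequency representation a CNN pipeline computes first. The signal, frame sizes, and normalization here are illustrative assumptions, not the paper's actual preprocessing.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, a common rate for speech models

# A synthetic 1-second "voice" signal: a 220 Hz tone plus mild noise.
t = np.linspace(0.0, 1.0, SAMPLE_RATE, endpoint=False)
waveform = np.sin(2 * np.pi * 220 * t) + 0.05 * np.random.randn(SAMPLE_RATE)

# (a) E2E input: the raw samples, here simply mean/variance normalized.
e2e_input = (waveform - waveform.mean()) / (waveform.std() + 1e-8)

def spectrogram(x, frame_len=400, hop=160):
    """Magnitude spectrogram from short overlapping windowed frames
    (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# (b) CNN input: a time x frequency matrix derived from the same signal.
cnn_input = spectrogram(waveform)

print(e2e_input.shape)  # (16000,)  -> fed directly to the Transformer
print(cnn_input.shape)  # (98, 201) -> handed to the CNN as a 2-D input
```

The practical difference is that the E2E route leaves all representation learning to the model, while the spectrogram route fixes a hand-designed time-frequency resolution before learning begins.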
2. Background

Deep Learning and Transformers. Deep Learning is a robust method for uncovering patterns and insights from extensive datasets [4]. Unlike conventional machine learning approaches, Deep Learning models learn directly from raw data in an E2E manner, without the need for manual feature engineering. Transformer models are a prominent class of Deep Learning architectures [3], demonstrating the effectiveness of this approach in analyzing sequential and multimodal data [5, 6]. Unlike traditional recurrent and convolutional neural networks, Transformers can capture long-range dependencies and preserve contextual information over extended sequences, thanks to their self-attention mechanisms. This characteristic enables Transformers to process raw multimodal sequential data, such as text or time-series data, as well as images, making them highly adaptable for various applications, including medical diagnostics [7, 8].

DL for Medicine. Deep Learning has emerged as a transformative technology in various healthcare applications. First, DL can analyze electronic health records [9], enabling personalized treatment recommendations and predictive analytics for patient outcomes. In medical imaging [10], DL models can interpret radiological images, including X-rays, CT scans, and MRI images, with levels of accuracy comparable to or even surpassing those of human experts. These models have been utilized for tasks such as disease classification, lesion detection, and tumor segmentation, the latter also performed in real time during surgery. In drug discovery, AI algorithms have been shown capable of developing novel drugs through accurate predictions of protein structures [11]. Finally, DL models have also shown increased capability in handling multi-omics and multi-modal data [12].

3. AI for non-invasive medicine

In this paper, we propose a framework for analyzing an individual's health condition over time through non-invasive AI-powered diagnostics. Our proposed pipeline embodies a user-centric approach, empowering individuals to actively participate in their healthcare journey. From a medical point of view, we consider the analysis of all pathologies detectable by human experts through non-invasive diagnosis. Drawing parallels with advancements in other domains, we argue that AI models can replicate and potentially surpass human diagnostic accuracy in this domain as well. This solution may enable large-scale screening through simple but accurate analysis of patient data collected remotely in a non-invasive way.

Figure 1: Outline of the proposed pipeline. A user records different types of data, e.g., pictures, voice, texts, heart monitors, and sleep conditions. These data are uploaded to the cloud and processed by an artificial intelligence method. The results are visualized on a user-controlled application but can also be shared with a remote doctor, who can require further exams.

Data collection. In Figure 1 we report a visualization of the proposed framework. Leveraging everyday devices such as smartphones, laptops, or wearables, users can effortlessly record diverse types of data. Clearly, each type of device enables the collection of different data types. Laptops and smartphones allow collecting typing and click patterns, texts, voice recordings, and pictures (the last two particularly through smartphones). Wearables, on the other hand, allow (and are already employed for) monitoring heart rate, blood pressure, athletic performance, and sleep conditions, to cite a few.

Data storage and analysis. Upon collection, the data are uploaded to a cloud-based database and analyzed by means of a single advanced artificial intelligence model. The employment of the aforementioned Transformer models accommodates diverse data modalities and sources. By integrating multi-modal data processing capabilities, our framework aims to capture a holistic view of an individual's health status, enabling comprehensive and personalized diagnostics. Moreover, the collection of continuous health data enables taking temporal evolution into consideration and supports predictive modelling. The model may also evolve and improve over time by means of the new data.

Result visualization. The processed results are then presented to users through an intuitive, user-controlled application interface. Through this interface, users gain valuable insights into their health metrics, facilitating informed and proactive decisions regarding their healthcare. Furthermore, a user can transmit the processed data and diagnostic results to their remote healthcare professionals, allowing them to make informed clinical decisions and take timely interventions. Additionally, healthcare providers can utilize the data for population-level health monitoring and epidemiological studies, facilitating the identification of emerging health trends and proactive public health interventions.
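The record, upload, analyze, and share steps described above can be sketched end to end. All names here (HealthRecord, CloudStore, analyze, share_with_doctor) and the stub model are hypothetical illustrations of the envisioned flow, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class HealthRecord:
    """One piece of data recorded by a user on an everyday device."""
    user_id: str
    modality: str          # e.g. "voice", "picture", "heart_rate"
    payload: bytes

@dataclass
class CloudStore:
    """Stand-in for the cloud-based database collecting all records."""
    records: list = field(default_factory=list)

    def upload(self, record: HealthRecord) -> None:
        self.records.append(record)

def analyze(store: CloudStore,
            model: Callable[[HealthRecord], float]) -> dict:
    """Run a single multi-modal model over every stored record,
    keeping the highest risk score seen per user."""
    results = {}
    for rec in store.records:
        results[rec.user_id] = max(model(rec),
                                   results.get(rec.user_id, 0.0))
    return results

def share_with_doctor(results: dict, threshold: float = 0.5) -> list:
    """Flag users whose risk score warrants remote medical follow-up."""
    return [uid for uid, score in results.items() if score >= threshold]

# Usage with a stub model that assigns a high risk to voice recordings.
store = CloudStore()
store.upload(HealthRecord("u1", "voice", b"..."))
store.upload(HealthRecord("u2", "heart_rate", b"..."))
stub_model = lambda rec: 0.9 if rec.modality == "voice" else 0.1
risk = analyze(store, stub_model)
print(share_with_doctor(risk))  # ['u1']
```

In a real deployment the stub model would be the Transformer discussed above, and the sharing step would feed the user-controlled application and, with consent, the remote clinician.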
3.1. Challenges

The effectiveness of the proposed framework also relies on our ability to address and resolve a number of both technical and non-technical issues.

3.1.1. Data quality

The collection of data through personal devices introduces the risk of data noise, which encompasses various factors such as sensor inaccuracies, environmental interference, and incorrect user input. Sensor inaccuracies may lead to erroneous measurements, while environmental interference, such as background noise or lighting conditions, can distort the recorded data. Additionally, users may input incorrect data, such as taking pictures of the wrong part of their body, which can further exacerbate data noise and impact the analysis model. These challenges can hinder the model's ability to accurately interpret and analyze the data, potentially leading to erroneous conclusions and suboptimal performance.

Research direction: data augmentation + input data checking. To address the challenges posed by sensor inaccuracies and environmental interference, a robust data augmentation pipeline can be implemented to mitigate these sources of noise. By incorporating various data augmentation techniques such as noise injection, signal filtering, and data synthesis, the pipeline can generate diverse and representative training data that encapsulates the variability present in real-world scenarios [13]. Specific preprocessing tailored to each data modality can help normalize and enhance the quality of the collected data. Furthermore, to mitigate the issue of incorrect user input, a dedicated model can be employed to validate the accuracy of the collected data. This model can check each type of collected data, such as images or sensor readings, to verify its correctness and flag any discrepancies.

3.1.2. Privacy preservation

The second problem arises from privacy concerns associated with collecting sensitive data and transmitting it to a cloud-based database. Individuals may be hesitant to share sensitive information due to concerns about data security and privacy breaches [14]. Transmitting such data to a cloud-based database further exacerbates these concerns, as it involves relinquishing control over personal information to third-party service providers. Moreover, regulatory frameworks [15] impose stringent requirements on the handling of personal health information, adding complexity to the data collection process.

Research direction: a federated learning approach. The privacy issue can be effectively addressed through the adoption of a federated approach. Federated learning is an area of machine learning studying how to coordinate and distribute training over several models without exchanging raw data. The goal is to mitigate privacy concerns associated with transmitting sensitive data to a centralized cloud-based database. Furthermore, personalized models generated through federated learning may exhibit higher diagnostic accuracy, as they are trained on data specific to each user's health profile.

3.1.3. Human understanding & trustworthiness

The third issue concerns the explainability challenge associated with employing deep learning-based black-box models for predictions. While these models offer impressive performance, their inherent complexity makes them opaque and difficult to interpret. This lack of explainability presents challenges for clinicians and end-users who require insights into the model's decision-making process [16, 17, 18, 19]. Without transparent explanations, stakeholders and regulatory institutions may hesitate to trust the diagnostic and monitoring framework.

Research direction: concept-based XAI models. A possible solution is represented by the employment of eXplainable AI (XAI) algorithms that can shed light on the decisions of the models [20, 21, 22]. In particular, concept-based XAI models offer intrinsic interpretability by mapping raw data to interpretable high-level concepts before making class predictions, offering insights into the model's decision-making process [23, 24, 25]. This approach not only addresses the explainability challenge but also fosters trust and confidence in the diagnostic and monitoring outcomes produced by the system.

4. The case of voice disorder detection

Vocal disorders are prevalent pathologies affecting a significant portion of the population and exerting a substantial impact on patients' quality of life [26, 27, 28, 29]. These disorders may originate from various causes, including both benign and malignant conditions, as well as neurodegenerative disorders [30, 31, 32]. Diagnosis often relies on clinicians' auditory assessments of patients' voices, highlighting the critical need for accurate and timely detection. Here, a DL model is used to analyze the raw recordings, automatically detect patterns indicative of vocal disorders, and distinguish between various pathologies, including nodules, polyps, cysts, spasmodic dysphonia, and vocal cord paralysis.

Preliminary experiments. We demonstrate significant advancements over prior attempts in voice disorder detection using AI models [33, 34], which we re-implemented for a fair comparison. As reported in Figure 2, our approach achieves notable improvements in
accuracy, up to +30%, primarily attributed to the utilization of a Transformer model rather than a Convolutional Neural Network (CNN). The Transformer's inherent ability to process raw time-series data E2E, without any time-frequency preprocessing, offers distinct advantages in analyzing voice data [35]. Additionally, we introduce a robust data augmentation pipeline and consider both vowel- and sentence-based recordings, further enhancing performance by up to +10% compared to the employment of a standard Transformer model alone. As reported in Figure 3, similar considerations also hold for the classification of the macro-category pathology. These enhancements underscore the efficacy of our approach in diagnosing vocal disorders.

Figure 2: Comparative analysis of the test model performance in distinguishing healthy individuals from those afflicted with a pathological condition.

Figure 3: Comparative analysis of the test model performance in identifying the macro pathology that afflicts individuals.

5. Discussion

Overall, the proposed pipeline holds promise for improving healthcare delivery, leveraging AI to enable non-invasive diagnostics and monitoring. The medical team involved in the development of this pipeline also believes in its transformative impact from a social, technological, and economic point of view.

Social Impact. By enabling individuals to record various types of data remotely using everyday devices, the proposed pipeline facilitates non-invasive diagnostic procedures, eliminating the need for expensive and invasive tests. The accessibility of the proposed pipeline extends beyond traditional healthcare settings, allowing individuals in remote or underserved areas to access diagnostic services conveniently. Individuals can be monitored remotely, allowing healthcare providers to track their health status in real time and intervene promptly if abnormalities are detected [36]. Early detection and intervention facilitated by the pipeline can lead to improved prognoses for patients, as healthcare providers can initiate treatment at earlier stages of disease progression.

Technological Impact. The proposed framework has significant technological implications. By leveraging the adaptable nature of the framework, models developed for one medical application can be readily deployed and tested in related contexts, accelerating the pace of medical research and innovation. As an example, the presented voice disorder detection model could be tested on neurodegenerative patients. Additionally, the framework enables the fine-tuning of models for specific cases, including those with limited data availability, such as rare diseases. Despite the scarcity of data in such contexts, the model can still generalize thanks to its original training on a larger and more diverse dataset.

Economic Impact. From an economic point of view, the proposed pipeline has low operating costs due to the utilization of personal devices for data collection and cloud-based analysis. Additionally, the framework facilitates the collection of low-cost, virtuous-cycle data, enabling continuous monitoring and feedback loops that enhance the accuracy and effectiveness of diagnostic and monitoring processes over time. Moreover, the framework lowers the burden of diagnosis and monitoring on public healthcare structures. This, in turn, enables healthcare professionals to focus on more complex and critical medical issues, ultimately improving the efficiency and resource allocation of the healthcare system.

References

[1] P. Rajpurkar, E. Chen, O. Banerjee, E. J. Topol, AI in health and medicine, Nature Medicine 28 (2022) 31–38.
[2] J.-R. Rueda, I. Sola, A. Pascual, M. S. Casacuberta, Non-invasive interventions for improving well-being and quality of life in patients with lung cancer, Cochrane Database of Systematic Reviews (2011).
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[5] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[6] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460.
[7] H. Xiao, L. Li, Q. Liu, X. Zhu, Q. Zhang, Transformers in medical image segmentation: A review, Biomedical Signal Processing and Control 84 (2023) 104791.
[8] M. La Quatra, L. Vaiani, A. Koudounas, L. Cagliero, P. Garza, E. Baralis, How much attention should we pay to mosquitoes?, in: Proceedings of the 30th ACM International Conference on Multimedia, MM '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 7135–7139. doi:10.1145/3503161.3551594.
[9] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, et al., Scalable and accurate deep learning with electronic health records, NPJ Digital Medicine 1 (2018) 18.
[10] K. Suzuki, Overview of deep learning in medical imaging, Radiological Physics and Technology 10 (2017) 257–273.
[11] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al., Highly accurate protein structure prediction with AlphaFold, Nature 596 (2021) 583–589.
[12] M. Lovino, V. Randazzo, G. Ciravegna, P. Barbiero, E. Ficarra, G. Cirrincione, A survey on data integration for multi-omics sample clustering, Neurocomputing 488 (2022) 494–508.
[13] L. Vaiani, A. Koudounas, M. La Quatra, L. Cagliero, P. Garza, E. Baralis, Transformer-based non-verbal emotion recognition: Exploring model portability across speakers' genders, in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge, MuSe '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 89–94. doi:10.1145/3551876.3554801.
[14] X. Liu, L. Xie, Y. Wang, J. Zou, J. Xiong, Z. Ying, A. V. Vasilakos, Privacy and security issues in deep learning: A survey, IEEE Access 9 (2020) 4566–4593.
[15] T. Madiega, Artificial intelligence act, European Parliament: European Parliamentary Research Service (2021).
[16] A. Vellido, The importance of interpretability and visualization in machine learning for applications in medicine and health care, Neural Computing and Applications 32 (2020).
[17] A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, L. Cagliero, L. de Alfaro, E. Baralis, D. Amberti, Exploring subgroup performance in end-to-end speech models, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10095284.
[18] A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, E. Reale, L. Cagliero, S. Cumani, L. de Alfaro, E. Baralis, D. Amberti, Towards comprehensive subgroup performance analysis in speech models, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024). doi:10.1109/TASLP.2024.3363447.
[19] A. Koudounas, F. Giobergia, E. Baralis, Bad exoplanet! Explaining degraded performance when reconstructing exoplanets atmospheric parameters, in: NeurIPS 2023 AI for Science Workshop, 2023. URL: https://openreview.net/forum?id=9Z4XZOhwiz.
[20] G. Vilone, L. Longo, Explainable artificial intelligence: a systematic review, arXiv preprint arXiv:2006.00093 (2020).
[21] E. Pastor, A. Koudounas, G. Attanasio, D. Hovy, E. Baralis, Explaining speech classification models via word-level audio segments and paralinguistic features, in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2024.
[22] G. Ciravegna, P. Barbiero, F. Giannini, M. Gori, P. Lió, M. Maggini, S. Melacci, Logic explained networks, Artificial Intelligence 314 (2023) 103822.
[23] P. Barbiero, G. Ciravegna, F. Giannini, M. Espinosa Zarlenga, L. C. Magister, A. Tonda, P. Lio, F. Precioso, M. Jamnik, G. Marra, Interpretable neural-symbolic concept reasoning, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 1801–1825.
[24] M. Espinosa Zarlenga, P. Barbiero, G. Ciravegna, G. Marra, F. Giannini, M. Diligenti, Z. Shams, F. Precioso, S. Melacci, A. Weller, et al., Concept embedding models: Beyond the accuracy-explainability trade-off, Advances in Neural Information Processing Systems 35 (2022) 21400–21413.
[25] E. Poeta, G. Ciravegna, E. Pastor, T. Cerquitelli, E. Baralis, Concept-based explainable artificial intelligence: A survey, arXiv preprint arXiv:2312.12936 (2023).
[26] N. Roy, R. M. Merrill, S. D. Gray, E. M. Smith, Voice disorders in the general population: prevalence, risk factors, and occupational impact, The Laryngoscope 115 (2005) 1988–1995.
[27] S. M. Cohen, Self-reported impact of dysphonia in a primary care population: An epidemiological study, The Laryngoscope 120 (2010) 2022–2032.
[28] N. Bhattacharyya, The prevalence of voice problems among adults in the United States, The Laryngoscope 124 (2014).
[29] N. Spantideas, E. Drosou, A. Karatsis, D. Assimakopoulos, Voice disorders in the general Greek population and in patients with laryngopharyngeal reflux. Prevalence and risk factors, Journal of Voice 29 (2015) 389.e27.
[30] E. Brunner, K. Eberhard, M. Gugatschka, Prevalence of benign vocal fold lesions: Long-term results from a single European institution, Journal of Voice (2023).
[31] I. Karabayir, S. M. Goldman, S. Pappu, O. Akbilgic, Gradient boosting for Parkinson's disease diagnosis from voice recordings, BMC Medical Informatics and Decision Making 20 (2020).
[32] H. Vieira, N. Costa, T. Sousa, S. Reis, L. Coelho, Voice-based classification of amyotrophic lateral sclerosis: where are we and where are we going? A systematic review, Neurodegenerative Diseases 19 (2020) 163–170.
[33] R. Islam, E. Abdel-Raheem, M. Tarique, Voice pathology detection using convolutional neural networks with electroglottographic (EGG) and speech signals, Computer Methods and Programs in Biomedicine Update 2 (2022) 100074.
[34] X. Xie, H. Cai, C. Li, Y. Wu, F. Ding, A voice disease detection method based on MFCCs and shallow CNN, Journal of Voice (2023).
[35] M. Radfar, A. Mouchtaris, S. Kunzmann, End-to-end neural Transformer based spoken language understanding, in: Proc. Interspeech 2020, 2020, pp. 866–870. doi:10.21437/Interspeech.2020-1963.
[36] D. Apiletti, E. Baralis, G. Bruno, T. Cerquitelli, Real-time analysis of physiological data to support medical applications, IEEE Transactions on Information Technology in Biomedicine 13 (2009) 313–321.