<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Vision-Language Models in ECG Interpretation: An Exploratory Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sileshi Nibret Zeleke</string-name>
          <email>sileshi.zeleke@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Bochicchio</string-name>
          <email>mario.bochiccio@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Electrocardiogram, Vision-Language Model, Explainable AI, Low-Rank Adaptation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Digital Health National Lab, CINI - Consorzio Interuniversitario Nazionale per l'Informatica</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Electrocardiogram (ECG) interpretation remains a critical yet complex task in cardiovascular diagnostics. With the rise of multimodal learning, vision-language models (VLMs) offer a promising new paradigm for automating ECG analysis and providing explainable decision support. In this study, we evaluate three state-of-the-art VLMs (GPT-4o, PULSE in a zero-shot setting, and PULSE fine-tuned via Low-Rank Adaptation, LoRA) on three benchmark datasets: MIT-BIH, Chapman-Shaoxing, and CPSC-2018. We convert raw ECG signals into high-resolution printout-style images and assess not only abnormality classification but also explanation quality, including factual accuracy, completeness, contextual understanding, and hallucination, using an automated GPT-4-based evaluator. Moreover, our case studies demonstrate the effectiveness of these models in aligning visual ECG signals with clinical language, generating accurate diagnostic summaries, and providing explanations for uncertain predictions. The findings highlight both the strengths and limitations of the selected medical and general-purpose VLMs, offering insights into their readiness for application.</p>
      </abstract>
      <kwd-group>
        <kwd>Electrocardiogram</kwd>
        <kwd>Vision-Language Model</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Low-Rank Adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The electrocardiogram is a fundamental and widely utilized diagnostic tool in healthcare for monitoring
the electrical activity of the heart. Its non-invasive and cost-effective nature makes it a crucial component
in the initial assessment of various cardiac conditions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Interpreting ECG readings, however, often
demands substantial medical expertise to accurately analyze complex signals and integrate them with
patient-specific information [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This process can be resource-intensive and prone to errors, particularly
in settings with limited access to specialized medical personnel.
      </p>
      <p>Recent advancements in artificial intelligence (AI), particularly in the field of VLMs, present a
promising avenue for enhancing ECG interpretation. VLMs, which combine computer vision and
natural language processing, have demonstrated exceptional performance across various multimodal
tasks by integrating visual and linguistic information. There is growing interest in leveraging VLMs
for medical image analysis, aiming to achieve more intelligent and efficient multitask processing.
Despite this potential, the application of VLMs to ECG data remains underexplored.
Meanwhile, VLMs have transformed fields from radiology to pathology by interpreting images and
generating human-quality reports, raising an obvious question: can these same foundations tackle
ECG interpretation? ECG signals possess unique characteristics, including temporal dependencies
and lead-specific information, which differ significantly from the data types VLMs are typically trained
on. Moreover, the effectiveness of fine-tuning strategies, such as Low-Rank Adaptation (LoRA), in
adapting VLMs to the nuances of ECG data has not been thoroughly investigated.</p>
      <p>
        Prior work in ECG AI largely focuses on time-series models or bespoke ECG-text systems such as MEIT
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and ECG-Chat [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which either ignore the visual layout of standard 12-lead printouts or require
massive domain-specific corpora. General-purpose VLMs such as GPT-4o and the ECG-specific PULSE
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], however, are pretrained on image–text pairs and can already “read” chart-style data. Yet a rigorous
evaluation of how off-the-shelf VLMs cope with waveform artifacts, grid lines, and multi-panel layouts,
or of what gains lightweight fine-tuning might unlock, is still missing.
      </p>
      <p>In this work, we ask:
• Baseline capability: How do leading VLMs perform on ECGs without any adaptation?
• Efficient adaptation: Can low-rank LoRA updates boost accuracy and interpretability with
minimal compute and data?
• Clinical relevance: Do qualitative case studies reflect genuine understanding of complex
arrhythmias, and where do these models still fail?
Moreover, we evaluate the performance of GPT-4o, PULSE, and a fine-tuned PULSE using LoRA on
three benchmark ECG datasets: MIT-BIH, Chapman-Shaoxing, and CPSC-2018. By addressing these
aspects, we aim to advance the integration of multimodal AI models into clinical ECG interpretation
workflows.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent years have witnessed significant progress in AI-driven ECG analysis with three emerging
trajectories: clinical reliability, explainable AI, and multimodal integration of ECG data. Although
the use of VLMs in medical images [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], such as in radiology, has been thoroughly investigated, their
application to ECG interpretation is more recent. Compared with the analysis of static images, the
characteristics of the ECG, which is usually represented as a waveform, present both distinctive opportunities and
particular obstacles for the implementation of VLM approaches [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Instruction tuning has demonstrated significant efficacy in the multimodal domain, particularly
in VLMs such as LLaVA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], MiniGPT-4 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and InstructBLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These models exhibit impressive
generalizability across various visual understanding and reasoning tasks. While multimodal instruction
tuning has been applied to general medical imaging tasks, its application to ECG images remains
largely unexplored. A recent study introduced the MEIT framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], fine-tuning existing open-source
large language models for ECG report generation. However, this approach is limited by a single-task
instruction dataset focused solely on report generation, potentially constraining its adaptability to
other ECG-related tasks. Moreover, their work treats ECG data as temporal signals, whereas PULSE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
focuses on encoding ECG images with multimodal large language models, which is more applicable to
real-world scenarios where only printed or digital ECG images are available.
      </p>
      <p>
        The work in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced ECG-CoCa, an ECG encoder trained on ECG-text pairs, alongside ECG-Chat, a modified LLaVA
model capable of processing ECG time series. Moreover, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] created a framework for instruction tuning
that converts ECG-text pairs into chatbot-style instructions and optimizes the linear layers of the LLM
for automated ECG report creation. In another approach, a specifically designed model employs an improved
ResNet-18-based ECG encoder to transform raw ECG signals into a high-dimensional feature space,
which is then carefully aligned with the textual feature space derived from a large language
model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We selected three models to represent a spectrum of capabilities: (1) GPT-4o: A state-of-the-art,
general-purpose multimodal LLM, chosen to establish a powerful zero-shot baseline and assess the out-of-the-box
capability of foundational models on ECG data. (2) PULSE: A recently published VLM specifically
pre-trained on ECG-image-text pairs, chosen to represent the current state-of-the-art in domain-specific
VLMs for ECG interpretation in a zero-shot setting. (3) PULSE+LoRA: Our fine-tuned version of PULSE.
This choice allows us to directly isolate and measure the benefit of parameter-efficient fine-tuning on a
domain-specific model, answering whether lightweight adaptation can bridge the gap between general
pre-training and specialized clinical performance.</p>
      <p>The fine-tuning process aims to enhance PULSE’s capability in understanding and interpreting ECG
images, especially for arrhythmia interpretation. We evaluate the performance of the fine-tuned PULSE
model both quantitatively and qualitatively, comparing it against the original PULSE and GPT-4o models.
Subsection 3.1 presents the fine-tuning data preparation procedure, Subsection 3.2 discusses the
ECG signal-to-image conversion method, and Subsection 3.3 describes the fine-tuning process.</p>
      <sec id="sec-3-1">
        <title>3.1. Fine-tuning Data Preparation</title>
        <p>
          For fine-tuning, we constructed a new dataset composed of three widely used ECG datasets: the
single-lead MIT-BIH arrhythmia database [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the China Physiological Signal Challenge (CPSC-2018) [12],
and the Chapman-Shaoxing ECG dataset [13]. We complemented the ECG signal with comprehensive
textual descriptions selected from reputable medical sources, including clinical textbooks and ECG
interpretation guidelines, to add clinically relevant context to the training data [14]. These textual
annotations capture both general cardiac features and particular waveform qualities. For instance,
“60–100 bpm, upright P waves preceding each QRS complex, normal PR intervals, narrow QRS duration
(&lt;100 ms), and normal T waves indicating preserved conduction” are characteristics of a typical sinus
rhythm.
        </p>
        <p>The ultimate fine-tuning dataset was formatted in a multiple-choice question style as shown in
Figure 1, with a single correct description and a set of plausible distractors for every ECG sample. To
prevent the model from overfitting to a fixed set of abnormality types, we randomly sampled subsets of
applicable abnormalities for each input. This forces the model to pick up on fine-grained morphological
distinctions while maintaining generalization across clinically similar conditions. Moreover, the dataset,
while constructed from multiple public sources and enriched with clinical context, remains limited in
scale compared to the vast datasets typically used for pre-training general-purpose VLMs.</p>
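        <p>As a concrete illustration of this format, the following minimal sketch assembles one multiple-choice
sample with a randomly drawn distractor subset; the file name, label set, and description strings are
hypothetical placeholders rather than excerpts from the actual corpus.</p>
        <preformat>
import json
import random

# Illustrative label-to-description mapping; the real dataset draws its
# descriptions from clinical textbooks and ECG interpretation guidelines.
DESCRIPTIONS = {
    "Sinus rhythm": "60-100 bpm, upright P waves before each QRS, normal PR interval, narrow QRS.",
    "Atrial fibrillation": "Irregularly irregular rhythm with absent P waves.",
    "RBBB": "Wide QRS with terminal R' wave in V1 and wide S waves in leads I and V6.",
    "LBBB": "Wide QRS with notched R waves in V5-V6 and absent lateral Q waves.",
}

def build_mcq_sample(image_path, true_label, n_distractors=3, seed=None):
    """Create one multiple-choice instruction sample for a single ECG image."""
    rng = random.Random(seed)
    # Randomly sample a subset of plausible distractors so the model cannot
    # overfit to a fixed option set.
    distractors = rng.sample([k for k in DESCRIPTIONS if k != true_label], n_distractors)
    options = distractors + [true_label]
    rng.shuffle(options)
    question = "Which description best matches this ECG?\n" + "\n".join(
        f"{chr(65 + i)}. {label}: {DESCRIPTIONS[label]}" for i, label in enumerate(options)
    )
    return {"image": image_path, "question": question, "answer": chr(65 + options.index(true_label))}

print(json.dumps(build_mcq_sample("ecg_00001.png", "RBBB", seed=0), indent=2))
        </preformat>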
      </sec>
      <sec id="sec-3-2">
        <title>3.2. ECG Signal to Image Conversion</title>
        <p>Given that many vision language models are designed to process image data, converting ECG signals
into an image format is a crucial step in leveraging these models for ECG interpretation. Various
techniques have been explored for this conversion, allowing the application of established image
processing methods and semantic segmentation tools. One straightforward approach involves directly
encoding each signal and serializing it into an image of the desired dimensions. We utilize the Python
ECG plot library to facilitate the generation of 12-lead ECG images that resemble clinical displays and
printouts directly from signal data.</p>
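        <p>A minimal sketch of this conversion step is shown below, assuming the ecg_plot package and its
plot/save_as_png helpers; the synthetic signal, sampling rate, and file names are placeholders for the
records actually loaded from MIT-BIH, CPSC-2018, or Chapman-Shaoxing.</p>
        <preformat>
import numpy as np
import ecg_plot  # pip install ecg_plot; assumed API: plot() and save_as_png()

SAMPLE_RATE = 500   # Hz, typical for the 12-lead datasets
SECONDS = 10        # length of a standard printout strip

# Placeholder 12-lead signal (12 x 5000 samples, in mV); in practice this is
# the raw signal read from the source database record.
ecg = 0.1 * np.random.randn(12, SAMPLE_RATE * SECONDS)

# Render a printout-style 12-lead image on the familiar grid and save it as a
# high-resolution PNG that the vision-language model can ingest.
ecg_plot.plot(ecg, sample_rate=SAMPLE_RATE, title="12-lead ECG")
ecg_plot.save_as_png("ecg_00001", "images/")
        </preformat>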
      </sec>
      <sec id="sec-3-3">
        <title>3.3. LoRA Fine-Tuning</title>
        <p>Due to the limited size of our ECG-text dataset, and to avoid catastrophic forgetting and overfitting, we
perform parameter-efficient LoRA fine-tuning, as shown in Figure 2. LoRA trains small additional weight matrices while keeping
the original model unchanged. Each LoRA adapter introduces a low-rank learnable matrix of the same
dimensionality as the target weight matrix to enable eficient adaptation while keeping the pre-trained
model parameters frozen. This matrix is integrated into specific model layers to adjust their behavior
without modifying the original parameters.</p>
        <p>To enable efficient fine-tuning, we modify the original weight matrix W ∈ ℝ^(d×k) by adding a low-rank
adaptation in the form of a rank-constrained matrix ΔW, which is the product of A and B, where A ∈ ℝ^(d×r)
and B ∈ ℝ^(r×k). The rank r of the adaptation acts as a tunable hyperparameter that controls the
trade-off between model capacity and efficiency, with r ≪ min(d, k), where d and k are the working
dimensions of W. During fine-tuning, the base model parameters W are kept frozen, while only the
parameters of A and B are updated. We utilize four weight matrices, denoted as WQ, WK, WV, and WO,
which correspond to the weights of the query, key, value, and output projections within the multi-head
self-attention module, respectively.</p>
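        <p>As a rough PyTorch sketch (not the PULSE implementation), a LoRA adapter around a single frozen
projection can be written as follows; the layer size and initialization details are illustrative.</p>
        <preformat>
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter wrapped around a frozen linear projection."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # W (and its bias) stay frozen
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        # Low-rank factors: delta_W = A @ B with A in R^(d x r) and B in R^(r x k);
        # A starts at zero so the adapted model initially matches the base model.
        self.A = nn.Parameter(torch.zeros(d, r))
        self.B = nn.Parameter(torch.empty(r, k))
        nn.init.kaiming_uniform_(self.B, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x):
        # y = base(x) + scaling * x (A B)^T, computed without materializing A @ B
        return self.base(x) + self.scaling * (x @ self.B.t() @ self.A.t())

# Example: adapt a query projection (WQ); the same wrapper would be applied
# to the key, value, and output projections (WK, WV, WO).
wq = LoRALinear(nn.Linear(768, 768), r=4, alpha=8)
print(sum(p.numel() for p in wq.parameters() if p.requires_grad))  # trainable parameters only
        </preformat>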
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setting</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setting and Implementation</title>
        <p>During the fine-tuning phase, the rank hyperparameter r is empirically set to r = 4 based on
common practices in the literature for similar model sizes, achieving a balance between the number of
added parameters and model performance. The scaling factor α mediates the stability and magnitude of
the low-rank updates; it is set to 8 before the adaptation is integrated with W. These settings allow the
pre-trained model to be fine-tuned efficiently, minimizing the number of trainable parameters while
maintaining adequate expressiveness through appropriately scaled updates. In addition, to improve
tolerance to the context shift between the pretrained representations and ECG patterns, the LoRA bias is
set to lora_only. All models were then optimized to achieve the best performance across all datasets,
including hyperparameter optimization. Training and inference were both conducted on four NVIDIA
RTX A6000 GPUs, each with 48 GB of VRAM, enabling parallel experimentation and efficient handling
of high-dimensional ECG signals. Experiments were implemented in PyTorch and executed on a
multi-GPU workstation with CUDA acceleration.</p>
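        <p>In practice, this configuration can be expressed with the Hugging Face peft library as in the sketch
below; the checkpoint name and the projection-module names are assumptions about the underlying
architecture, and the dropout value is illustrative rather than reported.</p>
        <preformat>
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical identifier; substitute the actual PULSE checkpoint.
base_model = AutoModelForCausalLM.from_pretrained("pulse-ecg-checkpoint")

lora_config = LoraConfig(
    r=4,                       # rank of the low-rank update
    lora_alpha=8,              # scaling factor applied to the update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # WQ, WK, WV, WO
    lora_dropout=0.05,         # illustrative value
    bias="lora_only",          # train only the LoRA biases, as described above
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
        </preformat>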
        <sec id="sec-4-1-1">
          <title>4.1.1. Evaluation Metrics</title>
          <p>We employ several evaluation metrics to assess the performance of VLMs, encompassing ECG signal
analysis and ECG abnormality detection. Given the class imbalance in the datasets, balanced accuracy,
weighted F1, weighted recall, and weighted precision were used to provide a more reliable performance
measure.</p>
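          <p>These imbalance-aware scores can be computed directly with scikit-learn, as in the short sketch
below; the toy labels stand in for the parsed model answers.</p>
          <preformat>
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

def weighted_scores(y_true, y_pred):
    """Class-imbalance-aware summary used to compare the three models."""
    return {
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "weighted_precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "weighted_recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
    }

# Toy example with string labels as produced by parsing the VLM answers.
print(weighted_scores(["RBBB", "LBBB", "AFIB", "RBBB"],
                      ["RBBB", "RBBB", "AFIB", "LBBB"]))
          </preformat>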
          <p>Interpretation Evaluation Metrics. To rigorously assess the quality of the textual explanations
generated, we adopt an automated evaluator using GPT-4. The interpretation generated is judged against
the ground truth on five criteria. Grounded in clinical requirements and recent research on explainable
AI in cardiology [15], the metrics are (i) factual accuracy: to verify that the stated facts correctly reflect
the ECG; (ii) completeness: measures whether the interpretation mentions each key feature highlighted
in the ground truth; (iii) detail quality: assesses the specificity and precision of descriptions; (iv)
context understanding: evaluates the interpretation of the overall cardiac context; and (v) hallucination:
penalizes mentions of elements not present in the ECG image. The GPT-based evaluator scores responses
on a scale from 1 to 5, as found efficient in related studies [16], with 1 indicating poor quality and 5
indicating high quality for the first four metrics; however, for hallucination, the lower the score, the
better.</p>
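          <p>A minimal sketch of such an automated judge is given below, using the OpenAI chat completions
API; the rubric wording, model name, and JSON field names are our own assumptions rather than the
exact prompt used in the study.</p>
          <preformat>
import json
from openai import OpenAI  # assumes the v1 Python SDK and an OPENAI_API_KEY in the environment

client = OpenAI()

RUBRIC = ("Score the candidate ECG interpretation against the reference on a 1-5 scale for "
          "factual_accuracy, completeness, detail_quality, context_understanding, and "
          "hallucination (lower is better). Reply with a JSON object only.")

def score_interpretation(reference: str, candidate: str) -> dict:
    """Ask a GPT-4-class model to grade one generated explanation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # evaluator model; the paper uses a GPT-4-based judge
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Reference: {reference}\nCandidate: {candidate}"},
        ],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
          </preformat>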
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Abnormality Classification Evaluation Result</title>
        <p>We test GPT-4o, PULSE, and our fine-tuned PULSE on three benchmark datasets. As shown in
Table 1, GPT-4o outperforms ECG-based pre-trained and fine-tuned models on the MIT-BIH dataset
in terms of accuracy and precision. However, its performance dropped on the CPSC-2018 dataset.
PULSE+LoRA (fine-tuned) consistently improved over PULSE in most metrics across all datasets,
particularly on Chapman-Shaoxing, where it achieved the highest F1 score of 43.28. This result suggests that
lightweight fine-tuning using LoRA helps models better adapt to domain-specific challenges, including
the integration of spatial and temporal features present in multiple lead signals. The benefit of such
targeted adaptation is especially evident in more complex 12-lead datasets. Despite the improvements,
all models still achieve low overall performance, highlighting the complexity of ECG interpretation and suggesting that
current vision-language models are not yet fully optimized for this task.</p>
        <p>Despite these gains, the overall performance of all models remains low across the board. The
classification performance across all tasks indicates substantial challenges in achieving robust ECG
classification with current VLMs. This underperformance reinforces the notion that ECG signals present
unique challenges, such as temporal complexity, noise variability, and inter-patient heterogeneity, that
current VLM architectures are not yet equipped to handle effectively.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation of Interpretability</title>
        <p>The ability of VLMs to provide accurate, complete, and contextually relevant explanations directly
impacts trust and usability. We selected 150 random test samples to evaluate the interpretation based
on a comprehensive prompt that aligns with established principles for assessing explainability. The
prompt used to evaluate the model offers actionable insights into model performance by systematically
assessing dimensions such as factual accuracy, quality of detail, completeness, and hallucination.</p>
        <p>Table 2 presents a comparative evaluation of explanation quality using GPT-4o as an automatic
evaluator across two models—Finetuned and PULSE—on three ECG datasets. The fine-tuned model
consistently outperforms PULSE in factual accuracy, completeness, and detail quality, particularly on the
CPSC-2018 dataset, indicating a stronger alignment with clinical ground truth and richer explanatory
depth. While both models achieve high Context Understanding scores on CPSC-2018, PULSE exhibits
elevated hallucination rates, especially on MIT-BIH, suggesting limitations in generating trustworthy
outputs. The results highlight the effectiveness of fine-tuning in enhancing explanation fidelity and
reducing model hallucination, with implications for improving the reliability of VLMs in medical AI
applications.</p>
        <p>Furthermore, the explanations were found to be factually accurate, achieving a mean rating of 3.78
on a 5-point scale, which reflects strong alignment with the ground truth. The comprehensive quality
evaluation results indicate that the generated explanations are consistent with the corresponding
classification outcomes, demonstrating coherent and reliable interpretability, as illustrated in Figure 3.
Lower hallucination scores reflect better restraint, and fine-tuning reduces hallucinations. The biggest
drop on MIT-BIH again highlights that adaptation helps the model avoid inventing findings on complex
tracings; however, non-zero hallucination remains, which is clinically unacceptable.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Case Studies</title>
        <p>Figure 3 provides a qualitative comparison of interpretability across three vision-language models when
prompted to analyze a 12-lead ECG and select the correct abnormality type with an explanation. The
ECG case corresponds to a Right Bundle Branch Block (RBBB), providing a benchmark for evaluating
each model’s reasoning fidelity and clinical awareness. Despite the visual richness of ECG data, the
interpretability results reveal important differences in model behavior. PULSE, shown in Figure 3 (c), misclassifies
the ECG as Left Bundle Branch Block (LBBB). Although its explanation highlights prolonged QRS
duration and a QS complex in lead V2, which are partially relevant, it overlooks hallmark features of
RBBB, such as a terminal R’ wave in V1 and wide S waves in leads I and V6. This demonstrates limited
generalization in the pretrained model and suggests that it captures superficial waveform cues without
robust clinical grounding.</p>
        <p>In contrast, the fine-tuned PULSE+LoRA model shown in Figure 3 (d) correctly identifies RBBB and
generates a clinically sound explanation, referencing a wide S wave in lead I and a wide R wave in lead
V1, both diagnostic hallmarks of RBBB. It also correctly cites prolonged QRS duration, offering a more
complete and accurate clinical interpretation. This highlights the efficacy of lightweight fine-tuning in
grounding model outputs in domain-specific knowledge and improving both classification accuracy
and explanation quality. Surprisingly, GPT-4o, shown in Figure 3 (e), a general-purpose multimodal LLM, also
misclassifies the case as LBBB. While its explanation is structurally coherent and references classic
signs of LBBB (e.g., notched R waves in V5 and V6), these features do not apply to the given ECG.
This suggests that GPT-4o may be relying on memorized text patterns or visual heuristics, rather than
integrating the prompt with accurate visual signal interpretation. The model’s high performance on
simpler, single-lead tasks like MIT-BIH may reflect this shallow visual-text alignment, which becomes a
liability in more complex, multi-lead cases requiring fine-grained spatiotemporal reasoning.</p>
        <p>Overall, these results emphasize that interpretation quality matters as much as classification accuracy,
especially in clinical settings where trust and transparency are critical. Fine-tuning with domain data
not only improves predictive performance but also enhances the interpretability of model decisions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this study, we investigated the potential of multimodal large language models for ECG interpretation.
We evaluated three state-of-the-art VLMs for ECG interpretation on three public benchmark
datasets. Our findings underscore the promise of VLMs in enhancing diagnosis and signal interpretation.
The primary contribution of this work is not to present a clinically viable tool, but rather to rigorously
benchmark the zero-shot and lightly-tuned capabilities of these models on a complex medical task
for which they were not specifically designed. The low scores are a significant finding in themselves,
highlighting a critical performance gap and the unique challenges of ECG data that VLMs must overcome.
Furthermore, the observed model fragility across datasets emphasizes the importance of robust validation
across diverse and representative clinical benchmarks.</p>
      <p>Future work should focus on designing VLMs that are specifically tailored to the unique characteristics
of ECG signals, including the integration of mechanisms capable of effectively capturing temporal
dynamics. This involves not only exploring novel LLM architectures but also developing specialized,
efficient time-series encoders suited for ECG data. Critically, establishing the clinical utility, safety, and
reliability of these models through rigorous, real-world validation is essential before their integration
into healthcare practice. While current benchmarks and experimental results are encouraging, the true
impact of VLMs on ECG interpretation will depend on comprehensive clinical evaluation and regulatory
acceptance. Although the evaluation of explanations using GPT-4 provided a scalable and consistent
metric, it represents a form of AI self-assessment that may inherit biases and is not a substitute for
clinical expertise.</p>
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>This work was supported by the Age-It project, which is part of the National Recovery and Resilience
Plan (PNRR) program funded by NextGenerationEU.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Declaration of Generative AI Usage</title>
        <p>During the preparation of this work, the authors used ChatGPT and Grammarly to perform grammar
and spell checks, as well as to paraphrase and reword. After using these tools/services, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Seki</surname>
          </string-name>
          , et al.,
          <article-title>Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ecg image interpretation</article-title>
          ,
          <source>medRxiv</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/ 10.1101/
          <year>2024</year>
          .03.19.24304442. doi:
          <volume>10</volume>
          .1101/
          <year>2024</year>
          .03.19.24304442, preprint.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Quer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Topol</surname>
          </string-name>
          ,
          <article-title>The potential for large language models to transform cardiovascular medicine</article-title>
          ,
          <source>The Lancet Digital Health</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>e767</fpage>
          -
          <lpage>e771</lpage>
          . URL: https://doi.org/10.1016/S2589-
          <volume>7500</volume>
          (
          <issue>24</issue>
          )
          <fpage>00151</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <volume>10</volume>
          .1016/S2589-
          <volume>7500</volume>
          (
          <issue>24</issue>
          )
          <fpage>00151</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arcucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. Zhang,</surname>
          </string-name>
          <article-title>Meit: Multi-modal electrocardiogram instruction tuning on large language models for report generation</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.04945. arXiv:
          <volume>2403</volume>
          .
          <fpage>04945</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Han,
          <string-name>
            <surname>T</surname>
          </string-name>
          . Chen,
          <article-title>Ecg-chat: A large ecg-language model for cardiac disease diagnosis</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2408.08849. arXiv:
          <volume>2408</volume>
          .
          <fpage>08849</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Teach multimodal llms to comprehend electrocardiographic images,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.19008. arXiv:
          <volume>2410</volume>
          .
          <fpage>19008</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Visual instruction tuning,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2304.08485. arXiv:
          <volume>2304</volume>
          .
          <fpage>08485</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartvigsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <article-title>How can time series analysis benefit from multiple modalities? a survey and outlook, 2025</article-title>
          . URL: https://arxiv.org/abs/2503.11835. arXiv:
          <volume>2503</volume>
          .
          <fpage>11835</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          , Minigpt-4:
          <article-title>Enhancing vision-language understanding with advanced large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2304.10592</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          , Instructblip: Towards general
          <article-title>-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06500</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <article-title>Ecg-lm: Understanding electrocardiogram with a large language model</article-title>
          ,
          <source>Health Data Science</source>
          <volume>5</volume>
          (
          <year>2025</year>
          )
          <article-title>0221</article-title>
          . doi:
          <volume>10</volume>
          .34133/hds.0221.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>G. B. Moody</surname>
          </string-name>
          , R. G. Mark,
          <article-title>The impact of the mit-bih arrhythmia database</article-title>
          ,
          <source>IEEE Engineering in Medicine and Biology Magazine</source>
          <volume>20</volume>
          (
          <year>2001</year>
          )
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . doi:
          <volume>10</volume>
          .1109/51.932724.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>