<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the MEDIQA-MAGIC Task at ImageCLEF 2025: Multimodal And Generative TelemedICine in Dermatology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wen-wai Yim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asma Ben Abacha</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noel Codella</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Andres Novoa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Malvehy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hospital Clinic of Barcelona</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Health AI</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Stanford University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The second edition of the MEDIQA-MAGIC [1] task builds on last year's challenges [2, 3] with an expanded multimodal dermatology dataset. Participants receive clinical narratives with related images and must complete two subtasks: (1) segmenting regions showing dermatological problems, and (2) answering closed-ended clinical questions based on the provided context. Test sets are annotated by at least three annotators. Questions and answer options are available in both English and Chinese. Six teams competed across both subtasks. The best-performing system for segmenting dermatological consumer health images scored 0.646 Jaccard and 0.785 Dice. For dermatological closed-ended QA, the best system achieved 0.76 accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Question Answering</kwd>
        <kwd>Segmentation</kwd>
        <kwd>Dermatology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Improvements in general-purpose artificial intelligence (AI) models, e.g., ChatGPT and DeepSeek, and
their accessibility to consumers have made them powerful tools for question answering and general
knowledge discovery. Their application and accuracy as medical assistive tools are critical as two trends
emerge: (a) healthcare systems’ increasing adoption of AI into the electronic medical record and
healthcare operations, and (b) patients’ growing empowerment in health information seeking
through the internet.</p>
      <p>In the first MEDIQA-MAGIC task in 2024 [3], we introduced the problem of consumer health
multimodal visual question answering. Participants were given consumer health queries (e.g., “I’ve had this
rash for two weeks, what should I do?”), along with a patient-provided image (e.g., a photo taken with a
mobile device), and were tasked with generating free-text responses. The task is analogous to the asynchronous clinical
questions that can be posed to doctors through email or chat in real healthcare settings – a care delivery
method whose adoption has been shown to be increasing as a way to lower costs [4]. Given the well-documented rate of
physician burnout [5], such technology can be applied to improve physician efficiency by pre-generating
draft responses.</p>
      <p>In the second edition of the MEDIQA-MAGIC task at ImageCLEF 2025 [6], we build upon last year’s
dataset, DermaVQA [7], and its associated challenges [2, 3], extending them with a focus on
closed-ended multimodal dermatology question answering [8]. In this edition, participants were asked to
identify areas of interest in an image based on the patient’s query (e.g., "the rash on an arm"), as well
as to answer structured closed-ended questions (e.g., "are there single or multiple lesions"). These are
critical subtasks that can be used to improve end-to-end free-text response generation, the subject of
the original 2024 challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description and Dataset</title>
      <p>As in the previous edition, participants were given a clinical narrative context along with
accompanying images. The task was divided into two subtasks: (i) segmentation of dermatological
problem regions, and (ii) answering closed-ended questions. The questions, answers, and
answer options were given in both English and Chinese.</p>
      <p>In the first subtask, given each image and the clinical history, participants were tasked with generating
segmentations of the regions of interest for the described dermatological problem. The expected outputs
are binary image files of the same size as the original image.</p>
      <p>In the second subtask, participants were given a patient dermatological query, its accompanying
images, and a closed-ended question with answer choices; the task was to select the correct
answer to each closed question.</p>
      <p>The dataset was created from real consumer health users’ queries and images; the question
schema was created in collaboration with two certified dermatologists. The closed-question schema
– a comprehensive list of clinically relevant, patient-facing questions for dermatological assessments –
included a total of 137 questions. More details can be found in our corresponding dataset
paper [8]. For the challenge, we tested a total of 27 questions, which were the most common and
could be answered using both text and images. These corresponded to nine overall questions when
related questions are grouped (e.g., "anatomic region for affected area 1", "anatomic region for affected
area 2"). The answers were labeled by at least three annotators: two medical scribe annotators and
one biomedical informatics graduate student. Questions and answers were translated into Chinese
by a native Chinese speaker. Further details can be found in the DermaVQA-DAS dataset paper [8].
Congruent with the MEDIQA-M3G edition [3], there were a total of 300, 56, and 100 instances for the training,
validation, and test splits, respectively. Each query had on average three images.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation Methodology</title>
      <p>To leverage the multiple gold-standard masks for segmentation, we used the per-pixel majority vote as the
gold standard for micro-score calculations of the Jaccard and Dice indices used for ranking. The means of the
per-instance max and per-instance mean over all test instances were also reported.</p>
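      <p>The majority-vote gold standard and the corpus-level (micro) metrics can be sketched as follows. This is a minimal illustration, assuming masks are NumPy boolean arrays; the function names are illustrative and not the challenge's official evaluation code.</p>

```python
import numpy as np

def majority_vote(masks):
    """Per-pixel majority vote over a list of annotator masks."""
    stack = np.stack([np.asarray(m).astype(bool) for m in masks])
    # a pixel is foreground when a strict majority of annotators marked it
    return stack.sum(axis=0) * 2 > stack.shape[0]

def micro_jaccard_dice(preds, golds):
    """Corpus-level (micro) Jaccard and Dice: intersections and unions are
    pooled over all instances before the ratio is taken."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, golds))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, golds))
    total = sum(p.sum() + g.sum() for p, g in zip(preds, golds))
    jaccard = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(jaccard), float(dice)
```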
      <p>Because the same dermatological problem may involve multiple sites, there may be related questions
(e.g., "what is the size of the affected area for location 1", "what is the size of the affected area for location
2"). In these cases, the answers to the related questions are collated together. Partial credit was given
when there were partial matches to the gold standard. The evaluation code can be found at github.com/wyim/ImageCLEF-MAGIC-2025.</p>
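      <p>One plausible form of the partial-credit scheme for collated related questions is sketched below. This is a hypothetical illustration; the official definition lives in the linked repository.</p>

```python
def grouped_partial_credit(system_answers, gold_answers):
    """Collate answers to a group of related questions and award the
    fraction of gold answers the system matched (hypothetical sketch;
    see the challenge repository for the official scorer)."""
    gold = set(gold_answers)
    predicted = set(system_answers)
    if not gold:
        return 1.0 if not predicted else 0.0
    return len(gold.intersection(predicted)) / len(gold)
```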
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Fifty-three teams registered for the event. A total of 56 valid completed runs were submitted by six
teams. Table 1 provides a list of participating teams and affiliations. This year’s primary participants
came from academic institutions in the United States, Vietnam, and India.</p>
      <p>Table 2 shows the results for the segmentation task. Despite being calculated differently, the Jaccard and
Dice metrics yielded identical rankings. Table 3 shows the results for the closed-ended question answering
task.</p>
      <p>In the segmentation subtask, all four teams took a fine-tuning approach, with differences in the exact
models employed (e.g., TransUNet, ViT-B, CLIP). The Anastasia team enriched the dataset by applying
image transformation techniques (e.g., rotations, contrast adjustments) and achieved top
performance after including data with all transformations. The IReL, IIT(BHU) team was the only
team that attempted to incorporate textual features. Their strategy used CLIP to embed both text and
visual features, then fed the combined feature vector into a binary classifier to predict
the mask. The remaining teams fine-tuned previously trained skin lesion segmentation models; the
H3N1 team used the DermoSegDiff [15] model, whereas the KLE1 team fine-tuned a Multi-Scale Feature
Fusion Network model [16]. Though these models were trained for skin lesions, it is likely that further
fine-tuning was required to fully adapt them to this new dataset.</p>
      <p>In the closed-ended question-answering subtask, the top two performing teams, H3N1 and DS@GT,
employed multi-step architectures, including both fine-tuned models, LLM APIs, and ensembling
methods. The former divided the task into four parts: (1) preprocessing, (2) information enrichment via
image captioning, (3) fine-tuning and external API calls, and (4) ensembling the models from the previous
step. The latter similarly had several layers: (1) LLM fine-tuning with different models, e.g., Qwen and
LLaMA, (2) a reasoning layer over the output of (1) using Gemini, and (3) an agent layer that additionally
uses RAG to reference the LanceDB dermatology corpus. In contrast, the remaining groups had similar
approaches, which utilized encoders for the images and text. After fusing the text and image features,
the resulting vector was passed to a classification layer.</p>
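      <p>The encoder-fusion pattern used by the remaining groups can be sketched in a few lines. This is a minimal NumPy illustration, assuming pre-computed image and text embeddings and a learned linear head; the dimensions and function name are assumptions, not any team's exact model.</p>

```python
import numpy as np

def fuse_and_classify(img_feat, txt_feat, weights, bias):
    """Concatenate image and text embeddings, apply a linear classification
    layer, and return the index of the highest-scoring answer choice."""
    fused = np.concatenate([img_feat, txt_feat])
    scores = weights @ fused + bias
    return int(np.argmax(scores))
```

In practice the two embeddings would come from fine-tuned vision and text encoders (e.g., CLIP towers), and the linear head would be trained on the closed-ended answer labels.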
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In the segmentation task, the most successful system made use of data augmentation generated
through image transformation techniques (e.g., color contrast changes). This is promising, as other
teams did experiment with skin-lesion-specific segmentation models; however, they were not able to
achieve results as high – suggesting that more data would be required to adapt those models. The use of
textual inputs was tested by only one group, suggesting that this is an area for future exploration.</p>
      <p>Given the unique opportunity afforded by multiple gold annotations and the variety of system outputs,
we investigated the effect of using multiple gold references in different scoring schemas on final system
rankings. To achieve this, we ordered the test set samples randomly and incrementally added more data
until the full test set was covered. Changes in rank were calculated
by taking the L1 norm of the difference between the ranking at each step and the final ranking. We
experimented with several gold standards: three sets in which each instance’s mask is randomly drawn from one of
the annotators (rand1, rand2, rand3); a gold standard using per-pixel majority vote; gold standards created by taking the
intersection and union of all annotators; and a gold standard generated by the STAPLE algorithm, which
estimates the ground truth from the existing gold standards [17]. We calculated Jaccard
and Dice at a corpus level (i.e., the areas of all intersections and unions are summed over all instances before
calculation), shown in Figure 2. Even at the second-to-last sample point, the rankings from the intersection
gold standard did not agree with the other calculations. Interestingly, convergence was observed for rand1 and rand3 by
around 200 samples. However, rand2 did not show similar behavior, suggesting this method remains
sensitive to anomalies inherent in both drawing random samples from the gold standards and choosing instances.
The STAPLE algorithm showed the fastest convergence to its final ranking, suggesting that it is a robust
approximation to the truth.</p>
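      <p>The rank-change statistic can be sketched as follows: rank systems by score at each step, then take the L1 norm of the difference from the final ranking. This is a minimal illustration with assumed tie-breaking by submission order, not the exact analysis script.</p>

```python
import numpy as np

def ranking(scores):
    """Rank systems by score (highest score receives rank 1)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def rank_change(step_scores, final_scores):
    """L1 norm of the difference between a step's ranking and the final one."""
    return int(np.abs(ranking(step_scores) - ranking(final_scores)).sum())
```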
      <p>We additionally calculated macro evaluations at the instance level (i.e., Jaccard is calculated for
each image, then averaged across the dataset). For macro evaluations, we could additionally assess the
effect of taking the mean or maximum Jaccard or Dice among all of the gold-standard masks available
per instance, as shown in Figure 3. Unlike in micro scoring, the intersection converges much faster and the union
is more prone to fluctuation. Majority vote, on the other hand, exhibits some fluctuations but converges
at a similar sample point as with micro scoring. This discrepancy can be attributed to micro scoring allowing large
differences in one or two instances to affect the entire score – whereas in macro calculations,
large differences in one instance will not affect more than the weight of one sample. Finally, it is
interesting to observe macro scoring for Jaccard and Dice using per-instance mean and max values. In
both cases, we see that their rankings are often different from those of the other calculations. Here, again,
the STAPLE algorithm showed the fastest convergence to its final ranking.</p>
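      <p>The macro variant with per-instance mean or max aggregation over multiple gold masks can be sketched as below, assuming boolean mask arrays; this is illustrative, not the official evaluation code.</p>

```python
import numpy as np

def jaccard(pred, gold):
    """Jaccard index between two boolean masks."""
    union = np.logical_or(pred, gold).sum()
    return float(np.logical_and(pred, gold).sum() / union) if union else 1.0

def macro_jaccard(preds, gold_sets, agg=max):
    """Macro Jaccard: score each instance against every available gold mask,
    aggregate per instance with `agg` (e.g., max or statistics.mean),
    then average over the dataset."""
    per_instance = [agg([jaccard(p, g) for g in golds])
                    for p, golds in zip(preds, gold_sets)]
    return sum(per_instance) / len(per_instance)
```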
      <p>Given the multiple gold annotations, it is possible to compute a mean and standard deviation over the gold
mask instances, wherein the STAPLE-computed mask is taken as gold. To quantify how often
systems’ differences are comparable to gold-standard differences, we calculate each system's number of standard deviations
away from the gold masks’ average score and take the average over all instances. Table 4 provides the average
standard deviations for Jaccard and Dice for all submissions. In general, the systems with the best final
scores have lower standard deviations. The best systems, by team Anastasia, were at 3.6 standard
deviations from the average gold mask score; meanwhile, the worst-performing submissions were
at 9 and 19 standard deviations.</p>
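      <p>The Table 4 statistic can be sketched as follows, assuming each instance carries the annotators' scores against the STAPLE mask alongside the system's score; the function name and handling of zero-variance instances are illustrative assumptions.</p>

```python
import numpy as np

def mean_stds_from_gold(system_scores, annotator_scores_per_instance):
    """Per instance, the annotators' scores against the STAPLE mask give a
    mean and standard deviation; measure how many standard deviations the
    system's score lies from that mean, then average over instances
    (illustrative sketch of the Table 4 statistic)."""
    deviations = []
    for sys_score, ann_scores in zip(system_scores, annotator_scores_per_instance):
        mu = float(np.mean(ann_scores))
        sigma = float(np.std(ann_scores))
        if sigma == 0.0:
            continue  # skip instances where all annotators agree exactly
        deviations.append(abs(sys_score - mu) / sigma)
    return float(np.mean(deviations))
```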
      <p>In the closed QA task, three out of nine overall questions had duplicate questions allowing for
multiple site locations (e.g., "1 where is the affected area", "2 where is the affected area", "1 what label
best describes the affected area", "2 what label best describes the affected area"). For each submission, we calculate the mean difference in the number of
unique answers per overall question. For example, the reference may have QUESTION: 1 where is the
affected area, ANSWER: ARM; QUESTION: 2 where is the affected area, ANSWER: LEG; QUESTION: 3
where is the affected area, ANSWER: N/A. The system, in turn, may have QUESTION: 1 where is the affected area, ANSWER: LEG; QUESTION:
2 where is the affected area, ANSWER: N/A; QUESTION: 3 where is the affected area, ANSWER: N/A.
In this case, the difference would be 2. A value less than 0 indicates the system gives a smaller number
of answers than the reference on average; a value close to 0 indicates close agreement on average.
Table 5 shows the results. We see that on average the H3N1 team most often provided fewer answers than
the reference, whereas the second top team would provide more answers than the reference. Most systems
tended to consistently provide either fewer or more answers on average across all three questions. It is worth noting that
the highest-performing system had the closest difference (near 0) for all question categories, suggesting
that their handling of multiple related questions helped their overall performance. Given the spread of
over- and under-answering across submissions, generating the correct number of answers may itself
be challenging.</p>
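      <p>A plausible form of this answer-count statistic is sketched below, counting unique non-N/A answers per grouped question; the exact definition (including the worked example above) follows the challenge evaluation repository, so treat this as an assumption-laden illustration.</p>

```python
def mean_answer_count_diff(system_groups, reference_groups):
    """Mean difference in the number of unique non-N/A answers per grouped
    question; positive means the system answers more sites than the
    reference (sketch only; the official definition is in the evaluation
    repository)."""
    diffs = [len(set(sys_).difference({"N/A"})) - len(set(ref).difference({"N/A"}))
             for sys_, ref in zip(system_groups, reference_groups)]
    return sum(diffs) / len(diffs)
```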
      <p>For the closed QA task, we found that the best systems included multiple models fine-tuned for the task, as
well as some ensembling and aggregation. The use of multimodal large language models was markedly
more successful than the suite of fine-tuned multimodal approaches that relied on a shared embedding
representation fine-tuned for the classification task. This could be because the current dataset
is relatively small, and thus the large language models’ access to external information
became a determining factor.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this challenge, participants benchmarked the consumer health dermatological image segmentation
task as well as the closed VQA task. In general, the performances were promising, with the best segmentation
performance at 3.6 standard deviations from the gold annotators. Meanwhile, closed QA achieved an
accuracy of 0.76.</p>
      <p>The best-performing segmentation systems took fine-tuning approaches along with data augmentation
methods. Only one team explored using the textual clinical history as input, suggesting that this
area can be further explored. In closed VQA, the best-performing teams applied multiple models and
ensembling methods. Successful applications may need to adapt such steps for proper pre- and
post-processing.</p>
      <p>Here, we report the benchmarks for our segmentation and closed VQA subtasks. Exploring the impact of
these subtasks on end-to-end free-text response generation would be an interesting direction for
future studies. Future work includes expanding the dataset to capture more dermatological cases
and demographics. Furthermore, these technologies should be incorporated into real-world clinical
workflows and measured by their ability to increase workflow efficiency.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[5] C. A. Sinsky, T. D. Shanafelt, J. A. Ripp, The electronic health record inbox: Recommendations for relief 37 (2024) 4002–4003.
[6] B. Ionescu, H. Müller, D.-C. Stanciu, A.-G. Andrei, A. Radzhabov, Y. Prokopchuk, L.-D. Ştefan, M.-G. Constantin, M. Dogariu, V. Kovalev, H. Damm, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, B. Bracke, O. Pelka, B. Eryilmaz, H. Becker, W.-W. Yim, N. Codella, R. A. Novoa, J. Malvehy, D. Dimitrov, R. J. Das, Z. Xie, H. M. Shan, P. Nakov, I. Koychev, S. A. Hicks, S. Gautam, M. A. Riegler, V. Thambawita, P. Halvorsen, D. Fabre, C. Macaire, B. Lecouteux, D. Schwab, M. Potthast, M. Heinrich, J. Kiesel, M. Wolter, B. Stein, Overview of imageclef 2025: Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain, 2025.
[7] W.-w. Yim, Y. Fu, Z. Sun, A. Ben Abacha, M. Yetisgen-Yildiz, F. Xia, Dermavqa: A multilingual visual question answering dataset for dermatology, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024. URL: https://api.semanticscholar.org/CorpusID:273232728.
[8] W. Yim, Y. Fu, A. Ben Abacha, M. Yetisgen, N. Codella, R. A. Novoa, J. Malvehy, Dermavqa-das: Dermatology assessment schema (das) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images, CoRR (2025).
[9] A. D. Karishma Thakrar, Shreyas Basavatia, Ds@gt at mediqa-magic 2025, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[10] N. P. H. Le, H. P. D. Huy, H. T. D. Nhat, H. T. Minh, H3n1 at mediqa-magic 2025: Dermosegdiff and dermkem for comprehensive dermatology ai, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[11] K. B. Desai, V. Hiregoudar, I. Kulkarni, R. Dhane, P. Desai, S. C, U. Mudenagudi, R. Tabib, The kasukabe defense group at mediqa-magic 2025: Clinical visual question answering with resource-efficient multi-modal learning, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[12] T. Le, T. Ngo, K. Nguyen, T. Dang, T. Pham, T. Nguyen, Anastasia at mediqa-magic 2025: A multi-approach segmentation framework with extensive augmentation, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[13] K. Tewari, A. Verma, S. Pal, Irel, iit(bhu) at mediqa-magic 2025: Tackling multimodal dermatology with clipseg-based segmentation and bert-swin question answering, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[14] B. Mallanaikar, S. Kekare, P. Desai, S. C, U. Mudenagudi, R. Tabib, A. Savalkar, A. S. Handi, P. Desai, S. Varur, Kle1 at mediqa-magic 2025, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[15] A. Bozorgpour, Y. Sadegheih, A. Kazerouni, R. Azad, D. Merhof, Dermosegdiff: A boundary-aware segmentation diffusion model for skin lesion delineation, 2023. URL: https://arxiv.org/abs/2308.02959. arXiv:2308.02959.
[16] J. Wang, L. Wei, L. Wang, Q. Zhou, L. Zhu, J. Qin, Boundary-Aware Transformers for Skin Lesion Segmentation, Springer International Publishing, 2021, pp. 206–216. URL: http://dx.doi.org/10.1007/978-3-030-87193-2_20. doi:10.1007/978-3-030-87193-2_20.
[17] S. Warfield, K. H. Zou, W. M. Wells, Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation, IEEE Transactions on Medical Imaging 23 (2004) 903–921. URL: https://api.semanticscholar.org/CorpusID:3025202.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2025: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.-w.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2024: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] W.-w. Yim,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of the MEDIQA-M3G 2024 shared task on multilingual multimodal medical answer generation</article-title>
          , in:
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bitterman</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 6th Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>581</fpage>
          -
          <lpage>589</lpage>
          . URL: https://aclanthology.org/2024.clinicalnlp-1.55/. doi:10.18653/v1/2024.clinicalnlp-1.55.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Press</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Mendelsohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Casalino</surname>
          </string-name>
          ,
          <article-title>Electronic communication improves access, but barriers to its widespread adoption remain</article-title>
          , Health Affairs
          <volume>32</volume>
          (
          <year>2012</year>
          ). doi:10.1377/hlthaff.2012.1151.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>