<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the 2025 ImageCLEFtoPicto Task - Investigating the Generation of Pictogram Sequences from Text and Speech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cécile Macaire</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diandra Fabre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Lecouteux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Didier Schwab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ. Grenoble Alpes</institution>
          ,
          <addr-line>CNRS, Grenoble INP, LIG, 38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The automatic generation of a pictogram sequence from either text or speech input is a novel challenge for the NLP community. It has the potential to enhance communication for people with language impairments who rely on Augmentative and Alternative Communication. This paper presents an overview of the second edition of the ImageCLEFtoPicto task. It includes two sub-tasks: Text-to-Picto, whose goal is to produce a comprehensive sequence of pictogram terms from a text input, and Speech-to-Picto, which starts from the speech modality. Compared to last year’s edition, the focus is on developing translation models that are robust across a variety of acoustic domains (read and spontaneous speech) as well as thematic domains (medical, everyday-life situations). This paper details the task with its datasets and evaluation metrics, followed by an overview of the models and runs submitted by the participating teams and their results. The best team achieved 76.98 and 62.87 sacreBLEU points on the Text-to-Picto and Speech-to-Picto sub-tasks respectively, highlighting significant progress on this still-challenging task.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Pictograms</kwd>
        <kwd>Automatic Translation</kwd>
        <kwd>Augmentative and Alternative Communication</kwd>
        <kwd>Multimodal Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The ImageCLEFtoPicto task is part of the ImageCLEF lab initiative, which aims to advance the evaluation
of technologies across a range of tasks (annotation, generation, classification). It offers reusable
benchmarking resources based on a large collection of multimodal data, supporting evaluations in
monolingual, cross-language, and language-independent contexts.</p>
      <p>The toPicto task, now in its second edition, has the goal of generating a pictogram translation
from two different inputs: text or speech. This natural language processing (NLP) challenge introduces
a novel type of multimodal data for training machine learning models. Compared to last year’s edition,
the dataset has been expanded to cover a wider variety of acoustic domains, ranging from read to
spontaneous speech, and of thematic domains, including both medical and everyday-life contexts.
Participants were tasked with building models that are robust across these diverse domains.</p>
      <p>
        Pictogram translation is a new NLP task that has recently gained interest in the research community,
as it can help individuals with language impairments to accurately convey their messages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A
communication disorder refers to a disruption in one of the processes that allow speakers to produce and
listeners to understand spoken, written, or signed messages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These disorders can be induced before,
during, or after birth (congenital disability) or in adulthood, when language and communication skills are
already developed (acquired disability). In such cases, Augmentative and Alternative Communication
(AAC) can be used. AAC is a set of strategies and methods to supplement or compensate for oral
language, used to effectively convey one’s thoughts, needs, and emotions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The pictogram is a central
element: a simplified iconic representation of a concept (a single word, multi-word expression, named
entity, or entire sentence) designed to resemble reality. Beyond improving access to language, AAC also
contributes to users’ overall well-being. A recent study conducted by the Swedish research firm Augur
and commissioned by Tobii Dynavox, "Exploring the benefits of assistive communication"
(https://safecaretechnologies.com/wp-content/uploads/2024/06/AAC_Health_Economic_Study.pdf),
found that high-tech AAC doubles the quality of life of its users, whether on a physical level
(communicating health issues), a social level (building relationships), or a psychological level (expressing
their emotions and personality). Despite these strengths, numerous environmental and financial barriers
remain, such as the limited time caregivers have to learn and teach the use of an AAC tool. To bridge
the gap between speaking individuals with no prior knowledge of pictograms and AAC users, tools that
automatically translate speech or text into pictograms are essential. This year’s edition of the ToPicto
task seeks to address this need by fostering the development of robust methods to generate accurate
pictogram sequences, thereby helping reduce communication barriers in multiple domains. The datasets
and the scripts to evaluate the submissions are available in our official Hugging Face repository
(https://huggingface.co/ToPicto).
      </p>
      <p>This paper presents an overview of the ImageCLEFtoPicto task, and is organized as follows. We first
introduce the two sub-tasks in Section 2, Text-to-Picto and Speech-to-Picto. In Section 3, we explain
the creation of the multimodal dataset. The evaluation methodology is presented in Section 4. Section 5
describes the participant results with a discussion, before concluding in Section 6 with some insights
into future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task description</title>
      <p>The ImageCLEFtoPicto 2025 task consists of two sub-tasks: Text-to-Picto and Speech-to-Picto.
Participants were allowed to submit to one or both sub-tasks, with a maximum of 10 submissions in
total.</p>
      <sec id="sec-2-1">
        <title>2.1. Sub-task 1: From Text to Pictogram Sequence</title>
        <p>The Text-to-Picto sub-task focuses on the automatic generation of a corresponding sequence of
pictogram terms from a French text. This challenge can be viewed as a translation problem, where the
source language is French, and the target language is a sequence of French pictogram terms, each linked
to an ARASAAC pictogram, as illustrated in Figure 1. ARASAAC is an online resource of over 25,000
pictograms, released under a Creative Commons license (BY-NC-SA) and funded by the Department of
Culture, Sports, and Education of the Government of Aragon (Spain).</p>
        <p>[Figure 1. Text-to-Picto example: the text « ça fait beaucoup de monde » aligned with its sequence
of pictogram terms, including « faire » and « foule ».]</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sub-task 2: From Speech to Pictogram Sequence</title>
        <p>The Speech-to-Picto sub-task focuses on two modalities: speech and pictograms. Unlike traditional
spoken language translation systems that rely on transcribed text, this approach aims to directly map a
speech input to pictogram concepts. An example of this task is illustrated in Figure 2.</p>
        <p>[Figure 2. Speech-to-Picto example: the audio segment cefc-cfpb-1200-2-1326_2.wav aligned with
the pictogram terms « celle-là », « faire », « foule ».]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <sec id="sec-3-1">
        <title>3.1. Source Corpora</title>
        <p>
          The benchmarking data are curated from three corpora containing aligned speech, text, and pictogram
sequences: Propicto-commonvoice, Propicto-orféo, and Propicto-eval [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
          ].
        </p>
        <p>
          Propicto-commonvoice is built from the French portion of the CommonVoice version 15 corpus [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
It includes 967 hours of recordings of read speech from 17,911 unique speakers. The method described
in Macaire et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was applied to generate a corresponding pictogram translation for each audio
segment in the form of a token sequence.
        </p>
        <p>
Propicto-orféo is a corpus of spontaneous speech derived from the Corpus d’Étude pour le Français
Contemporain (CEFC) [8], which gathers 12 source corpora featuring diverse speech situations
(dialogues, meetings, etc.) across various domains. Propicto-orféo comprises 290,036 audio segments
totaling 233 hours. The same method [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was applied to produce pictogram translations from the
transcriptions of these audio segments.
        </p>
        <p>Propicto-eval is an evaluation corpus specifically designed to assess the performance of pictogram
translation models in a controlled scenario. This multi-speaker read speech corpus incorporates textual
data from children’s stories, everyday life situations, and sentences taken from the medical domain.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Format</title>
        <p>The data for both sub-tasks is contained in a JSON file, which includes the following fields:</p>
        <p>• id: unique identifier of each utterance (e.g., cefc-tcof-Acc_del_07-1)
• src: for Speech-to-Picto, the audio file linked to the ID in .wav format (e.g., cefc-tcof-Acc_del_07-1.wav);
for Text-to-Picto, the text from the oral transcription (e.g., « tu peux pas savoir »)
• tgt: the target of the utterance, a sequence of pictogram terms (tokens) (e.g., « toi pouvoir savoir non »)
• pictos: a list of pictogram identifiers linked to each pictogram term, of the same length as the target
output (e.g., [6625, 35949, 16885, 5526])</p>
        <p>The participant’s goal is to provide a hypothesis (hyp) equivalent to the target (tgt). To visualize the
target as a sequence of pictogram images, we developed an online platform, Visualize-Pictograms
(https://huggingface.co/spaces/ToPicto/Visualize-Pictograms). If a pictogram token does not exist, the
corresponding pictogram image is not displayed.</p>
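        <p>As an illustration, a minimal Python sketch for reading this JSON file and producing a submission
is given below. The file names, the top-level layout (a list of utterance objects), and the submission
format are assumptions made for illustration only, not the official specification.</p>
        <preformat>
# Minimal sketch: read the toPicto JSON data and write hypotheses.
# Assumptions: file names and the list-of-objects layout are hypothetical.
import json

with open("topicto_dev.json", encoding="utf-8") as f:    # hypothetical file name
    utterances = json.load(f)                            # assumed: list of dicts

hypotheses = []
for utt in utterances:
    source = utt["src"]     # .wav file (Speech-to-Picto) or transcription (Text-to-Picto)
    target = utt["tgt"]     # reference pictogram terms, e.g. "toi pouvoir savoir non"
    pictos = utt["pictos"]  # ARASAAC identifiers, one per pictogram term
    # A real system would translate `source`; we copy the reference as a placeholder.
    hypotheses.append({"id": utt["id"], "hyp": target})

with open("run1.json", "w", encoding="utf-8") as f:      # hypothetical submission name
    json.dump(hypotheses, f, ensure_ascii=False, indent=2)
        </preformat>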
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Dataset Statistics</title>
        <p>The dataset statistics of the Text-to-Picto and Speech-to-Picto development and test sets are described
in Table 1, along with the distribution of data sources. We randomly selected 10,000 utterances from
Propicto-orféo and Propicto-commonvoice, as well as the medical utterances from the Propicto-eval
subset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Methodology</title>
      <p>The evaluation is conducted using sacreBLEU [9], METEOR [10], and the Picto-term Error Rate
(PictoER) [11, 12]. For all three metrics, the evaluation compares the hypothesis (hyp) provided by the
participant with the target (tgt), i.e., the sequence of pictogram terms. We detail each metric and its
computation below.</p>
      <p>SacreBLEU measures the number of common n-grams (the percentage of overlap) between the
translation hypothesis (hyp) and the reference translation (tgt). In comparison with BLEU [13],
sacreBLEU standardizes the tokenization step, so that scores obtained with different tokenizations of
the same text remain comparable. The score corresponds to:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \quad (1)

where:
• \mathrm{BP} (Brevity Penalty) is a length penalty that discourages translations that are too short,
• p_n is the modified n-gram precision,
• w_n is the weight associated with n-grams of length n,
• N is the maximum size of n-grams.</p>
      <p>The modified n-gram precision is the division of the sum of correct n-grams by their total number
in the corpus:

p_n = \frac{\sum_{\text{n-gram}} \min(\mathrm{Count}_{\mathrm{hyp}}, \mathrm{Count}_{\mathrm{ref}})}{\sum_{\text{n-gram}} \mathrm{Count}_{\mathrm{hyp}}} \quad (2)

where:
• \mathrm{Count}_{\mathrm{hyp}} is the number of occurrences of a given n-gram in the translation hypothesis,
• \mathrm{Count}_{\mathrm{ref}} is the number of occurrences of the same n-gram in the reference translation.</p>
      <p>The length penalty \mathrm{BP}, with r the total number of words in the reference and c the number
of words in the translation hypothesis, corresponds to:

\mathrm{BP} = 1 \text{ if } c > r, \qquad \mathrm{BP} = \exp\left(1 - \frac{r}{c}\right) \text{ if } c \leq r \quad (3)

The score therefore fluctuates between 0 and 1 (sacreBLEU reports it on a 0-100 scale), the highest value
corresponding to an equivalent translation between hypothesis and reference.</p>
      <p>METEOR performs an alignment between the translation hypothesis and the reference translations,
going beyond simple word matching. It considers not only direct matches but also matches based on
synonyms and stems (words that share the same root). The evaluation provides more granularity in
assessing the performance of an overall system, as it captures additional semantic information that is
not encoded within the sacreBLEU score.</p>
      <p>METEOR combines the precision and recall of unigrams, as well as a measure that evaluates the word
order of the translation compared to the reference. Specifically, the metric combines the unigram
precision P and recall R through a harmonic mean:

F_{\mathrm{mean}} = \frac{10 \cdot P \cdot R}{R + 9P} \quad (4)

A penalty for a given alignment is added, with \mathrm{chunks} the number of contiguous aligned word
segments and \mathrm{unigrams} the total number of aligned unigrams:

\mathrm{Penalty} = 0.5 \times \left( \frac{\mathrm{chunks}}{\mathrm{unigrams}} \right)^3 \quad (5)

METEOR penalizes alignments where the word order differs or where the aligned words are very far
apart in the sentence. This score favors translations that approximate the word order of the reference
sentence. The final score is given by:

\mathrm{METEOR} = F_{\mathrm{mean}} \times (1 - \mathrm{Penalty}) \quad (6)

The score is a measure between 0 and 1; the higher, the better.</p>
      <p>PictoER is a metric derived from the word error rate (WER). Instead of counting errors at the word
level, it counts errors over tokens, each linked to an ARASAAC pictogram. The score is defined as
follows:

\mathrm{PictoER} = \frac{S + D + I}{N} \quad (7)

with S the substitutions, I the insertions, D the deletions, and N the total number of tokens in the
reference. For example, if the reference « toi pouvoir savoir non » is predicted as « toi savoir non », one
deletion over four reference tokens yields a PictoER of 0.25. A lower PictoER indicates better model
performance.</p>
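      <p>As an illustration, the following Python sketch scores a hypothesis against a target with sacreBLEU
and with a PictoER computed by token-level edit distance. It assumes the sacrebleu package is installed;
the picto_er function is our own illustrative reimplementation of Equation (7), not the official evaluation
script.</p>
      <preformat>
# Illustrative scoring sketch; assumes `pip install sacrebleu`.
# picto_er is an edit-distance reimplementation of Equation (7), for illustration.
import sacrebleu

def picto_er(reference: str, hypothesis: str) -> float:
    """PictoER = (S + D + I) / N over whitespace-separated pictogram terms."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance between token sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + int(ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

refs = ["toi pouvoir savoir non"]
hyps = ["toi savoir non"]

bleu = sacrebleu.corpus_bleu(hyps, [refs])   # reported on the 0-100 scale
print(f"sacreBLEU: {bleu.score:.2f}")
print(f"PictoER:   {picto_er(refs[0], hyps[0]):.2f}")  # 1 deletion / 4 tokens = 0.25
      </preformat>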
    </sec>
    <sec id="sec-5">
      <title>5. Participant Results and Discussion</title>
      <p>The registration and submission of participants’ runs were handled by the challenge platform
AI4MediaBench (https://ai4media-bench.aimultimedialab.ro/), developed by AIMultimediaLab
(https://aimultimedialab.ro/). In this year’s edition, 36 teams registered across both sub-tasks. For the
Text-to-Picto sub-task, 2 teams submitted their work, for a total of 4 runs, whereas the Speech-to-Picto
sub-task received only 2 submissions from a single team. Table 3 presents the overall results of both
sub-tasks, sorted by the best sacreBLEU score.</p>
      <p>The submissions presented in Table 3 come from two teams: TEAM1, composed of majahj and
indira, and TEAM2, composed of sudharshan07. Despite interesting results, TEAM2’s paper was rejected
due to its lack of references and of information on the architecture.</p>
      <p>TEAM1 presented a study addressing both the Text-to-Picto and Speech-to-Picto sub-tasks. The
authors built upon one of the submissions from the previous edition of the challenge, presented in
Koushik et al. [14]. TEAM1 fine-tuned a T5 encoder-decoder architecture [15] and extended the
experiments proposed by Koushik et al. [14] to different T5 sizes (base, small, large) and a larger number
of training epochs. For the Text-to-Picto sub-task, the T5-large model achieved the best performance
(ranking first in Table 3). For the Speech-to-Picto sub-task, a cascaded architecture was used, combining a
Whisper-based ASR model [16] with the fine-tuned T5-large model. The larger Whisper model (ranking
first in Table 3) performs better than the smaller one (ranking second).</p>
      <p>Both results highlight the impact of the size of pre-trained models on downstream tasks.</p>
      <p>The authors carefully analyzed the model outputs and identified challenges faced when translating
to pictograms, such as handling proper nouns, tense, and numeric expressions. This work opens
perspectives for next year’s challenge.</p>
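      <p>To make the cascaded Speech-to-Picto setup concrete, here is a minimal sketch chaining an ASR
model with a text-to-pictogram model via Hugging Face pipelines. The checkpoint names, in particular
the fine-tuned "topicto-t5-large" identifier, and the generation parameters are placeholders, not TEAM1’s
released artifacts.</p>
      <preformat>
# Sketch of a cascaded Speech-to-Picto pipeline in the spirit of TEAM1's system:
# Whisper ASR followed by a T5 model fine-tuned for Text-to-Picto.
# Assumption: "topicto-t5-large" is a hypothetical checkpoint name.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
to_picto = pipeline("text2text-generation", model="topicto-t5-large")  # hypothetical

def speech_to_picto(wav_path: str) -> str:
    text = asr(wav_path)["text"]          # step 1: speech to French text
    out = to_picto(text, max_length=64)   # step 2: text to pictogram terms
    return out[0]["generated_text"]

# Example (audio file name taken from the dataset):
# print(speech_to_picto("cefc-cfpb-1200-2-1326_2.wav"))
      </preformat>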
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Perspectives</title>
      <p>The second edition of the ImageCLEFtoPicto task continued last year’s challenge by offering an
expanded version of the dataset, featuring a variety of acoustic scenarios and domains, including medical
and everyday-life contexts. The challenge included two sub-tasks: Text-to-Picto and Speech-to-Picto.
Despite a high number of registrations, actual participation and submissions were limited. Nevertheless,
the best submission established a solid baseline and methodology for addressing the task, along with an
insightful analysis of the generated pictogram sequences. In the future, we plan to explore new directions,
such as providing an English version of the dataset and introducing a next-pictogram prediction
sub-task.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This project was funded by the Agence Nationale de la Recherche (ANR) through the project
PANTAGRUEL (ANR-23-IAS1-0001). This work is also carried out as part of the AugmentIA Chair, led
by Didier Schwab and hosted by the Grenoble INP Foundation, with sponsorship from the Artelia
Group. The chair also receives support from the French government, managed by the National Research
Agency (ANR), under the France 2030 program with reference ANR-23-IACL-0006 (MIAI Cluster). The
pictographic symbols used are the property of the Government of Aragón and have been created by
Sergio Palao for ARASAAC (http://www.arasaac.org), which distributes them under the Creative
Commons License BY-NC-SA.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o-mini for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Romski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Sevcik</surname>
          </string-name>
          ,
          <article-title>Augmentative communication and early intervention: Myths and realities</article-title>
          ,
          <source>Infants &amp; Young Children</source>
          <volume>18</volume>
          (
          <year>2005</year>
          )
          <fpage>174</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cummings</surname>
          </string-name>
          ,
          <article-title>Communication disorders: A complex population in healthcare</article-title>
          ,
          <source>Language and Health</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>12</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Beukelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mirenda</surname>
          </string-name>
          , et al.,
          <source>Augmentative and alternative communication</source>
          , Paul H. Brookes Baltimore,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arrigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lemaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <article-title>A multimodal French corpus of aligned speech, text, and pictogram sequences for speech-to-pictogram machine translation</article-title>
, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>839</fpage>
          -
          <lpage>849</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.76/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
<string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <article-title>Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes</article-title>
          , in: M. Balaguer, N. Bendahman, L.-M. Ho-dac, J. Mauclair, J. G. Moreno, J. Pinquier (Eds.),
          <source>Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position</source>
          , ATALA and AFPC, Toulouse, France,
          <year>2024</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>35</lpage>
          . URL: https://aclanthology.org/2024.jeptalnrecital-taln.2/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <article-title>Towards speech-to-pictograms translation</article-title>
          ,
in:
          <source>Interspeech 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>857</fpage>
          -
          <lpage>861</lpage>
          . doi:10.21437/Interspeech.2024-490.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohler</surname>
          </string-name>
          , J. Meyer, M. Henretty,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tyers</surname>
          </string-name>
, G. Weber,
          <article-title>Common voice: A massively-multilingual speech corpus</article-title>
          , in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>4218</fpage>
          -
          <lpage>4222</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.520/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J.-M. Debaisieux, C. Benzitoun, H.-J. Deulofeu, Le projet ORFEO: Un corpus d’études pour le français contemporain, Corpus 15 (2016) 91–114. URL: https://hal.science/hal-01449600. doi:10.4000/corpus.2936.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Post, A call for clarity in reporting BLEU scores, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 186–191. URL: https://www.aclweb.org/anthology/W18-6319.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://www.aclweb.org/anthology/W05-0909.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Woodard, J. Nelson, An information theoretic measure of speech recognition performance, in: Workshop on Standardisation for Speech I/O Technology, Naval Air Development Center, Warminster, PA, 1982.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. C. Morris, V. Maier, P. Green, From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition, in: Interspeech 2004, 2004, pp. 2765–2768. doi:10.21437/Interspeech.2004-668.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Koushik, J. Morrison, P. Mirunalini, et al., A transformer based approach for text-to-picto generation (2024).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>