<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Veras Audire et Reddere Voces: A Corpus of Prosodically-Correct Latin Poetic Audio from Large-Language-Model TTS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Ciletti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Foggia, Department of Humanities</institution>
          ,
          <addr-line>71121 Foggia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Latin verse moves to the pulse of vowel quantity and stress, yet students and researchers still struggle to hear that rhythm because high-quality recordings are rare and expensive. General-purpose text-to-speech models, rich in English and Romance data, flatten the long-short alternation that defines classical metres. This paper introduces a fully open, expertly validated corpus that demonstrates how far prompt engineering alone can push contemporary large language models toward metrically faithful Latin speech. Drawing on Pedecerto's XML scansions, two emblematic passages were selected: the first 100 verses of Vergil's Aeneid (pure hexameter) and the opening elegiac epistula of Ovid's Heroides (elegiac couplets). Each line was syllabified, marked for ictus, elided where compulsory, and orthographically nudged into forms modern TTS engines pronounce reliably. The pre-processed verse, printed verbatim inside a concise system prompt, was rendered ten times by gpt-4o-mini-tts; human experts in Latin phonology then audited every take for segmental accuracy, stress placement, elision, and pacing. The accepted files were loudness-normalised and concatenated with uniform verse pauses, yielding roughly 24 minutes of continuous yet metrically autonomous recitation. The release bundles (1) the original Pedecerto XML, (2) classical and TTS-ready transcriptions, and (3) per-line and stitched mp3 audio, all under CC-BY 4.0 and archived on Zenodo. Beyond serving as classroom audio or accessibility material, the aligned data provide a test-bed for prosody-aware speech synthesis, few-shot fine-tuning, and quantitative metrics research. An analysis of error patterns, such as cross-lingual accent drift, cluster mispronunciation, and length-stress trade-offs, offers concrete heuristics for steering future models without costly retraining. The workflow, implemented entirely with easily accessible APIs and lightweight scripts, is readily transferable to Ancient Greek, Classical Arabic, or any verse tradition equipped with digital scansion. In short: the dead can be made to speak, rhythmically, reproducibly, and in the open.</p>
      </abstract>
      <kwd-group>
        <kwd>Latin</kwd>
        <kwd>prosody</kwd>
        <kwd>dataset</kwd>
        <kwd>poetry</kwd>
        <kwd>text-to-speech</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Latin verse obeys laws of rhythm that differ markedly from those governing most modern poetry. Whereas English metres depend on patterns of stress, classical Latin organises lines around the alternation of long and short syllables [1]. The term prosody itself stems from Greek prosoidia, which first referred to a tune sung to music and later to the pronunciation of individual syllables.</p>
      <p>Convincing spoken performances of Latin poetry that students can consult remain remarkably scarce. Grammars and handbooks describe reconstructed pronunciations with care, yet few recordings reproduce the quantitative rhythm that defines classical metres. Recent advances in neural text-to-speech have brought modern languages to broadcast quality, but Latin has been left on the margins: general-purpose models have little or no Latin training data and therefore transfer English or Romance stress patterns almost wholesale.</p>
      <p>The previous generation of text-to-speech (TTS) systems, exemplified by architectures like Tacotron 2 [2], typically followed a two-stage process, converting text to an intermediate representation (a mel spectrogram) before a separate vocoder synthesized the audio. While capable of high-fidelity output, these models required extensive, language-specific training data, leaving low-resource languages like Latin underserved. The emergence of large language models (LLMs) with direct audio-generation capabilities, however, offers a new paradigm. Unlike earlier architectures that required costly fine-tuning, these models can be steered through carefully crafted prompts. Models such as gpt-4o-mini-tts [3] employ end-to-end architectures that learn from vast, multilingual datasets and can be conditioned directly through in-context learning. This prompt-engineering approach allows fine-grained control over pronunciation, pacing, and emphasis without retraining the model, opening a promising avenue for generating prosodically correct speech in low-resource languages.</p>
      <p>To demonstrate the viability of this approach, this paper introduces Veras Audire et Reddere Voces, a fully open and expertly validated corpus of prosodically accurate Latin poetic audio. The corpus contains 216 lines of verse, namely the first 100 hexameters of Vergil’s Aeneid and the opening 116 lines (58 elegiac couplets) of Ovid’s Heroides, providing a total of nearly 24 minutes of metrically autonomous recitation. The primary contributions of this work are threefold: the release of a high-quality, aligned dataset of Latin poetic audio, TTS-ready transcriptions, and metrical annotations under an open license; the presentation of a reproducible workflow that shows how prompt engineering alone can steer a general-purpose LLM toward metrically faithful speech, offering a low-cost alternative to model retraining; and the production of a valuable resource for pedagogy and a test-bed for future research in controllable speech synthesis for historical languages.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. Corresponding author: Michele Ciletti, michele.ciletti@gmail.com, ORCID 0009-0004-3829-8866, https://www.researchgate.net/profile/Michele-Ciletti. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-theory">
      <title>2. Theoretical Background</title>
      <sec id="sec-1-1">
        <title>2.1. Latin Prosody</title>
        <p>Classical verse draws its rhythm from vowel quantity and from the consonantal context that can lengthen a syllable [4]. Six-foot dactylic hexameter and the coupled hexameter-pentameter of the elegiac distich are the metres most familiar to students. Rules such as muta cum liquida allow optional resolution; pervasive elision removes a vowel at word boundaries; and the location of the caesura shapes phrasing. Because no recordings survive from antiquity, quantity must be inferred from orthography, comparative Romance evidence, metrical practice, and statements by ancient grammarians. Absolute certainty is impossible, which explains why modern classrooms often substitute a stress-based reading, even though Latin stress itself follows a moraic algorithm. Any speech-generation system therefore has to decide which principle it will privilege: quantity, stress, or a compromise.</p>
      </sec>
      <sec id="sec-theory-3">
        <title>2.3. Prompt-based Prosody in Large Language Models</title>
        <p>Large language models that decode speech directly from text have begun to internalise prosodic patterns. Architectures such as VALL-E and ZM-Text-TTS train on vast multilingual collections; their outputs preserve speaker identity and sentence melody, yet metre remains hard to control [13]. One pragmatic strategy is to preprocess the poem itself: mark ictic syllables with capital letters and diacritics, resolve compulsory elisions, and substitute unfamiliar graphemes with spellings that the model already pronounces reliably (chiefly English, with Italian conventions for a selection of elements). During synthesis these visible cues bias the duration and stress predictors without any retraining of the model. The approach follows PRESENT’s [13] principle of steering prosody through input representation rather than through explicit feature vectors.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.4. Pedagogical and Inclusive Perspectives</title>
        <p>High-quality recordings produced by trained classicists
are time-consuming and costly. Automatic generation,
once trustworthy, would make spoken Latin more
accessible in schools, in digital humanities research, and
for visually impaired learners. Surveys in the field call
for FAIR corpora that combine text, audio, and metadata
[14]. By publishing aligned verse–audio pairs, the present
work answers that demand in part. Stress-centred
recitation also lowers the entry barrier for students whose
native languages do not contrast vowel length, while still
preserving a perceptible rhythmic pulse consistent with
traditional metrics.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>2.2. Digital Latin Resources</title>
        <p>Over roughly thirty years Latin has acquired a considerable amount of Natural Language Processing resources [5] [6]. Tokenisers, lemmatisers, and treebanks are distributed through CLTK [7], Stanza [8], and Universal Dependencies [9]. Prosodic annotation is rarer. Pedecerto [<xref ref-type="bibr" rid="ref1">10</xref>] marks quantity, feet, and caesurae for more than 240,000 dactylic lines; its XML export underlies the corpus described here. Other scanners cover particular metres: for instance the CLTK modules for the hexameter and the hendecasyllable, Anceps for Senecan trimeter [<xref ref-type="bibr" rid="ref15">11</xref>], or Loquax for syllabification and IPA transliteration [12].</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.1. Source Texts and Metrical Gold Standard</title>
        <p>The audio that accompanies the dataset was derived from two chosen passages: the first hundred hexameters of Vergil’s Aeneid and the opening elegiac epistle of Ovid’s Heroides. These segments supply, on the one hand, a pure run of dactylic hexameter and, on the other, the alternation of hexameter and pentameter typical of the elegiac couplet. Machine-readable scansion was taken from the XML export of the Pedecerto project [<xref ref-type="bibr" rid="ref1">10</xref>]. Each &lt;line&gt; element preserves the metrical category, the canonical foot pattern and, for every word, a sy attribute that enumerates syllables while marking the ictus with an upper-case character. The import script retained verse boundaries, foot sequence, ictic flags, elision hints and word-boundary information; all other metadata were discarded. A fragment of the XML illustrates the structure:</p>
        <p>&lt;line name="1" meter="H" pattern="DDSS"&gt;
&lt;word sy="1A1b" wb="CF"&gt;Arma&lt;/word&gt;
...
&lt;word sy="2c3A" wb="CM"&gt;cano,&lt;/word&gt;
&lt;word sy="3T4A" wb="CM"&gt;Troiae&lt;/word&gt;
...
&lt;/line&gt;</p>
        <p>Prompt engineering began with an extensive style sheet, eventually distilled to three imperatives: speak slowly, articulate every syllable, and obey the marked stresses. Re-printing the fully processed verse inside the system prompt, exactly as it should be spoken, noticeably improved alignment between text and realisation.</p>
        <p>Because stochastic sampling introduces variation, ten readings were requested for every line. The final system prompt was:</p>
        <p>This is a Latin poetical verse. Pronounce it rhythmically, slowly and with emphasis, articulating each syllable and correctly stressing them. Pronounce it like this: [pre-processed verse]</p>
      </sec>
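      <p>To illustrate the sy convention shown in the fragment above, the following sketch (helper names hypothetical, not part of the released scripts) decodes a sy attribute into per-syllable codes and upper-cases the ictic syllables of a pre-syllabified word:</p>

```python
import re

def decode_sy(sy):
    """Split a Pedecerto sy attribute into per-syllable codes.

    Each syllable is encoded as a foot number plus a position letter;
    an upper-case letter marks the ictic syllable (e.g. sy="1A1b" for
    "Arma": "1A" is ictic, "1b" is not).
    """
    codes = re.findall(r"\d[A-Za-z]", sy)
    return [(code, code[1].isupper()) for code in codes]

def mark_ictus(syllables, sy):
    """Upper-case the ictic syllables of a word, as in Section 3.2.

    The syllable split itself is assumed to come from the CLTK
    syllabifier; the grave accent on the ictic vowel is omitted here.
    """
    marked = [
        syl.upper() if ictic else syl
        for syl, (code, ictic) in zip(syllables, decode_sy(sy))
    ]
    return "".join(marked)
```

      <p>For the first word of the Aeneid, mark_ictus(["Ar", "ma"], "1A1b") yields "ARma", the capitalised form that the pre-processing stage feeds to the speech engine.</p>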
      <sec id="sec-2-3">
        <title>3.2. Pre-processing and Orthographic Adaptation</title>
        <p>Each verse underwent an iterative pipeline before it was ever passed to the speech engine. Syllabification relied on the rule-based module distributed with the Classical Language Toolkit [7]; diphthongs and enclitics are already covered in that implementation. The vowel of every ictic syllable received a grave accent and the complete syllable was converted to capitals. Obligatory elisions were realised as graphic mergers (quoque et therefore became quoquet) according to the Pedecerto wb flag. A comma was inserted where the metre demands a caesura unless the manuscript already offered punctuation at that position. Early trials separated syllables with hyphens, but the additional markers produced no audible advantage and the idea was dropped.</p>
        <p>A second pass substituted graphemes that tend to mislead English-trained acoustic models. Before front vowels 〈c〉 was rewritten as 〈k〉, 〈qu〉 became 〈kw〉, the diphthongs 〈ae〉 and 〈oe〉 were rendered 〈ai〉 and 〈oi〉, and palatal 〈g〉 was expanded to 〈gh〉. The resulting string approximates a classical pronunciation yet stays within the alphabetic habits of contemporary TTS systems.</p>
        <p>To keep prosodic control local, each line was synthesised in isolation; the rhythm inside a verse must be coherent, whereas a small pause between verses is both acceptable and expected in performance. The addition of the word "slowly" to the prompt, explicitly telling the model to recite the verses at a relaxed pace, proved particularly useful in ensuring that each syllable was correctly articulated.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Human Validation and Error Annotation</title>
        <p>Specialists in Latin phonology audited every recording. Errors were marked on spans and classified as segmental substitution or ictus misplacement. Feedback after each experimental round guided small adjustments to the pre-processing routine and to the wording of the prompt. Acceptance was granted when a line contained no error of stress or elision and no more than minor segmental deviations; under this criterion at least one satisfactory rendition was eventually found for each of the autonomous lines.</p>
        <sec id="sec-2-4-1">
          <title>3.5. Mastering and Packaging</title>
          <p>For every verse the reviewers selected the highest-scoring file. The selected waveforms were loudness-normalised and concatenated with an 800 ms silence, yielding two continuous recitations that preserve per-line rhythmic autonomy. Alongside the audio, the repository contains the textual and metrical companion files listed below.</p>
        </sec>
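          <p>The concatenation step can be sketched with the standard-library wave module. This is illustrative only: the released pipeline works on mp3 files and also loudness-normalises each take, both of which are omitted here, and the function name is hypothetical.</p>

```python
import wave

SILENCE_MS = 800  # inter-verse pause used in the corpus

def stitch(in_paths, out_path, silence_ms=SILENCE_MS):
    """Concatenate mono takes into one recitation with fixed pauses.

    Zero bytes stand in for silence, which is exact for 16-bit PCM;
    all inputs are assumed to share the format of the first file.
    """
    with wave.open(in_paths[0], "rb") as first:
        params = first.getparams()
    gap_frames = int(params.framerate * silence_ms / 1000)
    gap = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(in_paths):
            if i:  # insert the verse pause between, not before, takes
                out.writeframes(gap)
            with wave.open(path, "rb") as take:
                out.writeframes(take.readframes(take.getnframes()))
```

          <p>mp3 encoding and decoding would be delegated to an external tool such as ffmpeg, outside the scope of this sketch.</p>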
      </sec>
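      <p>The grapheme substitutions of Section 3.2 amount to a handful of ordered rewrite rules. A simplified sketch follows, assuming lower-case input; the published scripts also handle the capitalised ictic syllables and a few Italian-style conventions not reproduced here.</p>

```python
import re

def adapt_orthography(word):
    """Rewrite classical spelling into TTS-friendly graphemes."""
    word = re.sub(r"qu", "kw", word)          # qu -> kw
    word = word.replace("ae", "ai")           # diphthong ae -> ai
    word = word.replace("oe", "oi")           # diphthong oe -> oi
    word = re.sub(r"c(?=[eiy])", "k", word)   # hard c before front vowels
    word = re.sub(r"g(?=[eiy])", "gh", word)  # hard g before front vowels
    return word
```

      <p>Rule order matters: the diphthong rewrites run before the front-vowel checks so that, for example, a c preceding 〈ae〉 is judged against the adapted 〈ai〉 spelling.</p>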
      <sec id="sec-2-5">
        <title>3.3. Speech Generation and Iterative Refinement</title>
        <p>Two technological families were explored. Conventional sequence-to-sequence TTS engines, such as Tacotron 2 [2], Kokoro [15], and OpenAI’s tts-1-hd [<xref ref-type="bibr" rid="ref2">16</xref>], offer little room for instruction: stress was frequently misplaced and vowel length erratic, particularly when the Latin token resembled a common English form. Multimodal large language models with an integrated audio decoder fared better because the system prompt can be used to impose a prosodic policy. Several models in the GPT-4o and Gemini lines were evaluated; gpt-4o-mini-tts [3] delivered the most consistent timing and segmental clarity.</p>
        <p>The repository bundles, alongside the audio:
• the original Pedecerto XML fragments,
• the full text of the chosen passages,
• the pre-processed lines that were given as input to the TTS model.</p>
        <p>All artefacts are released under an open licence and have been deposited on Zenodo together with a DOI, ensuring long-term accessibility and citability [17].</p>
        <p>For a more thorough discussion of the methodological choices, from model selection to human evaluation, the reader is referred to a previous publication [18].</p>
      </sec>
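      <p>A minimal sketch of the generation loop follows, assuming the openai Python package and an API key in the environment; the voice name and the response-handling call are assumptions to be checked against the provider documentation, and only the prompt builder reproduces wording from the paper.</p>

```python
def build_prompt(preprocessed_verse):
    """Assemble the final system prompt quoted in Section 3.3."""
    return (
        "This is a Latin poetical verse. Pronounce it rhythmically, "
        "slowly and with emphasis, articulating each syllable and "
        "correctly stressing them. Pronounce it like this: "
        + preprocessed_verse
    )

def synthesise_takes(verse, out_stem, n_takes=10):
    """Request ten stochastic readings of one pre-processed line.

    Network call sketched from the current OpenAI speech endpoint;
    the voice choice is an assumption, not the paper's setting.
    """
    from openai import OpenAI  # requires the openai package
    client = OpenAI()
    for i in range(n_takes):
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="alloy",
            input=verse,
            instructions=build_prompt(verse),
        )
        response.write_to_file(f"{out_stem}_take{i}.mp3")
```

      <p>Printing the verse verbatim inside the instructions, exactly as it should be spoken, is the step the paper identifies as most responsible for text-realisation alignment.</p>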
    </sec>
    <sec id="sec-4">
      <title>4. Results: Description of the Released Corpus</title>
      <p>The outcome of the workflow is an aligned collection of Latin poetic audio accompanied by the textual and metrical information required for downstream work in speech technology, pedagogy and quantitative metrics. The repository is organised around three sections: for each poem, a text file contains its original lines, another one has the pre-processed text that was fed to the TTS model, an XML file contains the Pedecerto metrical annotations, and a set of mp3 files represents the audio output, stored both individually and as groups. Table 1 gives an overview of the material.</p>
      <sec id="sec-4-1">
        <title>4.1. Audio Layer</title>
        <p>For each verse ten independent readings were decoded. After expert screening one rendition was retained as the canonical file. Recordings are stored as mp3 files. Silences at verse boundaries have been standardised to 800 ms; no fades or noise reduction were applied, so that the signal keeps its original spectral profile.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Text and Prosodic Annotation</title>
        <p>The reference transcription follows the orthography employed during synthesis (grave accents on ictic vowels, upper-case ictic syllables, adapted spellings for c, qu, g, ae, oe) so that users can reproduce or extend the experiments without reverse engineering. A parallel file restores classical spelling for readers who prefer a diplomatic text. Pedecerto syllable scansions, foot divisions and caesura marks are displayed in a separate XML file.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Availability and Licensing</title>
        <p>All components are released under CC-BY 4.0. The Zenodo record bundles the audio, textual content, and annotations [17].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The present corpus was assembled in order to facilitate prosodically faithful speech synthesis, yet the labour invested in its creation has generated several observations that matter beyond the immediate goal of reciting Vergil and Ovid. Three strands of evidence stand out: the behaviour of the language model during synthesis, the practical tricks that secured acceptable output, and the prospective uses of the aligned data in research and teaching.</p>
      <sec id="sec-5-1">
        <title>5.1. Accents, Cross-lingual Interference, and what the Model really “Knows”</title>
        <p>When the decoder was left to its own devices it tended to interpret individual words through the accent template of whichever modern language offered the closest orthographic match. As a consequence, passages dominated by vocabulary shared with present-day Romance received an intonation reminiscent of Italian, while lines rich in loanwords familiar to English appeared with a markedly anglophone timbre. Spanish patterns surfaced less often, but, for example, whenever words ended in -rant the cadence was audibly Iberian. These drifts rarely broke the quantitative rhythm prescribed by hexameter or pentameter; they did, however, blur vowel quality, especially in the mid-front and mid-back zones. The phenomenon confirms that the model encodes a multilingual phonology for conversational prose, and stresses again how little purely Latin data the underlying training set must contain.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Problematic Phonotactics, Orthographic Workarounds and Caesurae</title>
        <p>Several consonant clusters led to systematic errors. Final -nx in coniunx or initial tl- in Tlepolemus were clipped or resolved into epenthetic vowels, presumably because the sequences are rare in the speech material seen during pretraining. In other cases the word was recast according to a high-frequency modern homograph: -um often came out as @m, betraying an English proper-name template.</p>
        <p>Two heuristics mitigated these slips. First, lengthening the grapheme that carries the metrical ictus often persuaded the model to anchor stress correctly; cano became caano in the prompt, which silenced the temptation to favour the English reading. Second, replacing rare digraphs with phonetically transparent ones, as documented in Section 3.2, reduced segmental substitutions by a third. The interventions are admittedly ad hoc, yet they illustrate how a handful of hand-crafted rules can serve where large retraining runs are impossible. While improvements can certainly be made with regard to the overall flow of the generated verses, the insertion of punctuation remains one of the most effective devices: the addition of commas to mark caesurae, in particular, proved useful in ensuring that the synthetic voices followed a precise pattern. Figure 1 shows the Mel spectrograms of the first twenty lines of the opening epistula of the Heroides, generated to visualize the verses’ rhythm. The spectrograms clearly show that black bars (representing pauses) are always present in the middle of the bottom tiles (pentameters), while they are more spaced out in the top tiles (hexameters). This is due to the obligatory caesura that occurs after the third arsis of each pentameter, precisely in the middle of the verse, while hexameters present more varied structures.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. From Corpus to Model Improvement</title>
        <p>Because each verse is aligned with a verified audio file, the collection can function as a fine-tuning set for both autoregressive and non-autoregressive TTS systems. A model trained on metrically correct examples should internalize the prosodic rules more reliably than a general-purpose system forced to extrapolate from modern language data. Even large language models themselves could benefit from exposure to annotated Latin verse during continued pretraining, potentially reducing the need for elaborate preprocessing in subsequent applications. Future work could examine whether freezing the acoustic front-end while training only the variance adapters suffices to introduce length contrast in addition to stress.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Classroom Impact and Reproducibility</title>
        <p>In a teaching environment the recordings serve two complementary roles. Students can listen to a metrically regular rendition before attempting their own, and instructors can use the annotated text as input to alternative voices or slower tempos. Since every script that produced the audio leverages easily accessible APIs, replication is straightforward. Such transparency matters especially for assessment settings where students must know exactly which variant counts as the reference.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Open Data and Transfer to Other Languages</title>
        <p>Latin is only one among many historical or minoritised languages whose sound patterns are absent from mainstream speech technology. The workflow described here, licensed permissively and documented line by line, could be cloned for Ancient Greek, Old Occitan, Classical Arabic, or any verse tradition that already enjoys digital scansion. Riemenschneider and Frank [5] argue that large language models can be useful tools for Classical Philology; releasing small but expertly annotated sets therefore aims to accelerate progress. Open repositories also lower the entry cost for community contributors who may wish to supply alternative voices, extended passages, or corrected quantities.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Towards Length-sensitive Synthesis</title>
        <p>Stress was easier to enforce than absolute vowel length. The present system approximates quantity indirectly through slower pacing on ictic syllables, yet it cannot keep a fixed ratio between heavy and light vowels. Lam et al. [13] report that explicit duration tokens unlock such control in English; integrating a similar mechanism with the current prompt-based strategy is an obvious next step. Ultimately, a synthesis pipeline that differentiates both stress and quantity would let classicists test competing reconstructions of Latin phonology "in silico", converting theoretical statements into audible hypotheses.</p>
      </sec>
    </sec>
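    <p>The vowel-lengthening heuristic of Section 5.2 can be sketched as a one-line rewrite; the function name is hypothetical and the released scripts may apply additional conditions.</p>

```python
import re

def double_ictic_vowel(syllable):
    """Double the first vowel of an ictic syllable (e.g. "ca" -> "caa").

    Applied to the ictic syllable of a word such as cano, this yields
    the "caano" spelling that anchored stress away from the English
    reading.
    """
    return re.sub(r"[aeiouy]", lambda m: m.group(0) * 2, syllable, count=1)
```

    <p>Only the first vowel is doubled so that diphthongs are not stretched into three-vowel runs.</p>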
    <sec id="sec-6">
      <title>5.7. Outlook</title>
      <p>Refining the preprocessing scripts, automating error spotting, and expanding the text base are immediate priorities. Nevertheless, even in its present form the corpus already supports experiments in few-shot prosody transfer, quantitative metrics, and accessible pedagogy. The value of such resources lies also in the demonstration that high-quality data can be gathered with modest equipment, provided that domain knowledge and iterative verification guide the process. Open, reproducible corpora therefore remain the necessary foundation on which future work for classical languages will build.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Limitations</title>
      <p>The corpus was assembled with the intention of demonstrating what present-day language models can already achieve when prompt engineering is combined with careful human verification. Precisely because the focus lay on a proof of concept, several boundaries were accepted that restrict the scope of the resource. The most visible limitation concerns size. Only two passages, albeit canonical ones, entered the pipeline; together they furnish a little under twenty-four minutes of speech. For certain experiments in prosody transfer that duration suffices, yet quantitative studies of acoustic variance or full fine-tuning of an end-to-end synthesiser usually require at least an order of magnitude more material.</p>
      <p>Closely related is the question of stylistic breadth. The Aeneid and the Heroides differ in metre, tone and lexicon, but both belong to the same literary period and represent the same formal register. Comedy, forensic oratory or Late Latin hymns remain untested. Consequently, the substitution rules that helped the model through epic and elegiac vocabulary might fail when confronted with colloquial forms, post-Classical spellings or heavy Greek loanwords.</p>
      <p>Another constraint derives from the decision to rely on a single synthetic voice. Because speaker identity never changes, the corpus cannot inform studies that investigate how metre interacts with timbre or gendered pitch ranges. Similarly, only one variant of reconstructed pronunciation is encoded. Alternative schools that prefer the ecclesiastical pronunciation will find no examples that match their conventions. Validation, indispensable for quality control, introduces its own bias. Judgements about short hesitations or barely perceptible vowel colouring can differ across traditions; a panel of experts drawn from a wider set of institutions might have retained or rejected a slightly different subset of takes.</p>
      <p>Technical choices add further caveats. Recordings were mastered to mp3 for ease of distribution, which entails lossy compression. The prompts are public, but the underlying model weights remain proprietary; should the provider change access policies, identical reproduction could become impossible. Finally, quantity was approximated through slower pacing on metrically strong syllables. The approach yields a rhythm that experienced listeners recognise, yet it falls short of enforcing a fixed heavy-to-light duration ratio, the gold standard in phonetic work on quantitative metres [4].</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>The study has introduced an openly licensed, line-aligned corpus that brings classical Latin verse within reach of modern text-to-speech technology. By combining Pedecerto’s machine-readable scansion with a small set of orthographic substitutions and a concise prosodic prompt, the workflow coerced a general-purpose large language model into producing intelligible, metrically coherent recitations. Systematic human screening guaranteed that the released audio reflects the intended rhythm at a level suitable for both pedagogy and computational research.</p>
      <p>The resulting dataset offers three immediate avenues of use. Teachers can deploy the files as accessible classroom material, learners may rehearse passages while receiving instant acoustic feedback, and speech engineers now possess a clean test bed for experiments in prosody conditioning. Beyond these practical gains, the project demonstrates that domain knowledge, when encoded explicitly in the input, still matters even in an era of ever larger pretrained models. Prompt design, although sometimes dismissed as a stop-gap measure, revealed itself here as a cost-effective alternative to full retraining.</p>
      <p>Future work will have to broaden the metrical and generic range, increase speaker diversity and explore direct duration control. A longer-term ambition is to fold the current resource into a multilingual library of verse corpora, so that comparative metrics across the Indo-European tradition become feasible. The dataset and annotations supplied with this release aim to render such extensions straightforward.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The author thanks the entire Pedecerto team for annotating and sharing their XML scansions. Sincere gratitude is extended to the CLTK community, whose open-source tools simplified syllabification and phonological checks. Colleagues at the University of Foggia donated hours to the auditory review of candidate recordings; their expertise shaped both the preprocessing rules and the acceptance thresholds. Any remaining inaccuracies are the sole responsibility of the author.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Pedecerto. doi:10.57967/hf/4329.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[16] OpenAI, tts-1-hd model documentation, 2025. URL: https://platform.openai.com/docs/models/tts-1-hd, accessed: 2025-06-29.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[1] B. W. Fortson IV, Latin prosody and metrics, in: A companion to the Latin language (2011) 92-104.</mixed-citation>
      </ref>
      <ref id="ref3b">
        <mixed-citation>[17] M. Ciletti, Veras audire et reddere voces: A corpus of prosodically-correct Latin poetic audio from large-language-model tts, 2025. URL: https://doi.org/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu,</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>Natural tts synthesis by conditioning wavenet on 10</article-title>
          .5281/zenodo.15677356. doi:
          <volume>10</volume>
          .5281/zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>mel spectrogram predictions</source>
          ,
          <year>2018</year>
          . URL: https:// 15677356.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          arxiv.org/abs/1712.05884. arXiv:
          <volume>1712</volume>
          .
          <fpage>05884</fpage>
          . [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciletti</surname>
          </string-name>
          , Prompting the muse: Generating [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Goucher</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Perelman, prosodically-correct Latin speech with large lan-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , et al.,
          <source>Gpt-4o system card</source>
          , (Eds.),
          <source>Proceedings of the 63rd Annual</source>
          Meet-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
          <article-title>ing of the Association for Computational Lin</article-title>
          [4]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Allen</surname>
          </string-name>
          , Vox Latina:
          <article-title>a guide to the pronunciation guistics</article-title>
          (Volume
          <volume>4</volume>
          : Student Research Workshop),
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>of classical Latin</source>
          , Cambridge University Press,
          <year>1989</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Vi[5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Riemenschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frank</surname>
          </string-name>
          , Exploring large lan- enna, Austria,
          <year>2025</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>745</lpage>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>guage models for classical philology, arXiv preprint aclanthology</article-title>
          .org/
          <year>2025</year>
          .acl-srw.
          <volume>48</volume>
          /. doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>arXiv:2305.13698</source>
          (
          <year>2023</year>
          ). v1/
          <year>2025</year>
          .acl-srw.
          <volume>48</volume>
          . [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          , Methods in Latin computational
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>linguistics</surname>
          </string-name>
          , volume
          <volume>1</volume>
          ,
          <string-name>
            <surname>Brill</surname>
          </string-name>
          ,
          <year>2013</year>
          . [7]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , P. J.
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Stewart</surname>
          </string-name>
          , T. Cook,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          guages,
          <source>in: Proceedings of the 59th annual meeting</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>the 11th international joint conference on natural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>language processing: System demonstrations</source>
          ,
          <year>2021</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          pp.
          <fpage>20</fpage>
          -
          <lpage>29</lpage>
          . [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. D.</surname>
          </string-name>
          Man-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          arXiv:
          <year>2003</year>
          .
          <volume>07082</volume>
          (
          <year>2020</year>
          ). [9]
          <string-name>
            <surname>M.-C. De Marnefe</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Nivre</surname>
          </string-name>
          , D. Ze-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>guistics 47</source>
          (
          <year>2021</year>
          )
          <fpage>255</fpage>
          -
          <lpage>308</lpage>
          . [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Colombi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mondin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tessarolo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Bacianini,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>rica Latina Digitale</surname>
          </string-name>
          (
          <year>2011</year>
          ). [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fedchin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dexter</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>Journal of Philology</source>
          <volume>143</volume>
          (
          <year>2022</year>
          )
          <fpage>475</fpage>
          -
          <lpage>503</lpage>
          . [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Court</surname>
          </string-name>
          ,
          <article-title>Loquax: Nlp framework for phonology,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          https://github.com/mattlianje/loquax,
          <year>2025</year>
          . GitHub
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          repository. [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sisman</surname>
          </string-name>
          , D. Herre-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>IEEE Signal Processing Letters</surname>
          </string-name>
          (
          <year>2025</year>
          ). [14]
          <string-name>
            <surname>M. De Sisto</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Hernández-Lorenzo</surname>
          </string-name>
          , J. De la Rosa,
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>Digital Scholarship in the Humanities</source>
          <volume>39</volume>
          (
          <year>2024</year>
          )
          <fpage>500</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          521. [15]
          <string-name>
            <surname>Hexgrad</surname>
          </string-name>
          , Kokoro-82m
          <source>(revision d8b4fc7)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>URL: https://huggingface.co/hexgrad/Kokoro-82M.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>