<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Veras Audire et Reddere Voces: A Corpus of Prosodically-Correct Latin Poetic Audio from Large-Language-Model TTS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Ciletti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Foggia, Department of Humanities</institution>
          ,
          <addr-line>71121 Foggia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Latin verse moves to the pulse of vowel quantity and stress, yet students and researchers still struggle to hear that rhythm because high-quality recordings are rare and expensive. General-purpose text-to-speech models, rich in English and Romance data, flatten the long-short alternation that defines classical metres. This paper introduces a fully open, expertly validated corpus that demonstrates how far prompt engineering alone can push contemporary large language models toward metrically faithful Latin speech. Drawing on Pedecerto's XML scansions, two emblematic passages were selected: the first 100 verses of Vergil's Aeneid (pure hexameter) and the opening elegiac epistula of Ovid's Heroides (elegiac couplets). Each line was syllabified, marked for ictus, elided where compulsory, and orthographically nudged into forms modern TTS engines pronounce reliably. The pre-processed verse, printed verbatim inside a concise system prompt, was rendered ten times by gpt-4o-mini-tts; human experts in Latin phonology then audited every take for segmental accuracy, stress placement, elision, and pacing. The accepted files were loudness-normalised and concatenated with uniform verse pauses, yielding roughly 24 minutes of continuous yet metrically autonomous recitation. The release bundles (1) the original Pedecerto XML, (2) classical and TTS-ready transcriptions, and (3) per-line and stitched mp3 audio, all under CC-BY 4.0 and archived on Zenodo. Beyond serving as classroom audio or accessibility material, the aligned data provide a test-bed for prosody-aware speech synthesis, few-shot fine-tuning, and quantitative metrics research. An analysis of error patterns, such as cross-lingual accent drift, cluster mispronunciation, and length-stress trade-offs, offers concrete heuristics for steering future models without costly retraining. The workflow, implemented entirely with easily accessible APIs and lightweight scripts, is readily transferable to Ancient Greek, Classical Arabic, or any verse tradition equipped with digital scansion. In short: the dead can be made to speak, rhythmically, reproducibly, and in the open.</p>
      </abstract>
      <kwd-group>
        <kwd>Latin</kwd>
        <kwd>prosody</kwd>
        <kwd>dataset</kwd>
        <kwd>poetry</kwd>
        <kwd>text-to-speech</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Latin verse obeys laws of rhythm that differ markedly from those governing most modern poetry. Whereas English metres depend on patterns of stress, classical Latin organises lines around the alternation of long and short syllables [1]. The term prosody itself stems from Greek prosoidia, which first referred to a tune sung to music and later to the pronunciation of individual syllables.</p>
      <p>Convincing spoken performances of Latin poetry that students can consult remain remarkably scarce. Grammars and handbooks describe reconstructed pronunciations with care, yet few recordings reproduce the quantitative rhythm that defines classical metres. Recent advances in neural text-to-speech have brought modern languages to broadcast quality, but Latin has been left on the margins: general-purpose models have little or no Latin training data and therefore transfer English or Romance stress patterns almost wholesale.</p>
      <p>The previous generation of text-to-speech (TTS) systems, exemplified by architectures like Tacotron 2 [2], typically followed a two-stage process, converting text to an intermediate representation (a mel spectrogram) before a separate vocoder synthesized the audio. While capable of high-fidelity output, these models required extensive, language-specific training data, leaving low-resource languages like Latin underserved. The emergence of large language models (LLMs) with direct audio-generation capabilities, however, offers a new paradigm. Unlike earlier architectures that required costly fine-tuning, these models can be steered through carefully crafted prompts. Models such as gpt-4o-mini-tts [3] employ end-to-end architectures that learn from vast, multilingual datasets and can be conditioned directly through in-context learning. This prompt-engineering approach allows fine-grained control over pronunciation, pacing, and emphasis without retraining the model, opening a promising avenue for generating prosodically correct speech in low-resource languages.</p>
      <p>To demonstrate the viability of this approach, this paper introduces Veras Audire et Reddere Voces, a fully open and expertly validated corpus of prosodically accurate Latin poetic audio. The corpus contains 216 lines of verse, namely the first 100 hexameters of Vergil’s Aeneid and the opening 116 lines (58 elegiac couplets) of Ovid’s Heroides, providing a total of nearly 24 minutes of metrically autonomous recitation. The primary contributions of this work are threefold: the release of a high-quality, aligned dataset of Latin poetic audio, TTS-ready transcriptions, and metrical annotations under an open license; the presentation of a reproducible workflow that shows how prompt engineering alone can steer a general-purpose LLM toward metrically faithful speech, offering a low-cost alternative to model retraining; and the production of a valuable resource for pedagogy and a test-bed for future research in controllable speech synthesis for historical languages.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. Corresponding author: Michele Ciletti, michele.ciletti@gmail.com, ORCID 0009-0004-3829-8866, https://www.researchgate.net/profile/Michele-Ciletti. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-theory">
      <title>2. Theoretical Background</title>
      <sec id="sec-1-1">
        <title>2.1. Latin Prosody</title>
        <p>Classical verse draws its rhythm from vowel quantity and from the consonantal context that can lengthen a syllable [4]. Six-foot dactylic hexameter and the coupled hexameter-pentameter of the elegiac distich are the metres most familiar to students. Rules such as muta cum liquida allow optional resolution; pervasive elision removes a vowel at word boundaries; and the location of the caesura shapes phrasing. Because no recordings survive from antiquity, quantity must be inferred from orthography, comparative Romance evidence, metrical practice, and statements by ancient grammarians. Absolute certainty is impossible, which explains why modern classrooms often substitute a stress-based reading, even though Latin stress itself follows a moraic algorithm. Any speech-generation system therefore has to decide which principle it will privilege: quantity, stress, or a compromise.</p>
      </sec>
      <sec id="sec-theory-3">
        <title>2.3. Prompt-based Prosody in Large Language Models</title>
        <p>Large language models that decode speech directly from text have begun to internalise prosodic patterns. Architectures such as VALL-E and ZM-Text-TTS train on vast multilingual collections; their outputs preserve speaker identity and sentence melody, yet metre remains hard to control [13]. One pragmatic strategy is to preprocess the poem itself: mark ictic syllables with capital letters and diacritics, resolve compulsory elisions, and substitute unfamiliar graphemes with spellings that the model already pronounces reliably (chiefly English, with Italian conventions for a selection of elements). During synthesis these visible cues bias the duration and stress predictors without any retraining of the model. The approach follows PRESENT’s [13] principle of steering prosody through input representation rather than through explicit feature vectors.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.4. Pedagogical and Inclusive Perspectives</title>
        <p>High-quality recordings produced by trained classicists
are time-consuming and costly. Automatic generation,
once trustworthy, would make spoken Latin more
accessible in schools, in digital humanities research, and
for visually impaired learners. Surveys in the field call
for FAIR corpora that combine text, audio, and metadata
[14]. By publishing aligned verse–audio pairs, the present
work answers that demand in part. Stress-centred
recitation also lowers the entry barrier for students whose
native languages do not contrast vowel length, while still
preserving a perceptible rhythmic pulse consistent with
traditional metrics.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>2.2. Digital Latin Resources</title>
        <p>Over roughly thirty years Latin has acquired a considerable amount of Natural Language Processing resources [5] [6]. Tokenisers, lemmatisers, and treebanks are distributed through CLTK [7], Stanza [8], and Universal Dependencies [9]. Prosodic annotation is rarer. Pedecerto [<xref ref-type="bibr" rid="ref1">10</xref>] marks quantity, feet, and caesurae for more than 240,000 dactylic lines; its XML export underlies the corpus described here. Other scanners cover particular metres: for instance the CLTK modules for the hexameter and the hendecasyllable, Anceps for Senecan trimeter [<xref ref-type="bibr" rid="ref15">11</xref>], or Loquax for syllabification and IPA transliteration [12].</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.1. Source Texts and Metrical Gold Standard</title>
        <p>The audio that accompanies the dataset was derived from two chosen passages: the first hundred hexameters of Vergil’s Aeneid and the opening elegiac epistle of Ovid’s Heroides. These segments supply, on the one hand, a pure run of dactylic hexameter and, on the other, the alternation of hexameter and pentameter typical of the elegiac couplet. Machine-readable scansion was taken from the XML export of the Pedecerto project [<xref ref-type="bibr" rid="ref1">10</xref>]. Each &lt;line&gt; element preserves the metrical category, the canonical foot pattern and, for every word, a sy attribute that enumerates syllables while marking the ictus with an upper-case character. The import script retained verse boundaries, foot sequence, ictic flags, elision hints and word-boundary information; all other metadata were discarded. A fragment of the XML illustrates the structure:</p>
        <p>&lt;line name="1" meter="H" pattern="DDSS"&gt;
&lt;word sy="1A1b" wb="CF"&gt;Arma&lt;/word&gt;
...
&lt;word sy="2c3A" wb="CM"&gt;cano,&lt;/word&gt;
&lt;word sy="3T4A" wb="CM"&gt;Troiae&lt;/word&gt;
...
&lt;/line&gt;</p>
        <p>Prompt engineering began with an extensive style sheet, eventually distilled to three imperatives: speak slowly, articulate every syllable, and obey the marked stresses. Re-printing the fully processed verse inside the system prompt, exactly as it should be spoken, noticeably improved alignment between text and realisation.</p>
        <p>Because stochastic sampling introduces variation, ten readings were requested for every line. The final system prompt was:</p>
        <p>This is a Latin poetical verse. Pronounce it rhythmically, slowly and with emphasis, articulating each syllable and correctly stressing them. Pronounce it like this: [pre-processed verse]</p>
      </sec>
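      <p>To illustrate the sy convention shown in the fragment above, the following sketch (helper names hypothetical, not part of the released scripts) decodes a sy attribute into per-syllable codes and upper-cases the ictic syllables of a pre-syllabified word:</p>

```python
import re

def decode_sy(sy):
    """Split a Pedecerto sy attribute into per-syllable codes.

    Each syllable is encoded as a foot number plus a position letter;
    an upper-case letter marks the ictic syllable (e.g. sy="1A1b" for
    "Arma": "1A" is ictic, "1b" is not).
    """
    codes = re.findall(r"\d[A-Za-z]", sy)
    return [(code, code[1].isupper()) for code in codes]

def mark_ictus(syllables, sy):
    """Upper-case the ictic syllables of a word, as in Section 3.2.

    The syllable split itself is assumed to come from the CLTK
    syllabifier; the grave accent on the ictic vowel is omitted here.
    """
    marked = [
        syl.upper() if ictic else syl
        for syl, (code, ictic) in zip(syllables, decode_sy(sy))
    ]
    return "".join(marked)
```

      <p>For the first word of the Aeneid, mark_ictus(["Ar", "ma"], "1A1b") yields "ARma", the capitalised form that the pre-processing stage feeds to the speech engine.</p>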
      <sec id="sec-2-3">
        <title>3.2. Pre-processing and Orthographic Adaptation</title>
        <p>Each verse underwent an iterative pipeline before it was ever passed to the speech engine. Syllabification relied on the rule-based module distributed with the Classical Language Toolkit [7]; diphthongs and enclitics are already covered in that implementation. The vowel of every ictic syllable received a grave accent and the complete syllable was converted to capitals. Obligatory elisions were realised as graphic mergers (quoque et therefore became quoquet) according to the Pedecerto wb flag. A comma was inserted where the metre demands a caesura unless the manuscript already offered punctuation at that position. Early trials separated syllables with hyphens, but the additional markers produced no audible advantage and the idea was dropped.</p>
        <p>A second pass substituted graphemes that tend to mislead English-trained acoustic models. Before front vowels 〈c〉 was rewritten as 〈k〉, 〈qu〉 became 〈kw〉, the diphthongs 〈ae〉 and 〈oe〉 were rendered 〈ai〉 and 〈oi〉, and palatal 〈g〉 was expanded to 〈gh〉. The resulting string approximates a classical pronunciation yet stays within the alphabetic habits of contemporary TTS systems.</p>
        <p>To keep prosodic control local, each line was synthesised in isolation; the rhythm inside a verse must be coherent, whereas a small pause between verses is both acceptable and expected in performance. The addition of the word "slowly" to the prompt, explicitly telling the model to recite the verses at a relaxed pace, proved particularly useful in ensuring that each syllable was correctly articulated.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Human Validation and Error Annotation</title>
        <p>Specialists in Latin phonology audited every recording. Errors were marked on spans and classified as segmental substitution or ictus misplacement. Feedback after each experimental round guided small adjustments to the pre-processing routine and to the wording of the prompt. Acceptance was granted when a line contained no error of stress or elision and no more than minor segmental deviations; under this criterion at least one satisfactory rendition was eventually found for each of the autonomous lines.</p>
        <sec id="sec-2-4-1">
          <title>3.5. Mastering and Packaging</title>
          <p>For every verse the reviewers selected the highest-scoring file. The selected waveforms were loudness-normalised and concatenated with an 800 ms silence, yielding two continuous recitations that preserve per-line rhythmic autonomy. Alongside the audio, the repository contains the textual and metrical companion files listed below.</p>
        </sec>
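          <p>The concatenation step can be sketched with the standard-library wave module. This is illustrative only: the released pipeline works on mp3 files and also loudness-normalises each take, both of which are omitted here, and the function name is hypothetical.</p>

```python
import wave

SILENCE_MS = 800  # inter-verse pause used in the corpus

def stitch(in_paths, out_path, silence_ms=SILENCE_MS):
    """Concatenate mono takes into one recitation with fixed pauses.

    Zero bytes stand in for silence, which is exact for 16-bit PCM;
    all inputs are assumed to share the format of the first file.
    """
    with wave.open(in_paths[0], "rb") as first:
        params = first.getparams()
    gap_frames = int(params.framerate * silence_ms / 1000)
    gap = b"\x00" * (gap_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for i, path in enumerate(in_paths):
            if i:  # insert the verse pause between, not before, takes
                out.writeframes(gap)
            with wave.open(path, "rb") as take:
                out.writeframes(take.readframes(take.getnframes()))
```

          <p>mp3 encoding and decoding would be delegated to an external tool such as ffmpeg, outside the scope of this sketch.</p>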
      </sec>
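      <p>The grapheme substitutions of Section 3.2 amount to a handful of ordered rewrite rules. A simplified sketch follows, assuming lower-case input; the published scripts also handle the capitalised ictic syllables and a few Italian-style conventions not reproduced here.</p>

```python
import re

def adapt_orthography(word):
    """Rewrite classical spelling into TTS-friendly graphemes."""
    word = re.sub(r"qu", "kw", word)          # qu -> kw
    word = word.replace("ae", "ai")           # diphthong ae -> ai
    word = word.replace("oe", "oi")           # diphthong oe -> oi
    word = re.sub(r"c(?=[eiy])", "k", word)   # hard c before front vowels
    word = re.sub(r"g(?=[eiy])", "gh", word)  # hard g before front vowels
    return word
```

      <p>Rule order matters: the diphthong rewrites run before the front-vowel checks so that, for example, a c preceding 〈ae〉 is judged against the adapted 〈ai〉 spelling.</p>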
      <sec id="sec-2-5">
        <title>3.3. Speech Generation and Iterative Refinement</title>
        <p>Two technological families were explored. Conventional sequence-to-sequence TTS engines, such as Tacotron 2 [2], Kokoro [15], and OpenAI’s tts-1-hd [<xref ref-type="bibr" rid="ref2">16</xref>], offer little room for instruction: stress was frequently misplaced and vowel length erratic, particularly when the Latin token resembled a common English form. Multimodal large language models with an integrated audio decoder fared better because the system prompt can be used to impose a prosodic policy. Several models in the GPT-4o and Gemini lines were evaluated; gpt-4o-mini-tts [3] delivered the most consistent timing and segmental clarity.</p>
        <p>The repository bundles, alongside the audio:
• the original Pedecerto XML fragments,
• the full text of the chosen passages,
• the pre-processed lines that were given as input to the TTS model.</p>
        <p>All artefacts are released under an open licence and have been deposited on Zenodo together with a DOI, ensuring long-term accessibility and citability [17].</p>
        <p>For a more thorough discussion of the methodological choices, from model selection to human evaluation, the reader is referred to a previous publication [18].</p>
      </sec>
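      <p>A minimal sketch of the generation loop follows, assuming the openai Python package and an API key in the environment; the voice name and the response-handling call are assumptions to be checked against the provider documentation, and only the prompt builder reproduces wording from the paper.</p>

```python
def build_prompt(preprocessed_verse):
    """Assemble the final system prompt quoted in Section 3.3."""
    return (
        "This is a Latin poetical verse. Pronounce it rhythmically, "
        "slowly and with emphasis, articulating each syllable and "
        "correctly stressing them. Pronounce it like this: "
        + preprocessed_verse
    )

def synthesise_takes(verse, out_stem, n_takes=10):
    """Request ten stochastic readings of one pre-processed line.

    Network call sketched from the current OpenAI speech endpoint;
    the voice choice is an assumption, not the paper's setting.
    """
    from openai import OpenAI  # requires the openai package
    client = OpenAI()
    for i in range(n_takes):
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="alloy",
            input=verse,
            instructions=build_prompt(verse),
        )
        response.write_to_file(f"{out_stem}_take{i}.mp3")
```

      <p>Printing the verse verbatim inside the instructions, exactly as it should be spoken, is the step the paper identifies as most responsible for text-realisation alignment.</p>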
    </sec>
    <sec id="sec-4">
      <title>4. Results: Description of the Released Corpus</title>
      <p>The outcome of the workflow is an aligned collection of Latin poetic audio accompanied by the textual and metrical information required for downstream work in speech technology, pedagogy and quantitative metrics. The repository is organised around three sections: for each poem, a text file contains its original lines, another one has the pre-processed text that was fed to the TTS model, an XML file contains the Pedecerto metrical annotations, and a set of mp3 files represents the audio output, stored both individually and as groups. Table 1 gives an overview of the material.</p>
      <sec id="sec-4-1">
        <title>4.1. Audio Layer</title>
        <p>For each verse ten independent readings were decoded. After expert screening one rendition was retained as the canonical file. Recordings are stored as mp3 files. Silences at verse boundaries have been standardised to 800 ms; no fades or noise reduction were applied, so that the signal keeps its original spectral profile.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Text and Prosodic Annotation</title>
        <p>The reference transcription follows the orthography employed during synthesis (grave accents on ictic vowels, upper-case ictic syllables, adapted spellings for c, qu, g, ae, oe) so that users can reproduce or extend the experiments without reverse engineering. A parallel file restores classical spelling for readers who prefer a diplomatic text. Pedecerto syllable scansions, foot divisions and caesura marks are displayed in a separate XML file.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Availability and Licensing</title>
        <p>All components are released under CC-BY 4.0. The Zenodo record bundles the audio, textual content, and annotations [17].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The present corpus was assembled in order to facilitate prosodically faithful speech synthesis, yet the labour invested in its creation has generated several observations that matter beyond the immediate goal of reciting Vergil and Ovid. Three strands of evidence stand out: the behaviour of the language model during synthesis, the practical tricks that secured acceptable output, and the prospective uses of the aligned data in research and teaching.</p>
      <sec id="sec-5-1">
        <title>5.1. Accents, Cross-lingual Interference, and what the Model really “Knows”</title>
        <p>When the decoder was left to its own devices it tended to interpret individual words through the accent template of whichever modern language offered the closest orthographic match. As a consequence, passages dominated by vocabulary shared with present-day Romance received an intonation reminiscent of Italian, while lines rich in loanwords familiar to English appeared with a markedly anglophone timbre. Spanish patterns surfaced less often, but, for example, whenever words ended in -rant the cadence was audibly Iberian. These drifts rarely broke the quantitative rhythm prescribed by hexameter or pentameter; they did, however, blur vowel quality, especially in the mid-front and mid-back zones. The phenomenon confirms that the model encodes a multilingual phonology for conversational prose, and stresses again how little purely Latin data the underlying training set must contain.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Problematic Phonotactics, Orthographic Workarounds and Caesurae</title>
        <p>Several consonant clusters led to systematic errors. Final -nx in coniunx or initial tl- in Tlepolemus were clipped or resolved into epenthetic vowels, presumably because the sequences are rare in the speech material seen during pretraining. In other cases the word was recast according to a high-frequency modern homograph: -um often came out as @m, betraying an English proper-name template.</p>
        <p>Two heuristics mitigated these slips. First, lengthening the grapheme that carries the metrical ictus often persuaded the model to anchor stress correctly; cano became caano in the prompt, which silenced the temptation to favour the English reading. Second, replacing rare digraphs with phonetically transparent ones, as documented in Section 3.2, reduced segmental substitutions by a third. The interventions are admittedly ad hoc, yet they illustrate how a handful of hand-crafted rules can serve where large retraining runs are impossible. While improvements can certainly be made with regard to the overall flow of the generated verses, the insertion of punctuation remains one of the most effective devices: the addition of commas to mark caesurae, in particular, proved useful in ensuring that the synthetic voices followed a precise pattern. Figure 1 shows the Mel spectrograms of the first twenty lines of the opening epistula of the Heroides, generated to visualize the verses’ rhythm. The spectrograms clearly show that black bars (representing pauses) are always present in the middle of the bottom tiles (pentameters), while they are more spaced out in the top tiles (hexameters). This is due to the obligatory caesura that occurs after the third arsis of each pentameter, precisely in the middle of the verse, while hexameters present more varied structures.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. From Corpus to Model Improvement</title>
        <p>Because each verse is aligned with a verified audio file, the collection can function as a fine-tuning set for both autoregressive and non-autoregressive TTS systems. A model trained on metrically correct examples should internalize the prosodic rules more reliably than a general-purpose system forced to extrapolate from modern language data. Even large language models themselves could benefit from exposure to annotated Latin verse during continued pretraining, potentially reducing the need for elaborate preprocessing in subsequent applications. Future work could examine whether freezing the acoustic front-end while training only the variance adapters suffices to introduce length contrast in addition to stress.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Classroom Impact and Reproducibility</title>
        <p>In a teaching environment the recordings serve two complementary roles. Students can listen to a metrically regular rendition before attempting their own, and instructors can use the annotated text as input to alternative voices or slower tempos. Since every script that produced the audio leverages easily accessible APIs, replication is straightforward. Such transparency matters especially for assessment settings where students must know exactly which variant counts as the reference.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Open Data and Transfer to Other Languages</title>
        <p>Latin is only one among many historical or minoritised languages whose sound patterns are absent from mainstream speech technology. The workflow described here, licensed permissively and documented line by line, could be cloned for Ancient Greek, Old Occitan, Classical Arabic, or any verse tradition that already enjoys digital scansion. Riemenschneider and Frank [5] argue that large language models can be useful tools for Classical Philology; releasing small but expertly annotated sets therefore aims to accelerate progress. Open repositories also lower the entry cost for community contributors who may wish to supply alternative voices, extended passages, or corrected quantities.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Towards Length-sensitive Synthesis</title>
        <p>Stress was easier to enforce than absolute vowel length. The present system approximates quantity indirectly through slower pacing on ictic syllables, yet it cannot keep a fixed ratio between heavy and light vowels. Lam et al. [13] report that explicit duration tokens unlock such control in English; integrating a similar mechanism with the current prompt-based strategy is an obvious next step. Ultimately, a synthesis pipeline that differentiates both stress and quantity would let classicists test competing reconstructions of Latin phonology "in silico", converting theoretical statements into audible hypotheses.</p>
      </sec>
    </sec>
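    <p>The vowel-lengthening heuristic of Section 5.2 can be sketched as a one-line rewrite; the function name is hypothetical and the released scripts may apply additional conditions.</p>

```python
import re

def double_ictic_vowel(syllable):
    """Double the first vowel of an ictic syllable (e.g. "ca" -> "caa").

    Applied to the ictic syllable of a word such as cano, this yields
    the "caano" spelling that anchored stress away from the English
    reading.
    """
    return re.sub(r"[aeiouy]", lambda m: m.group(0) * 2, syllable, count=1)
```

    <p>Only the first vowel is doubled so that diphthongs are not stretched into three-vowel runs.</p>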
    <sec id="sec-6">
      <title>5.7. Outlook</title>
      <p>Refining the preprocessing scripts, automating error spotting, and expanding the text base are immediate priorities. Nevertheless, even in its present form the corpus already supports experiments in few-shot prosody transfer, quantitative metrics, and accessible pedagogy. The value of such resources lies also in the demonstration that high-quality data can be gathered with modest equipment, provided that domain knowledge and iterative verification guide the process. Open, reproducible corpora therefore remain the necessary foundation on which future work for classical languages will build.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Limitations</title>
      <p>The corpus was assembled with the intention of demonstrating what present-day language models can already achieve when prompt engineering is combined with careful human verification. Precisely because the focus lay on a proof of concept, several boundaries were accepted that restrict the scope of the resource. The most visible limitation concerns size. Only two passages, albeit canonical ones, entered the pipeline; together they furnish a little under twenty-four minutes of speech. For certain experiments in prosody transfer that duration suffices, yet quantitative studies of acoustic variance or full fine-tuning of an end-to-end synthesiser usually require at least an order of magnitude more material.</p>
      <p>Closely related is the question of stylistic breadth. The Aeneid and the Heroides differ in metre, tone and lexicon, but both belong to the same literary period and represent the same formal register. Comedy, forensic oratory or Late Latin hymns remain untested. Consequently, the substitution rules that helped the model through epic and elegiac vocabulary might fail when confronted with colloquial forms, post-Classical spellings or heavy Greek loanwords.</p>
      <p>Another constraint derives from the decision to rely on a single synthetic voice. Because speaker identity never changes, the corpus cannot inform studies that investigate how metre interacts with timbre or gendered pitch ranges. Similarly, only one variant of reconstructed pronunciation is encoded. Alternative schools that prefer the ecclesiastical pronunciation will find no examples that match their conventions. Validation, indispensable for quality control, introduces its own bias. Judgements about short hesitations or barely perceptible vowel colouring can differ across traditions; a panel of experts drawn from a wider set of institutions might have retained or rejected a slightly different subset of takes.</p>
      <p>Technical choices add further caveats. Recordings were mastered to mp3 for ease of distribution, which entails lossy compression. The prompts are public, but the underlying model weights remain proprietary; should the provider change access policies, identical reproduction could become impossible. Finally, quantity was approximated through slower pacing on metrically strong syllables. The approach yields a rhythm that experienced listeners recognise, yet it falls short of enforcing a fixed heavy-to-light duration ratio, the gold standard in phonetic work on quantitative metres [4].</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion</title>
      <p>The study has introduced an openly licensed, line-aligned corpus that brings classical Latin verse within reach of modern text-to-speech technology. By combining Pedecerto’s machine-readable scansion with a small set of orthographic substitutions and a concise prosodic prompt, the workflow coerced a general-purpose large language model into producing intelligible, metrically coherent recitations. Systematic human screening guaranteed that the released audio reflects the intended rhythm at a level suitable for both pedagogy and computational research.</p>
      <p>The resulting dataset offers three immediate avenues of use. Teachers can deploy the files as accessible classroom material, learners may rehearse passages while receiving instant acoustic feedback, and speech engineers now possess a clean test bed for experiments in prosody conditioning. Beyond these practical gains, the project demonstrates that domain knowledge, when encoded explicitly in the input, still matters even in an era of ever larger pretrained models. Prompt design, although sometimes dismissed as a stop-gap measure, revealed itself here as a cost-effective alternative to full retraining.</p>
      <p>Future work will have to broaden the metrical and generic range, increase speaker diversity and explore direct duration control. A longer-term ambition is to fold the current resource into a multilingual library of verse corpora, so that comparative metrics across the Indo-European tradition become feasible. The dataset and annotations supplied with this release aim to render such extensions straightforward.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The author thanks the entire Pedecerto team for annotating and sharing their XML scansions. Sincere gratitude is extended to the CLTK community, whose open-source tools simplified syllabification and phonological checks. Colleagues at the University of Foggia donated hours to the auditory review of candidate recordings; their expertise shaped both the preprocessing rules and the acceptance thresholds. Any remaining inaccuracies are the sole responsibility of the author.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Pedecerto. doi:10.57967/hf/4329.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[16] OpenAI, tts-1-hd model documentation, 2025. URL: https://platform.openai.com/docs/models/tts-1-hd, accessed: 2025-06-29.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[1] B. W. Fortson IV, Latin prosody and metrics, in: A companion to the Latin language (2011) 92-104.</mixed-citation>
      </ref>
      <ref id="ref3b">
        <mixed-citation>[17] M. Ciletti, Veras audire et reddere voces: A corpus of prosodically-correct Latin poetic audio from large-language-model tts, 2025. URL: https://doi.org/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, Y. Wu,</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>Natural tts synthesis by conditioning wavenet on 10</article-title>
          .5281/zenodo.15677356. doi:
          <volume>10</volume>
          .5281/zenodo.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>mel spectrogram predictions</source>
          ,
          <year>2018</year>
          . URL: https:// 15677356.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          arxiv.org/abs/1712.05884. arXiv:
          <volume>1712</volume>
          .
          <fpage>05884</fpage>
          . [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciletti</surname>
          </string-name>
          , Prompting the muse: Generating [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Goucher</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Perelman, prosodically-correct Latin speech with large lan-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , et al.,
          <source>Gpt-4o system card</source>
          , (Eds.),
          <source>Proceedings of the 63rd Annual</source>
          Meet-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
          <article-title>ing of the Association for Computational Lin</article-title>
          [4]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Allen</surname>
          </string-name>
          , Vox Latina:
          <article-title>a guide to the pronunciation guistics</article-title>
          (Volume
          <volume>4</volume>
          : Student Research Workshop),
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>of classical Latin</source>
          , Cambridge University Press,
          <year>1989</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Vi[5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Riemenschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frank</surname>
          </string-name>
          , Exploring large lan- enna, Austria,
          <year>2025</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>745</lpage>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>guage models for classical philology, arXiv preprint aclanthology</article-title>
          .org/
          <year>2025</year>
          .acl-srw.
          <volume>48</volume>
          /. doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>arXiv:2305.13698</source>
          (
          <year>2023</year>
          ). v1/
          <year>2025</year>
          .acl-srw.
          <volume>48</volume>
          . [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          , Methods in Latin computational
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>linguistics</surname>
          </string-name>
          , volume
          <volume>1</volume>
          ,
          <string-name>
            <surname>Brill</surname>
          </string-name>
          ,
          <year>2013</year>
          . [7]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , P. J.
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Stewart</surname>
          </string-name>
          , T. Cook,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          guages,
          <source>in: Proceedings of the 59th annual meeting</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>the 11th international joint conference on natural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>language processing: System demonstrations</source>
          ,
          <year>2021</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          pp.
          <fpage>20</fpage>
          -
          <lpage>29</lpage>
          . [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. D.</surname>
          </string-name>
          Man-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          arXiv:
          <year>2003</year>
          .
          <volume>07082</volume>
          (
          <year>2020</year>
          ). [9]
          <string-name>
            <surname>M.-C. De Marnefe</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Nivre</surname>
          </string-name>
          , D. Ze-
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>guistics 47</source>
          (
          <year>2021</year>
          )
          <fpage>255</fpage>
          -
          <lpage>308</lpage>
          . [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Colombi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mondin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tessarolo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Bacianini,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>rica Latina Digitale</surname>
          </string-name>
          (
          <year>2011</year>
          ). [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fedchin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dexter</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>Journal of Philology</source>
          <volume>143</volume>
          (
          <year>2022</year>
          )
          <fpage>475</fpage>
          -
          <lpage>503</lpage>
          . [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Court</surname>
          </string-name>
          ,
          <article-title>Loquax: Nlp framework for phonology,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          https://github.com/mattlianje/loquax,
          <year>2025</year>
          . GitHub
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          repository. [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sisman</surname>
          </string-name>
          , D. Herre-
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>IEEE Signal Processing Letters</surname>
          </string-name>
          (
          <year>2025</year>
          ). [14]
          <string-name>
            <surname>M. De Sisto</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Hernández-Lorenzo</surname>
          </string-name>
          , J. De la Rosa,
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>Digital Scholarship in the Humanities</source>
          <volume>39</volume>
          (
          <year>2024</year>
          )
          <fpage>500</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          521. [15]
          <string-name>
            <surname>Hexgrad</surname>
          </string-name>
          , Kokoro-82m
          <source>(revision d8b4fc7)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>URL: https://huggingface.co/hexgrad/Kokoro-82M.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>