<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Scientific Workshop on Applied Information Technologies and Artificial Intelligence Systems,
December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Singing voice synthesis via latent flow matching and differentiable digital signal processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhii Lupenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maksym Klishch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasyl Yatsyshyn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleh Pastukh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science, Opole University of Technology</institution>
          ,
          <addr-line>45-758 Opole</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ternopil Ivan Puluj National Technical University</institution>
          ,
          <addr-line>Ruska Street 56, 46001 Ternopil</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Recent advances in singing voice synthesis (SVS) have achieved high perceptual quality through variational and diffusion-based generative frameworks. However, diffusion models require many iterative steps for inference, while variational approaches may suffer from a mismatch between prior and posterior distributions, affecting pitch accuracy and pronunciation. We propose a flow matching-based latent generative model that integrates a DDSP autoencoder with a latent flow predictor for efficient and expressive singing voice synthesis. By combining spectral parameter modeling with continuous latent flow transformation, the system achieves high-fidelity waveform generation with reduced computational cost. Experimental results demonstrate that the proposed model attains comparable perceptual quality to state-of-the-art baselines while using fewer parameters and achieving faster inference.</p>
      </abstract>
      <kwd-group>
        <kwd>Singing voice synthesis</kwd>
        <kwd>flow matching</kwd>
        <kwd>differentiable digital signal processing</kwd>
        <kwd>acoustic modelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Singing Voice Synthesis (SVS) aims to generate expressive singing from musical scores and lyrics.
Unlike Text-to-Speech, SVS must precisely reproduce pitch, rhythm, and expressive nuances such
as vibrato and legato.</p>
      <p>Despite their conceptual similarity, SVS presents a set of challenges that are substantially more
complex than those in TTS. First, the pitch contour in singing spans a much wider dynamic range
and requires frame-level accuracy: even small deviations in the fundamental frequency (F0) can lead to
perceptually unnatural or dissonant results. Second, the temporal structure of singing is governed
by musical rhythm and note duration rather than by natural speech prosody. This introduces a
strong dependency between phoneme alignment and musical timing, where mismatched durations
or onsets can severely distort lyrical intelligibility. Third, expressive performance characteristics
such as vibrato, portamento, and dynamic loudness variation are essential for naturalness, yet are
difficult to model using standard text-to-speech architectures. Moreover, singing datasets are
typically smaller and less diverse than speech corpora, limiting the robustness of purely
data-driven approaches.</p>
      <p>
        Over the past two decades, singing voice synthesis has evolved from concatenative to neural
generative paradigms. Early systems such as Vocaloid [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and UTAU relied on concatenative
playback of recorded phonemes, offering manual control but limited expressiveness. Statistical
models like Sinsy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced hidden Markov modeling for note durations and pitch, enabling
automation at the cost of oversmoothed timbre. With deep learning, architectures such as
XiaoiceSing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and DeepSinger [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] adopted transformer-based encoders for non-autoregressive
prediction, improving pitch and rhythm stability.
      </p>
      <p>
        Recent neural SVS systems address these issues through variational, diffusion, or flow-based
generative modeling. While architectures such as VISinger [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] achieve high quality and can
generalize with limited data, they often suffer from training-inference mismatch between the
posterior (audio) and prior (musical score) distributions, which can lead to inaccurate pitch and
mispronunciations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Diffusion models, including HiddenSinger [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and DiffSinger [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], deliver
state-of-the-art fidelity but require iterative denoising in high-dimensional acoustic space, leading
to slow inference and high computational cost.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Singing voice synthesis models</title>
        <p>Recent research in SVS has progressed through the integration of deep learning, neural audio
codecs, diffusion processes, flow-based methods, and digital signal processing (DSP) techniques.
The evolution of SVS models reflects a shift from deterministic acoustic modeling toward
probabilistic and latent generative paradigms.</p>
        <p>
          One of the systems of this generation, XiaoiceSing [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], employs a FastSpeech-style architecture
that combines phoneme, positional, and musical features to predict phoneme durations and the
fundamental frequency F0. The model performs non-autoregressive acoustic prediction while
incorporating rhythmic and melodic conditioning.
        </p>
        <p>
          The VISinger series [
          <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
          ] is built upon the VITS architecture (Variational Inference with
adversarial learning for end-to-end Text-to-Speech). It integrates variational latent modeling, flow
transformations, and adversarial training to jointly model acoustic features and waveform
reconstruction. VISinger 2 extends this framework by introducing differentiable digital signal
processing (DDSP) blocks and a HiFi-GAN vocoder, explicitly separating harmonic and noise
components. The system operates in an end-to-end manner, synthesizing 44.1 kHz audio, and was
trained for 500 k steps on 5 hours of data.
        </p>
        <p>
          DiffSinger [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] adopts diffusion probabilistic modeling for generating mel-spectrograms. The
model transforms Gaussian noise into target spectrograms through a conditional diffusion process,
and its shallow diffusion mechanism reduces the number of denoising steps to improve generation
efficiency.
        </p>
        <p>
          HiddenSinger [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] combines a neural audio codec with a latent diffusion model. The system
encodes singing audio into a low-dimensional latent space, performs generation within this space,
and reconstructs full-band audio through a neural decoder. It was trained on 150 hours of data for 3
million steps with a batch size of 32. The variant HiddenSinger-U supports semi-supervised
training on unpaired singing data.
        </p>
        <p>
          Multilingual and zero-shot adaptation is explored in TCSinger 2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which enables cross-lingual
and cross-singer synthesis without additional fine-tuning. Meanwhile, TechSinger [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] employs the
flow matching paradigm for controllable SVS, allowing explicit manipulation of vocal techniques
such as vibrato and breathiness across multiple languages.
        </p>
        <p>Overall, contemporary SVS systems encompass a wide spectrum of architectures—from
non-autoregressive models (XiaoiceSing) and VITS-based variational frameworks (VISinger,
VISinger 2) to diffusion-based (DiffSinger), latent-diffusion (HiddenSinger), and flow-based
(TechSinger) approaches — advancing toward multilinguality, controllability, and expressive
singing synthesis.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Recent flow matching based TTS models</title>
        <p>Recent advances in text-to-speech (TTS) modeling demonstrate a steady transition from
autoregressive architectures to flow-based and latent generative approaches aimed at improving
synthesis efficiency, prosodic coherence, and controllability. Several works explore different
formulations of flow matching and diffusion processes for speech generation.</p>
        <p>
          MetaTTS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] integrates pre-training strategies into TTS and examines how large-scale flow
matching can enhance generalization and naturalness. This study represents one of the first
attempts to apply flow matching at the scale of foundation speech models, bridging pre-trained
acoustic representations with generative modeling.
        </p>
        <p>
          Building upon this line, F5-TTS [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] investigates long-form speech synthesis, emphasizing
text-speech alignment and temporal consistency. Using the flow matching framework, it maintains
coherent prosody across extended utterances, highlighting the suitability of flow-based methods
for narrative and expressive speech generation.
        </p>
        <p>
          Matcha-TTS [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposes a streamlined conditional flow matching architecture optimized for
low-latency synthesis. The model reduces inference time while preserving competitive audio
quality, indicating its potential for real-time applications.
        </p>
        <p>
          A complementary direction is explored in LatentSpeech [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which applies latent diffusion
modeling to TTS. Speech is generated within a compact latent space rather than directly in the
spectral or temporal domain, reducing computational requirements while retaining acoustic detail.
This approach demonstrates the efficiency gains achievable through latent-space diffusion.
        </p>
        <p>
          VoiceFlow [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] extends the flow-based paradigm by employing rectified flow matching,
reformulating the underlying dynamics to minimize integration steps and inference cost. The
resulting model achieves comparable perceptual quality with improved computational efficiency.
        </p>
        <p>
          Finally, ProsodyFlow [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] integrates conditional flow matching with prosody modeling
informed by large speech language models. This combination allows explicit control over
high-level prosodic dimensions such as intonation, rhythm, and stress, illustrating how linguistic
conditioning can enhance expressiveness in neural TTS.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed method</title>
      <p>This section presents the proposed latent-conditioned architecture for singing voice synthesis,
which combines a flow-based latent predictor with a DDSP-based autoencoder for waveform
reconstruction, as illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Autoencoder</title>
        <p>
          The architecture adopts a differentiable DSP-based autoencoder. Following the approach of HiddenSinger
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we integrate Residual Vector Quantization (RVQ) blocks into the bottleneck to enable a compact
latent representation of the input features.
        </p>
        <p>3.1.1. Encoder</p>
        <p>Given an input waveform x ∈ ℝ^T, where T is the number of audio samples, the encoder E(·) extracts
a latent feature sequence:
z = E(x) ∈ ℝ^(D×N), (1)
where D denotes the feature dimension and N is the number of encoded frames.</p>
        <p>The encoder transforms the raw waveform into a time-aligned latent representation that
summarizes its relevant acoustic structure. This representation retains information about spectral
shape, energy, and temporal evolution while discarding fine-grained sample-level details that are
unnecessary for higher-level modeling.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.2. Residual Vector Quantization</title>
        <p>RVQ approximates a signal z ∈ ℝ^d as the sum of L quantized vectors, each selected from a codebook
C^(i), i = 1, 2, …, L, of fixed size:
q(z) = ∑_{i=1}^{L} q_i(r_i), (2)
where q_i : ℝ^d → C^(i) is the quantization function for the i-th codebook C^(i), and r_i is the residual
vector at the i-th quantization step:
r_1 = z, r_i = z − ∑_{j=1}^{i−1} q_j(r_j). (3)</p>
        <p>This structure enables high-fidelity reconstruction with compact codebooks, establishes a
discrete prior over the latent space, and imposes structural constraints on the latent manifold.</p>
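        <p>For illustration, the listing below gives a minimal numpy sketch of the residual quantization in equations (2) and (3). The codebooks here are random rather than learned, and the depth and codebook size simply mirror the configuration reported in Section 4.3.1; it is a toy example, not the training implementation.</p>
        <preformat>
# Minimal sketch of residual vector quantization (eqs. 2-3) with random, untrained
# codebooks; purely illustrative, the real model learns the codebooks jointly.
import numpy as np

rng = np.random.default_rng(0)
d, L, codebook_size = 128, 8, 1024        # latent dim, quantizers, entries per codebook
codebooks = rng.standard_normal((L, codebook_size, d))

def rvq(z):
    """Quantize z by summing the nearest codeword over successive residuals."""
    quantized = np.zeros_like(z)
    codes = []
    for i in range(L):
        residual = z - quantized                                 # r_i = z - sum_j q_j(r_j)
        dists = np.linalg.norm(codebooks[i] - residual, axis=1)  # distance to each codeword
        k = int(np.argmin(dists))                                # q_i(r_i): nearest codeword
        quantized = quantized + codebooks[i, k]
        codes.append(k)
    return quantized, codes

z = rng.standard_normal(d)
q_z, codes = rvq(z)
print("reconstruction error:", float(np.linalg.norm(z - q_z)))
        </preformat>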
      </sec>
      <sec id="sec-3-3">
        <title>3.1.3. Decoder and vocoder</title>
        <p>
          The decoder is adapted from the architecture proposed in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], adopting Emformer blocks [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
The vocoder follows the design principles of [
          <xref ref-type="bibr" rid="ref17 ref19">17, 19</xref>
          ], combining differentiable signal synthesis
with spectral parameter decoding. It operates on three feature components: the fundamental
frequency F0, the spectral envelope S, and the aperiodicity A.
        </p>
        <p>
          The excitation signal for the harmonic component is defined as the sum of sawtooth-like signals
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]:
e_h(t) = ∑_{k=1}^{K} (1/k) sin(φ_k(t)), (4)
where K is the number of harmonics and φ_k(t) is the instantaneous phase of the k-th harmonic,
given by:
φ_k(t) = 2π k ∫_0^t F0(τ) dτ. (5)
        </p>
        <p>The excitation signal is then passed through a frequency-domain filter to generate the harmonic
signal:
x_h = ℱ⁻¹((1 − A) S ℱ(e_h)), (6)
where ℱ and ℱ⁻¹ denote the forward and inverse short-time Fourier transforms (STFT) with a
window length of 2048 samples and 75% overlap, A is the aperiodicity coefficient, and S is the
spectral shaping filter.</p>
        <p>The initial excitation for the noise component is defined as Gaussian white noise:
e_n(t) ∼ N(0, 1). (7)
The noise signal is then generated by applying the same spectral filter with the aperiodicity
weighting:
x_n = ℱ⁻¹(A S ℱ(e_n)). (8)</p>
        <p>Finally, the synthesized singing voice signal is obtained as a weighted sum of the harmonic and
noise components:
x̂ = a_h x_h + a_n x_n, (9)
where a_h and a_n are coefficients that control the relative contribution and amplitude of the
harmonic and noise components, respectively.</p>
        <p>
          Following [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], we adopt compression/decompression for S and A. The linear spectral envelope S is mapped to
the log-mel domain using a mel filterbank M:
S_c = log₁₀(M S + ε), (10)
and reconstructed during synthesis as:
S = M⁺ (10^{S_c} − ε), (11)
where M⁺ = max(M⁻¹, 0) is a non-negative pseudo-inverse of M applied elementwise to avoid
negative reconstruction artifacts.
        </p>
        <p>For aperiodicity, the decoder predicts a compressed representation with 16 channels, which is
linearly interpolated along the frequency axis to obtain the full 513-dimensional aperiodicity A
used by the differentiable signal synthesis module.</p>
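        <p>As a rough illustration of the excitation signals in equations (4), (5), and (7), the sketch below builds a sawtooth-like harmonic excitation from a frame-level F0 contour and a Gaussian noise excitation. The STFT-domain filtering by S and A (equations (6) and (8)) is omitted, and the constant F0 contour and mixing weights are illustrative assumptions rather than model outputs.</p>
        <preformat>
# Toy sketch of the harmonic and noise excitations; filtering by S and A is omitted.
import numpy as np

sr, hop = 24000, 240                            # 24 kHz audio, 100 frames per second
f0_frames = np.full(200, 220.0)                 # 2 s of a constant 220 Hz note (toy input)
f0 = np.repeat(f0_frames, hop)                  # upsample F0 to the sample rate

K = 16                                          # number of harmonics
phase = 2.0 * np.pi * np.cumsum(f0) / sr        # discrete integral of F0 (eq. 5, k = 1)
e_h = sum(np.sin(k * phase) / k for k in range(1, K + 1))    # eq. (4)
e_n = np.random.default_rng(0).standard_normal(len(e_h))     # eq. (7)

a_h, a_n = 0.9, 0.1                             # illustrative mixing weights (cf. eq. 9)
x = a_h * e_h + a_n * e_n
print(x.shape)
        </preformat>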
      </sec>
      <sec id="sec-3-4">
        <title>3.1.4. Training criteria</title>
        <p>The autoencoder model is optimized using a combination of spectral reconstruction, vector
quantization, and adversarial objectives. Each term contributes to different aspects of perceptual
and structural fidelity in the generated waveform.</p>
        <p>Reconstruction loss. To ensure accurate waveform reconstruction, we adopt a multi-resolution
STFT loss:
ℒ_STFT = E[ ∑_{i=1}^{6} ( ‖S_i(x) − S_i(x̂)‖₂² + ‖log S_i(x) − log S_i(x̂)‖₂² ) ], (12)
where S_i denotes the normalized STFT with FFT and window sizes of 2^(5+i) and a hop length of
25% of the window size.</p>
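        <p>A minimal PyTorch sketch of such a multi-resolution STFT loss is shown below. The six FFT sizes and the 25% hop follow the description above; the exact normalization and magnitude floor are assumptions.</p>
        <preformat>
# Sketch of a multi-resolution STFT loss in the spirit of eq. (12).
import torch

def multi_res_stft_loss(x, x_hat, fft_sizes=(64, 128, 256, 512, 1024, 2048)):
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4                         # hop length of 25% of the window
        window = torch.hann_window(n_fft)
        def spec(sig):
            return torch.stft(sig, n_fft, hop_length=hop, win_length=n_fft,
                              window=window, return_complex=True).abs() + 1e-7
        s, s_hat = spec(x), spec(x_hat)
        loss = loss + torch.mean((s - s_hat) ** 2) + torch.mean((s.log() - s_hat.log()) ** 2)
    return loss

x = torch.randn(1, 24000)
print(multi_res_stft_loss(x, x + 0.01 * torch.randn_like(x)))
        </preformat>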
        <p>
          Vector quantization commitment loss. For residual vector quantization, the commitment objective
encourages stable codebook utilization [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]:
ℒ_RVQ = E[ ∑_{i=1}^{L} ‖r_i − q_i(r_i)‖₂² ]. (13)
        </p>
        <p>
          Discriminative loss. Following [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we adopt both a multi-period discriminator (MPD) and a
multi-scale discriminator (MSD) to enhance the perceptual quality of the generated waveform. The
MPD is designed to capture pitch-synchronous periodic structures across multiple temporal
periods, while the MSD operates on differently scaled versions of the waveform to assess spectral
consistency and long-term coherence. Both discriminators are jointly trained with the generator
using least-squares adversarial objectives and a feature-matching term.
        </p>
        <p>
          The generator G and the discriminators {D_i^{MPD}}_{i=1}^{N_p} and {D_j^{MSD}}_{j=1}^{N_s} are optimized using least-squares
adversarial losses [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. For a target waveform x and its generated counterpart x̂ = G(·), the
objectives are defined as follows.
        </p>
        <p>The generator loss is expressed as:
ℒ_G = E_{x̂} [ ∑_{i=1}^{N_p} (D_i^{MPD}(x̂) − 1)² + ∑_{j=1}^{N_s} (D_j^{MSD}(x̂) − 1)² ]. (14)
The discriminator loss is defined symmetrically as:
ℒ_D = E_{x, x̂} [ ∑_{i=1}^{N_p} ( (D_i^{MPD}(x) − 1)² + (D_i^{MPD}(x̂))² ) + ∑_{j=1}^{N_s} ( (D_j^{MSD}(x) − 1)² + (D_j^{MSD}(x̂))² ) ]. (15)</p>
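        <p>The least-squares objectives in equations (14) and (15) can be sketched as follows, with lists of random scores standing in for the MPD/MSD discriminator outputs; the feature-matching term mentioned above is not shown.</p>
        <preformat>
# Sketch of the LSGAN generator/discriminator objectives (eqs. 14-15).
import torch

def generator_adv_loss(fake_scores):
    # sum_i (D_i(x_hat) - 1)^2
    return sum(torch.mean((d - 1.0) ** 2) for d in fake_scores)

def discriminator_adv_loss(real_scores, fake_scores):
    # sum_i (D_i(x) - 1)^2 + D_i(x_hat)^2
    return sum(torch.mean((dr - 1.0) ** 2) + torch.mean(df ** 2)
               for dr, df in zip(real_scores, fake_scores))

fake = [torch.rand(4, 1) for _ in range(3)]      # placeholder discriminator outputs
real = [torch.rand(4, 1) for _ in range(3)]
print(generator_adv_loss(fake), discriminator_adv_loss(real, fake))
        </preformat>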
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Flow matching</title>
        <p>
          Flow matching [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] is an approach for learning a direct deterministic mapping from a simple prior
distribution to a complex target data distribution using a vector field. Unlike diffusion models,
where the transformation process is governed by stochastic differential equations (SDEs), flow
matching relies on a deterministic ordinary differential equation (ODE) that defines a continuous
trajectory between distributions.
        </p>
        <p>
          We adopt flow matching to predict latent embeddings z by modeling a time-dependent vector
field v_t(z|μ), which describes the transformation of the base distribution p_0 = N(0, 1) into a target
distribution p_1 that approximates the latent data manifold:
dψ_t(z|μ)/dt = v_t(ψ_t(z|μ)), (16)
where ψ_t : [0, 1] × ℝ^(D×N) → ℝ^(D×N) is a time-dependent flow function defined as:
ψ_t(z|μ) = (1 − t) z + t z_1. (17)
        </p>
        <p>
          Here, z_1 denotes a sample from the data distribution in latent space, and t ∈ [0, 1] parameterizes
the transformation path from the prior to the target distribution.
        </p>
        <p>
          The model is trained using the reparametrized optimal-transport conditional flow matching objective [25]:
ℒ_CFM = E_{t∼U[0,1], z_1∼q} (1/(1−t)) ‖z_1 − v_t(z_t|μ)‖₂². (18)
        </p>
        <p>To couple the conditioning pathway with the latent target, we also align the encoder features μ to
the target latent z_1 via an auxiliary MSE:
ℒ_feature = E ‖μ − z_1‖₂². (19)</p>
        <p>This auxiliary coupling encourages the encoder features to carry predictive information about
the target latent while preserving the flow-matching dynamics.</p>
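        <p>The sketch below illustrates one conditional flow matching training step over the linear path of equation (17). It uses the standard OT-CFM velocity target z_1 − z_0 rather than the reparametrized weighting of equation (18), and the small MLP predictor and batch shapes are placeholders, not the actual latent flow predictor.</p>
        <preformat>
# Toy conditional flow matching step on the path z_t = (1 - t) z_0 + t z_1.
import torch

latent_dim = 128
predictor = torch.nn.Sequential(                 # stand-in for the latent flow predictor
    torch.nn.Linear(latent_dim * 2 + 1, 256), torch.nn.SiLU(),
    torch.nn.Linear(256, latent_dim))

def cfm_step(z1, mu):
    z0 = torch.randn_like(z1)                    # sample from the prior N(0, I)
    t = torch.rand(z1.shape[0], 1)               # t ~ U[0, 1]
    zt = (1.0 - t) * z0 + t * z1                 # eq. (17)
    target = z1 - z0                             # standard OT-CFM velocity target
    v = predictor(torch.cat([zt, mu, t], dim=-1))
    loss_cfm = torch.mean((v - target) ** 2)
    loss_feature = torch.mean((mu - z1) ** 2)    # auxiliary coupling, eq. (19)
    return loss_cfm + 0.1 * loss_feature         # 0.1 matches lambda_feature in Section 4.2

z1 = torch.randn(8, latent_dim)                  # target latents from the autoencoder
mu = torch.randn(8, latent_dim)                  # encoder conditioning features
print(cfm_step(z1, mu))
        </preformat>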
      </sec>
      <sec id="sec-3-6">
        <title>3.3. Time lag and note duration</title>
        <p>In singing voice synthesis, the temporal alignment between musical notes and phonemes plays a
critical role in maintaining the naturalness and intelligibility of the generated performance.
Following the approach proposed in [26], we model the time-lag and note duration as two separate
prediction tasks.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.3.1. Time-lag model</title>
        <p>The time-lag represents the offset between the onset of a musical note in the score and the actual
beginning of the corresponding phoneme in the singing performance. This offset primarily arises
because consonants often precede the note onset, while vowels align more closely with the note
boundary. Let g_n denote the reference time-lag for the n-th note and ĝ_n the predicted value. The
model is trained to minimize the mean squared error between predicted and reference lags:
ℒ_lag = E‖g − ĝ‖₂². (20)</p>
        <p>During synthesis, the predicted lag shifts the onset of each note, providing a corrected effective
duration L̂_n for subsequent phoneme allocation.</p>
      </sec>
      <sec id="sec-3-8">
        <title>3.3.2. Duration model</title>
        <p>The duration model predicts the length of each phoneme within a note, given musical and phonetic
features. For the n-th note containing K_n phonemes, the model outputs the expected phoneme
durations d̂_nk under the constraint that their total sum equals the adjusted note length L̂_n:
∑_{k=1}^{K_n} d̂_nk = L̂_n. (21)
The model is trained to minimize the mean squared error between predicted and reference durations:
ℒ_dur = E‖d − d̂‖₂². (22)
This objective encourages accurate phoneme-level timing, ensuring that predicted durations
remain consistent with the musical score.</p>
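        <p>As a small worked example of the constraint in equation (21), the following sketch rescales predicted phoneme durations so that they sum exactly to the lag-adjusted note length; the frame-based rounding scheme is an illustrative assumption.</p>
        <preformat>
# Rescale predicted phoneme durations to match the adjusted note length (eq. 21).
import numpy as np

def allocate_durations(pred_durations, note_length):
    pred = np.asarray(pred_durations, dtype=float)
    scaled = pred * note_length / pred.sum()      # enforce the sum constraint
    frames = np.floor(scaled).astype(int)
    frames[-1] += note_length - frames.sum()      # absorb rounding error in the last phoneme
    return frames

print(allocate_durations([3.2, 7.9, 4.1], note_length=30))   # [ 6 15  9 ], sums to 30
        </preformat>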
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>For training, we used the Tohoku Kiritan dataset [27], which contains 50 songs with a total
duration of approximately 3.5 hours. The first three songs were used for testing, the next three for
validation, and the remaining 44 for training.</p>
        <p>All audio signals were downsampled to 24 kHz and normalized to -26 dB. The dataset was
segmented into 4 second samples, with segmentation boundaries aligned to word-level timing and
natural pauses to preserve linguistic and prosodic coherence.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training</title>
        <p>We trained the model using the Adam optimizer [28] with a learning rate of 10^(−5), β_1 = 0.8, and
β_2 = 0.999 for 250k steps. All experiments were conducted on a single NVIDIA RTX 4060 GPU with
a batch size of 16 and fp16.</p>
        <p>All components of the proposed system are optimized jointly to achieve accurate temporal
alignment, spectral reconstruction, and perceptual fidelity. The overall training objective integrates
the note- and phoneme-level temporal models with the latent and waveform reconstruction
modules. To ensure stable convergence, the discriminators are introduced after 20k steps of
autoencoder pretraining, allowing the generator to learn coarse spectral structures before
adversarial feedback is applied. Full joint optimization of the encoder, decoder, flow predictor, and
discriminators begins after 200k steps, enabling coordinated fine-tuning of all modules under both
reconstruction and adversarial losses.</p>
        <p>The total loss function is defined as
ℒ_total = λ_lag ℒ_lag + λ_dur ℒ_dur + λ_STFT ℒ_STFT + λ_RVQ ℒ_RVQ + λ_GAN ℒ_GAN + λ_FM ℒ_FM + λ_feature ℒ_feature, (23)
where each λ denotes a weighting coefficient that balances the contribution of the
corresponding term.</p>
        <p>The loss weights are empirically set to λ_lag = 0.02, λ_dur = 0.02, λ_STFT = 1.0,
λ_RVQ = 0.5, λ_GAN = 0.5, λ_FM = 1.0, and λ_feature = 0.1, with the values selected based on validation
performance across objective metrics (detailed in Section 5).</p>
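        <p>For clarity, the weighted combination in equation (23) with the weights quoted above amounts to the following; the individual loss values here are placeholders.</p>
        <preformat>
# Weighted total objective of eq. (23); loss values are placeholders.
loss_weights = {"lag": 0.02, "dur": 0.02, "stft": 1.0, "rvq": 0.5,
                "gan": 0.5, "fm": 1.0, "feature": 0.1}
losses = {"lag": 0.4, "dur": 0.3, "stft": 1.2, "rvq": 0.8,
          "gan": 0.9, "fm": 0.6, "feature": 0.2}
total = sum(loss_weights[name] * value for name, value in losses.items())
print(total)
        </preformat>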
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Implementation details</title>
      </sec>
      <sec id="sec-4-4">
        <title>4.3.1. Autoencoder</title>
        <p>Following the design of the DDSP architecture [29], we reuse their F0- and z-encoder components,
while omitting the loudness encoder, as our model does not rely on loudness features. In our
formulation, overall signal amplitude and perceived loudness are expected to be implicitly
represented within the latent variable z. The F0 encoder employs a pretrained CREPE pitch
estimator [30]. For the z-encoder, we adopt the same feature extraction and network structure as
DDSP, which maps MFCC-based representations to a compact latent embedding z(t) through a
GRU layer.</p>
        <p>The RVQ module is implemented with 8 quantizers, each comprising a codebook of 1024 entries
with dimension 128.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.3.2. Latent Generator</title>
        <p>
          Following the architecture of Matcha-TTS [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the latent generator predicts the denoised latent
sequence conditioned on the timestep t and the conditioning features μ. It consists of six convolutional-attention blocks
with residual connections and explicit time conditioning at each layer.
        </p>
        <p>
          Each block combines a residual 1D convolution (kernel size 7) for local context modeling and an
Emformer [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] layer for efficient long-range temporal attention. In contrast to Matcha-TTS, which
adopts Transformer blocks from BigVGAN [31], we replace them with Emformer layers to improve
efficiency and support streaming inference. A sinusoidal time embedding with dimension 64 is
concatenated to the input of each block, allowing the model to represent the continuous diffusion
or flow-matching timestep. Skip connections between early and deep layers facilitate information
flow. The output features are projected by a multilayer perceptron to match the target latent
dimensionality of 129 channels per frame, where 128 dimensions correspond to the latent
representation z and one additional channel represents the fundamental frequency F0.
        </p>
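        <p>A minimal sketch of such a sinusoidal timestep embedding is given below; the 64-dimensional size follows the text, while the frequency scaling is a common convention assumed here rather than the exact one used in the model.</p>
        <preformat>
# Sinusoidal embedding of the flow-matching timestep t in [0, 1].
import math
import torch

def time_embedding(t, dim=64):
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

print(time_embedding(torch.tensor([0.0, 0.25, 1.0])).shape)   # torch.Size([3, 64])
        </preformat>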
        <p>
          Following Matcha-TTS [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] the conditioning features are first processed by an encoder that
maps linguistic and prosodic inputs into a continuous representation μ. This representation serves
as the conditioning variable for the latent generator, providing frame-level context about phonetic
content, duration, and pitch.
        </p>
        <p>For inference, the final latent trajectory is obtained by numerically integrating v_t(·) using the
first-order Euler method.</p>
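        <p>The first-order Euler integration used at inference can be sketched as follows; the toy velocity function is a placeholder for the trained latent flow predictor.</p>
        <preformat>
# Euler integration of the learned velocity field from z_0 ~ N(0, I) toward z_1.
import torch

def velocity(z, mu, t):
    return mu - z                                 # placeholder field drifting z toward mu

def euler_sample(mu, n_steps=8, latent_dim=128):
    z = torch.randn(mu.shape[0], latent_dim)      # z_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((mu.shape[0], 1), i * dt)
        z = z + dt * velocity(z, mu, t)           # one Euler step of the ODE (eq. 16)
    return z

mu = torch.randn(2, 128)
print(euler_sample(mu, n_steps=8).shape)          # torch.Size([2, 128])
        </preformat>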
      </sec>
      <sec id="sec-4-6">
        <title>4.4. Baselines</title>
        <p>We used two singing voice synthesis systems for comparison: HiddenSinger and ViSinger2. For
ViSinger2, we adopted the official implementation provided in the ESPNet toolkit [32], following
the default configuration released by the authors. The HiddenSinger model was reimplemented
according to the architecture and training setup described in the original paper to ensure consistent
preprocessing and evaluation conditions. For inference, 50 denoising steps were employed to
generate the final outputs.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate the proposed model, we conducted both subjective and objective assessments on a held-out
test set of the singing voice dataset. The comparison includes two baseline systems, ViSinger2
and HiddenSinger, along with an autoencoder reconstruction reference and the ground-truth
recordings. For our model, we report performance under two inference configurations: a single
flow-matching step (1 Step) and with eight steps (8 Steps).</p>
      <sec id="sec-5-1">
        <title>5.1. Subjective Evaluation</title>
        <p>Subjective quality was evaluated on a subset of 80 synthesized samples using five independent
listeners. Each participant rated the perceptual quality on a five-point Mean Opinion Score (MOS)
scale, where higher scores indicate more natural and pleasant sound.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Objective Evaluation</title>
        <p>Objective evaluation was performed to assess spectral and prosodic accuracy. Mel-cepstral
distortion (MCD, in dB) quantifies spectral deviation between the synthesized and reference
signals. The F0 root mean square error (F0-RMSE) measures the pitch contour deviation, and the
voiced/unvoiced (V/UV) F1-score reflects the accuracy of voicing decisions. As reference F0 and
V/UV features, we used those extracted from the target recordings with the WORLD vocoder [33].</p>
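        <p>A minimal sketch of the F0-RMSE and V/UV F1 computations is given below. It assumes frame-aligned F0 tracks in which unvoiced frames are marked with zeros, which is a common convention rather than the exact WORLD-based extraction pipeline used here.</p>
        <preformat>
# Frame-level F0-RMSE and voiced/unvoiced F1 from two aligned F0 tracks.
import numpy as np

def f0_rmse(f0_ref, f0_pred):
    voiced = np.logical_and(f0_ref > 0, f0_pred > 0)   # frames voiced in both tracks
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_pred[voiced]) ** 2)))

def vuv_f1(f0_ref, f0_pred):
    ref_v, pred_v = f0_ref > 0, f0_pred > 0
    tp = np.sum(np.logical_and(ref_v, pred_v))
    precision = tp / max(np.sum(pred_v), 1)
    recall = tp / max(np.sum(ref_v), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

ref = np.array([0.0, 220.0, 222.0, 0.0, 230.0])
pred = np.array([0.0, 218.0, 0.0, 0.0, 233.0])
print(f0_rmse(ref, pred), vuv_f1(ref, pred))       # about 2.55 and 0.8
        </preformat>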
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Computational Efficiency</title>
        <p>To evaluate computational efficiency, we also report peak memory usage during inference and the
real-time factor (RTF). Peak memory usage (in GB) indicates the maximum GPU memory
consumption per sample generation, while RTF measures the ratio of synthesis time to audio
duration.</p>
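        <p>The real-time factor itself reduces to a simple ratio, sketched below with a placeholder in place of the actual synthesis call.</p>
        <preformat>
# RTF = synthesis time / audio duration; values below 1 indicate faster than real time.
import time

def real_time_factor(synthesize, duration_sec):
    start = time.perf_counter()
    synthesize()                                   # generate duration_sec seconds of audio
    return (time.perf_counter() - start) / duration_sec

print(real_time_factor(lambda: time.sleep(0.05), duration_sec=4.0))   # roughly 0.0125
        </preformat>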
        <p>Table 1 summarizes the quantitative and perceptual results across all compared systems.</p>
        <p>Figure 2 presents a direct comparison between the target and predicted mel-spectrograms. This
visualization highlights how closely the synthesized output follows the temporal and harmonic
structures of the reference recording. We include this figure to qualitatively illustrate the model’s
ability to reconstruct pitch contours, note onsets, and harmonic formant structures that are not
fully captured by objective metrics alone.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Ablation study</title>
        <p>
          To evaluate the contribution of the main architectural components, we design two ablation
comparisons aligned with widely used alternatives in neural singing voice synthesis. The first
compares our 8-step flow matching generator with a diffusion-based decoder trained under
identical data and conditioning settings. The second replaces our DDSP-based acoustic decoder
with commonly used GAN vocoders, including SiFiGAN [34] and HiFiGAN [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>All ablation systems are trained using the same dataset, optimization schedule, and text–pitch
conditioning to ensure comparability. Evaluation is performed on the same held-out test subset as
the main experiments.</p>
        <p>Table 2 reports subjective evaluation results for the ablation settings.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The results in Table 1 demonstrate that the proposed model achieves a favorable trade-off between
quality and efficiency. Compared to ViSinger2 and HiddenSinger, our approach requires 28% fewer
parameters and less than half of the peak GPU memory during inference, while maintaining
comparable perceptual quality. In particular, the 8-step configuration attains a MOS of 3.76,
approaching ViSinger2 (3.81), yet operates with nearly twice the speed, as indicated by the lower
RTF. The 1-step version achieves real-time generation (RTF &lt; 0.02), confirming that the
flow-matching formulation enables efficient synthesis with only a few iterative updates of the velocity
field v_t, rather than hundreds of stochastic denoising steps required by diffusion models.</p>
      <p>Objective metrics further show consistent spectral and prosodic accuracy. The slight increase
in F0-RMSE and V/UV F1 compared to the baselines suggests minor pitch and voicing deviations,
which may stem from the reduced latent dimensionality. However, the lower MCD for the 8-step
configuration indicates improved spectral coherence and harmonic balance, particularly under
multi-step refinement that progressively aligns the predicted latent with the target manifold.</p>
      <p>
        While diffusion and flow-based generative models both rely on stochastic sampling, excessive
stochasticity can sometimes introduce pitch or timing variability across runs, slightly degrading
consistency. Our method mitigates this by learning smoother velocity fields through flow
matching, requiring far fewer steps than diffusion-based denoising while retaining stochastic
flexibility for expressive control. It is also important to note that the original HiddenSinger [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was
trained on a substantially larger dataset (approximately 150 hours of recording), whereas our
evaluation targets smaller datasets to assess generalization in low-resource scenarios. This
difference in data scale likely contributes to the observed gap in subjective quality.
      </p>
      <p>The faster and competitive performance of our system stems from three main design factors.
First, the DDSP-based decoder operates directly on compact spectral parameters, the envelope S
and aperiodicity A, rather than full-resolution spectrograms, reducing computational cost while
preserving perceptual detail. Second, modeling in the latent space substantially lowers the
prediction dimensionality, allowing the flow-matching network to learn smoother dynamics with
fewer parameters. Third, the flow-matching objective enables efficient iterative generation by
estimating continuous velocity fields instead of performing high-step stochastic denoising.
Although the individual contribution of each factor has not been explicitly analyzed, their
combined effect enables the proposed system to achieve near state-of-the-art quality with faster
synthesis and significantly reduced memory usage.</p>
      <p>In addition, Figure 2 qualitatively supports these findings by showing that the predicted
mel-spectrogram closely matches the reference in both temporal and spectral structure. The
alignment of formants and harmonic trajectories indicates that the proposed model effectively
captures the fine-grained frequency evolution of the singing voice, complementing the quantitative
metrics in Table 1.</p>
      <p>The ablation results show that the flow matching generator achieves higher perceptual quality
than the diffusion-based alternative under identical training conditions, while avoiding the latency
overhead associated with multi-step sampling. Furthermore, the DDSP-based autoencoder
demonstrates perceptual performance comparable to SiFiGAN and superior to HiFiGAN.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we presented a singing voice synthesis model architecture that integrates a
DDSP-based autoencoder with a flow-matching latent predictor. The proposed model effectively combines
explicit signal decomposition with continuous latent flow modeling, enabling expressive and
high-fidelity singing voice generation. Experimental results demonstrate that our system achieves
comparable perceptual quality to state-of-the-art approaches while requiring significantly fewer
parameters. These results suggest that incorporating flow-based latent modeling within a DDSP
framework provides an efficient and interpretable alternative for neural singing synthesis. Future
work will focus on extending the proposed method toward multi-singer and cross-lingual
scenarios, as well as exploring finer control over expressive parameters such as vibrato and
dynamics.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT and Gemini in order to check grammar
and spelling. After using these tools/services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      <p>[25] T. Luo, X. Miao, W. Duan, WaveFM: A high-fidelity and efficient vocoder based on flow
matching, arXiv preprint arXiv:2503.16689 (2025). doi:10.48550/arXiv.2503.16689.
[26] Y. Hono, K. Hashimoto, K. Oura, Y. Nankaku, K. Tokuda, Sinsy: A deep neural network-based
singing voice synthesis system, IEEE/ACM Trans. Audio Speech Lang. Process. 29 (2021)
2803–2815. doi:10.1109/TASLP.2021.3104165.
[27] I. Ogawa, M. Morise, Tohoku Kiritan singing database: A singing database for statistical
parametric singing synthesis using Japanese pop songs, Acoust. Sci. Technol. 42 (2021)
140–145. doi:10.1250/ast.42.140.
[28] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980 (2017). doi:10.48550/arXiv.1412.6980.
[29] J. Engel, L. Hantrakul, C. Gu, A. Roberts, DDSP: Differentiable digital signal processing, arXiv
preprint arXiv:2001.04643 (2020). doi:10.48550/arXiv.2001.04643.
[30] J. W. Kim, J. Salamon, P. Li, J. P. Bello, CREPE: A convolutional representation for pitch
estimation, arXiv preprint arXiv:1802.06182 (2018). doi:10.48550/arXiv.1802.06182.
[31] S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, S. Yoon, BigVGAN: A universal neural vocoder
with large-scale training, arXiv preprint arXiv:2206.04658 (2023).
doi:10.48550/arXiv.2206.04658.
[32] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Soplin, J. Heymann, M.
Wiesner, N. Chen, A. Renduchintala, T. Ochiai, ESPnet: End-to-end speech processing toolkit,
arXiv preprint arXiv:1804.00015 (2018). doi:10.48550/arXiv.1804.00015.
[33] M. Morise, F. Yokomori, K. Ozawa, WORLD: A vocoder-based high-quality speech synthesis
system for real-time applications, IEICE Trans. Inf. Syst. 99 (2016) 1877–1884.
doi:10.1587/transinf.2015EDP7457.
[34] R. Yoneyama, Y.-C. Wu, T. Toda, Source-filter hifi-gan: Fast and pitch controllable
high-fidelity neural vocoder, arXiv preprint arXiv:2210.15533 (2023).
doi:10.48550/arXiv.2210.15533.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kenmochi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ohshita</surname>
          </string-name>
          ,
          <article-title>Vocaloid - Commercial singing synthesizer based on sample concatenation</article-title>
          ,
          <source>in: Proceedings of the Interspeech</source>
          <year>2007</year>
          : 8th Annual Conference of the International Speech Communication Association, International Speech Communication Association, Lausanne, Switzerland,
          <year>2007</year>
          , pp.
          <fpage>4009</fpage>
          -
          <lpage>4010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Oura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nankaku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tokuda</surname>
          </string-name>
          ,
          <article-title>Recent development of the HMM-based singing voice synthesis system - Sinsy</article-title>
          , in
          <source>: Proceedings of the 7th ISCA Workshop on Speech Synthesis</source>
          , SSW-7 '2010, International Speech Communication Association, Lausanne, Switzerland,
          <year>2010</year>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          , L. Zhou,
          <article-title>XiaoiceSing: A high-quality and integrated singing voice synthesis system</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>06261</volume>
          (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2006</year>
          .
          <volume>06261</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>DeepSinger: Singing voice synthesis with data mined from the web</article-title>
          , arXiv preprint arXiv:
          <year>2007</year>
          .
          <volume>04590</volume>
          (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2007</year>
          .
          <volume>04590</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Cong,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , M. Bi,
          <article-title>ViSinger: Variational inference with adversarial learning for end-to-end singing voice synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2110.08813</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2110.08813.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>HiddenSinger: High-quality singing voice synthesis via neural audio codec and latent diffusion models</article-title>
          ,
          <source>Neural Netw</source>
          .
          <volume>181</volume>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          .1016/j.neunet.
          <year>2024</year>
          .
          <volume>106762</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z. Zhao,</surname>
          </string-name>
          <article-title>DiffSinger: Singing voice synthesis via shallow diffusion mechanism</article-title>
          ,
          <source>arXiv preprint arXiv:2105.02446</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2105.02446.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Xue,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Gong,
          <article-title>ViSinger 2: High-fidelity end-to-end singing voice synthesis enhanced by digital signal processing synthesizer</article-title>
          ,
          <source>arXiv preprint arXiv:2211.02903</source>
          (
          <year>2022</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2211.02903.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Guo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z. Zhao,</surname>
          </string-name>
          <article-title>TCSinger 2: Customizable multilingual zero-shot singing voice synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2505.14910</source>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2505.14910.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Pan,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z. Zhao,</surname>
          </string-name>
          <article-title>TechSinger: Technique-controllable multilingual singing voice synthesis via flow matching</article-title>
          ,
          <source>arXiv preprint arXiv:2502.12572</source>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2502.12572.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Generative pre-training for speech with flow matching</article-title>
          ,
          <source>arXiv preprint arXiv:2310.16338</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2310.16338.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <fpage>F5</fpage>
          -TTS:
          <article-title>A fairytaler that fakes fluent and faithful speech with flow matching</article-title>
          ,
          <source>arXiv preprint arXiv:2410.06885</source>
          (
          <year>2025</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2410.06885.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tu</surname>
          </string-name>
          , J. Beskow, É. Székely,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Henter</surname>
          </string-name>
          ,
          <string-name>
            <surname>Matcha-TTS</surname>
          </string-name>
          :
          <article-title>A fast TTS architecture with conditional flow matching</article-title>
          ,
          <source>arXiv preprint arXiv:2309.03199</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2309.03199.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Haghighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          , L. Yao,
          <article-title>LatentSpeech: Latent diffusion for text-to-speech generation</article-title>
          ,
          <source>arXiv preprint arXiv:2412.08117</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2412.08117.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>VoiceFlow: Efficient text-to-speech with rectified flow matching</article-title>
          ,
          <source>arXiv preprint arXiv:2309.05027</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2309.05027.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>ProsodyFlow: High-fidelity text-to-speech through conditional flow matching and prosody modeling with large speech language models</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          , COLING 2025, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA,
          <year>2025</year>
          , pp.
          <fpage>7748</fpage>
          -
          <lpage>7753</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Serai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Ultra-lightweight neural differential DSP vocoder for high-quality speech synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2401.10460</source>
          (<year>2024</year>). doi: 10.48550/arXiv.2401.10460.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-F.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          , Emformer:
          <article-title>Efficient memory transformer-based acoustic model for low latency streaming speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2010.10759</source>
          (<year>2020</year>). doi: 10.48550/arXiv.2010.10759.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nercessian</surname>
          </string-name>
          ,
          <article-title>Differentiable WORLD synthesizer-based neural vocoder with application to end-to-end audio style transfer</article-title>
          ,
          <source>arXiv preprint arXiv:2208.07282</source>
          (<year>2023</year>). doi: 10.48550/arXiv.2208.07282.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.-Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-R.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruzenak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>DDSP-based singing vocoders: A new subtractive-based synthesizer and a comprehensive evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2208.04756</source>
          (<year>2022</year>). doi: 10.48550/arXiv.2208.04756.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Défossez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <article-title>High-fidelity neural audio compression</article-title>
          ,
          <source>arXiv preprint arXiv:2210.13438</source>
          (<year>2022</year>). doi: 10.48550/arXiv.2210.13438.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bae</surname>
          </string-name>
          , HiFi-GAN:
          <article-title>Generative adversarial networks for efficient and high-fidelity speech synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2010.05646</source>
          (<year>2020</year>). doi: 10.48550/arXiv.2010.05646.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y. K.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Smolley</surname>
          </string-name>
          ,
          <article-title>Least squares generative adversarial networks</article-title>
          ,
          <source>arXiv preprint arXiv:1611.04076</source>
          (<year>2017</year>). doi: 10.48550/arXiv.1611.04076.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lipman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T. Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ben-Hamu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Flow matching for generative modeling</article-title>
          ,
          <source>arXiv preprint arXiv:2210.02747</source>
          (<year>2023</year>). doi: 10.48550/arXiv.2210.02747.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>