Abstract.
Text-to-audio generation aims to synthesize audio clips conditioned on textual inputs. Recent text-to-audio generation systems typically rely on Variational Autoencoders (VAEs), performing text-to-latent and latent-to-waveform synthesis within the VAE latent space. Although VAEs excel at compression and reconstruction, their latent representations inherently encode low-level acoustic details rather than semantically discriminative information, entangling event semantics and complicating the training of text-to-latent prediction. In contrast, the semantic latents extracted by a semantic representation encoder are well structured and discriminative, making text-to-latent mappings easier to establish. Accordingly, we discard the VAE module and propose SARA, a novel semantic-driven generative autoencoder. It operates on semantic latents and adopts a generative architecture, rather than conventional reconstruction-based training, to alleviate distortion. SARA thus enables audio generation to be performed directly in the semantically rich latent space, yielding a novel text-to-audio generation system that is entirely independent of acoustic VAEs. As a result, our text-to-audio generation system achieves the lowest Fréchet Distance and Fréchet Audio Distance on both the AudioCaps test set and the cross-domain Clotho test set. Beyond improved generation performance, SARA also serves as a promising step towards unifying audio understanding and generation within a shared latent space.
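To make the pipeline concrete, the sketch below illustrates the wiring the abstract describes: a (frozen) semantic encoder supplies the latent space, and a generative decoder maps predicted semantic latents to waveform, with no VAE in the loop. This is a minimal PyTorch sketch under stated assumptions; the class names (`SemanticEncoder`, `GenerativeDecoder`), dimensions, and layer choices are illustrative placeholders, not the paper's actual architecture or API.

```python
# Schematic of the semantic-latent pipeline described in the abstract.
# All module names and dimensions are hypothetical stand-ins.
import torch
import torch.nn as nn


class SemanticEncoder(nn.Module):
    """Stand-in for a pretrained semantic representation encoder;
    it would typically be frozen while the generator is trained."""
    def __init__(self, n_mels=128, dim=768):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)  # placeholder: mel frames -> semantic latents

    def forward(self, mel):                 # mel: (B, T, n_mels)
        return self.proj(mel)               # (B, T, dim) semantic latents


class GenerativeDecoder(nn.Module):
    """SARA-style decoder sketch: semantic latents -> waveform, trained
    with a generative objective rather than plain reconstruction."""
    def __init__(self, dim=768, hop=256):
        super().__init__()
        self.to_wave = nn.Linear(dim, hop)   # placeholder upsampling head

    def forward(self, z):                    # z: (B, T, dim)
        return self.to_wave(z).flatten(1)    # (B, T * hop) waveform sketch


if __name__ == "__main__":
    # Text-to-audio then runs entirely in the semantic latent space:
    # text embedding -> predicted semantic latents -> waveform.
    enc, dec = SemanticEncoder(), GenerativeDecoder()
    mel = torch.randn(2, 100, 128)           # dummy mel-spectrogram batch
    z = enc(mel)                             # semantic latents (text-to-latent target)
    wav = dec(z)                             # decoded waveform sketch
    print(z.shape, wav.shape)                # (2, 100, 768), (2, 25600)
```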
Table: TTA Model Comparison