SARA

Semantic Representation Generative Autoencoder for Text-to-Audio Generation without Variational Autoencoder

| Paper | Code |

Zeyu Xie1,2, Chenxing Li2, Qiao Jin1, Xuenan Xu3, Guanrou Yang2,4, Wenfu Wang2, Mengyue Wu4, Dong Yu2, Yuexian Zou1,*

1ADSP Lab, Peking University 2Tencent 3Shanghai AI Lab 4X-LANCE Lab, SJTU

zeyuxie25@stu.pku.edu.cn

Abstract

Text-to-audio generation aims to synthesize audio clips conditioned on textual inputs. Recent text-to-audio generation systems typically rely on Variational Autoencoders (VAEs), performing text-to-latent and latent-to-waveform synthesis within the VAE latent space. Although VAEs excel at compression and reconstruction, their latent representations inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of text-to-latent prediction. In contrast, the semantic latents extracted by a semantic representation encoder possess well-structured discriminability, making text-to-latent mappings easier to establish. Accordingly, we discard the VAE module and propose SARA, a novel semantic-driven generative autoencoder. It leverages semantic latents and adopts a generative architecture rather than conventional reconstruction-based training to alleviate distortion. SARA ultimately enables audio generation to be performed directly in the semantically rich latent space, yielding a novel text-to-audio generation system that is entirely independent of acoustic-based VAEs. As a result, our text-to-audio generation system achieves the lowest Fréchet Distance and Fréchet Audio Distance on both the AudioCaps test set and the cross-domain Clotho test set. Beyond improved generation performance, SARA also serves as a promising step towards unifying audio understanding and generation within a shared latent space.

SARA Construction


Pipeline

An overview of the SARA autoencoder training, text-to-latent DiT training, and inference pipeline. Blue arrows (SARA training): the input audio is fed into the SARA semantic encoder to extract semantic latents, which serve as conditions to train the SARA flow-matching-based decoder for latent-to-waveform prediction. Red arrows (generative audio DiT training): the input text is processed by a text encoder to obtain textual features, which are used to train the DiT model for text-to-semantic-latent generation. Black arrows (downstream task inference): equipped with SARA, both audio generation and understanding tasks can be performed within the same semantic latent space.
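At inference time, the flow-matching decoder integrates a learned velocity field from Gaussian noise toward a waveform-like sample, conditioned on the semantic latents. The minimal numpy sketch below illustrates that integration step with a plain Euler solver; `velocity_fn` and its signature are illustrative stand-ins for SARA's trained decoder network, and the toy field simply drives the sample toward a target, not a real waveform.

```python
import numpy as np

def decode_flow_matching(z_sem, velocity_fn, n_steps=50, seed=0):
    """Integrate a velocity field from noise x_0 ~ N(0, I) at t=0 toward a
    sample x_1 at t=1, conditioned on semantic latents z_sem.
    velocity_fn(x, t, z_sem) stands in for the flow-matching decoder network
    (hypothetical name/signature, not the paper's actual API)."""
    x = np.random.default_rng(seed).standard_normal(z_sem.shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, z_sem)  # simple Euler step
    return x

def toy_velocity(x, t, z_sem):
    # Marginal velocity (z_sem - x) / (1 - t) of a linear noise->target path;
    # a real decoder would predict this field with a neural network.
    return (z_sem - x) / max(1.0 - t, 1e-3)

z = np.ones((4, 8))  # stand-in semantic latents
sample = decode_flow_matching(z, toy_velocity)
# the integrated sample ends close to the conditioning target under this toy field
```

In SARA the generative (flow-matching) decoder replaces a reconstruction-trained VAE decoder, which is the architectural choice the abstract credits with alleviating distortion.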



Comparison

SARA pioneers the generation of waveforms directly from semantic latents, thereby bridging understanding-oriented representations and generation tasks. Left: three sub-tasks from the HEAR benchmark are employed to evaluate the latent representations, in which linear classifiers are trained on frozen latents. The semantic latents exhibit a more discriminative semantic structure than the acoustic VAE latents used in previous work. Right: for the downstream text-to-audio task, a text-to-latent model predicts latents conditioned on input text. The predicted latents are then fed into SARA for audio synthesis, yielding superior performance.
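The linear-probe protocol above can be sketched compactly: freeze the latents, fit a linear classifier, and compare test accuracy across representations. The sketch below uses a least-squares fit on one-hot labels as a lightweight stand-in for the probes used in HEAR-style evaluations (which typically train a shallow classifier such as logistic regression); the toy data is hypothetical.

```python
import numpy as np

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test):
    """Fit a linear classifier on frozen latents via least squares on
    one-hot targets, then report test accuracy."""
    n_classes = int(y_train.max()) + 1
    Y = np.eye(n_classes)[y_train]                                # one-hot
    X = np.hstack([feats_train, np.ones((len(feats_train), 1))])  # add bias
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Xt = np.hstack([feats_test, np.ones((len(feats_test), 1))])
    pred = (Xt @ W).argmax(axis=1)
    return float((pred == y_test).mean())

# Toy, well-separated two-class "latents": a discriminative representation
# yields a high probe accuracy; an entangled one would score lower.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-2.0, 0.5, (50, 16)),
                   rng.normal(+2.0, 0.5, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
acc = linear_probe_accuracy(feats[::2], labels[::2], feats[1::2], labels[1::2])
```

Because the classifier is linear and the latents stay frozen, probe accuracy directly measures how linearly separable the event semantics are in each latent space.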



Visualization


Visualization of different latents on HEAR-ESC50, where the 10 most frequent categories are presented. Each audio feature is aggregated by mean pooling along the temporal axis and projected into 2D space via t-SNE. Compared to the VAE acoustic latents used in baseline models, semantic latents exhibit a more discriminative structure and superior semantic disentanglement.
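The pooling-and-projection step behind this figure can be sketched as below. The page projects with t-SNE (e.g. `sklearn.manifold.TSNE`); to keep the sketch dependency-free, a PCA projection via numpy's SVD is substituted for the 2D embedding step, and the clip latents are random stand-ins.

```python
import numpy as np

def pool_and_project(latents):
    """Mean-pool each clip's latents (time, dim) over the temporal axis,
    then project the pooled vectors to 2D. PCA via SVD is used here as a
    stand-in for the t-SNE projection in the actual figure."""
    pooled = np.stack([z.mean(axis=0) for z in latents])  # (n_clips, dim)
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                            # (n_clips, 2)

rng = np.random.default_rng(0)
clips = [rng.standard_normal((100, 32)) for _ in range(10)]  # fake clip latents
xy = pool_and_project(clips)
```

With real latents, one point per clip is then scatter-plotted and colored by ESC-50 category, so tight same-color clusters indicate a more disentangled representation.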

Generated Samples



TTA Model Comparison