SemanticVocoder

Bridging Audio Generation and Audio Understanding via Semantic Latents

| Paper | Code |

Zeyu Xie1,2, Chenxing Li2, Qiao Jin1, Xuenan Xu3, Guanrou Yang2,4, Wenfu Wang2, Mengyue Wu4, Dong Yu2, Yuexian Zou1,*

1ADSP Lab, Peking University 2Tencent 3Shanghai AI Lab 4X-LANCE Lab, SJTU

zeyuxie25@stu.pku.edu.cn

Abstract

Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, which entangles event semantics and complicates the training of generative models. To address this, we discard VAE acoustic latents in favor of semantic encoder latents and propose SemanticVocoder, a generative vocoder that synthesizes waveforms directly from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improving generation performance, SemanticVocoder also represents a promising step toward unifying audio understanding and generation within a shared semantic space.



SemanticVocoder Construction


Pipeline

An overview of SemanticVocoder training, downstream TTA training, and downstream task inference. Blue arrows (SemanticVocoder training): the input audio is fed into a semantic encoder to extract semantic latents, which serve as conditions for training the flow-matching network to predict waveforms. Red arrows (generative audio DiT training): the input text is processed by a text encoder to obtain textual features, which are used to train the DiT model to generate semantic latents. Black arrows (downstream task inference): equipped with SemanticVocoder, both audio generation and understanding tasks can be performed within the same semantic latent space.
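The conditional flow-matching objective in the blue-arrow path can be sketched at toy scale as follows. This is a minimal illustration, not the paper's implementation: the "network" is a single linear map, and the shapes, the fake semantic latent, and all variable names are hypothetical stand-ins for the real semantic encoder and flow-matching vocoder.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, cdim = 8, 4                       # toy waveform-frame and latent sizes (hypothetical)
W = rng.normal(scale=0.1, size=(dim, dim + 1 + cdim))

def flow_net(x_t, t, c, W):
    # Toy linear "network"; the real model is a large conditional network.
    return W @ np.concatenate([x_t, [t], c])

lr, losses = 1e-2, []
for step in range(3000):
    x1 = rng.normal(size=dim)          # target waveform frame
    c = 0.5 * x1[:cdim]                # fake "semantic latent" correlated with x1
    x0 = rng.normal(size=dim)          # noise sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                 # flow-matching target velocity
    err = flow_net(x_t, t, c, W) - v_target
    losses.append(np.sum(err ** 2))
    # SGD step on the squared velocity-prediction error.
    W -= lr * np.outer(err, np.concatenate([x_t, [t], c]))
```

At inference the trained network would be integrated along t (e.g. with an ODE solver) to map noise to a waveform conditioned on the semantic latent.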



Comparison

SemanticVocoder pioneers the direct generation of waveforms from semantic latents, thereby bridging understanding-oriented representations and generation tasks. Left: three sub-tasks from the HEAR benchmark are used to evaluate the latent representations, with linear classifiers trained on frozen latents. The semantic latents exhibit a more discriminative semantic structure than the acoustic VAE latents used in previous work. Right: for the downstream text-to-audio task, a text-to-latent model predicts latents conditioned on the input text; the predicted latents are then fed into SemanticVocoder for audio synthesis, yielding superior performance.
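The linear-probe protocol described above (classifiers trained on frozen latents) can be illustrated with synthetic data. The two feature sets below are hypothetical stand-ins: one clusters by class (mimicking discriminative semantic latents), one barely does (mimicking entangled acoustic latents); the probe itself is plain logistic regression.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_latents(n, dim, class_sep):
    # Synthetic stand-in for frozen audio latents: two Gaussian classes
    # whose means are class_sep apart per dimension (hypothetical data).
    y = rng.integers(0, 2, size=n)
    centers = np.stack([np.zeros(dim), np.full(dim, class_sep)])
    return centers[y] + rng.normal(size=(n, dim)), y

def linear_probe_acc(x, y, steps=500, lr=0.1):
    # Logistic-regression probe; the features x stay fixed throughout.
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        g = p - y
        w -= lr * x.T @ g / len(y)
        b -= lr * g.mean()
    return ((x @ w + b > 0) == (y == 1)).mean()

semantic_x, sy = make_latents(400, 16, class_sep=1.0)   # well-separated classes
acoustic_x, ay = make_latents(400, 16, class_sep=0.1)   # entangled classes
acc_semantic = linear_probe_acc(semantic_x, sy)
acc_acoustic = linear_probe_acc(acoustic_x, ay)
```

Because the probe is linear and the features are frozen, its accuracy directly reflects how linearly separable, i.e. how discriminative, each latent space is.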



Visualization


Visualization of different latents on HEAR-ESC50; the 10 most frequent categories are shown. Each audio feature is aggregated by mean pooling along the temporal axis and projected into 2D via t-SNE. Compared with the VAE acoustic latents used in baseline models, the semantic latents exhibit a more discriminative structure and superior semantic disentanglement.
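The two-step recipe in the caption (mean pooling over time, then t-SNE to 2D) could be sketched as below. The frame-level latents here are randomly generated stand-ins with hypothetical shapes; scikit-learn's `TSNE` is assumed to be available.

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

rng = np.random.default_rng(2)

# Hypothetical frame-level latents: (n_clips, n_frames, dim).
latents = rng.normal(size=(30, 50, 32))
# Offset each clip by a per-class shift so the toy classes are separable.
labels = np.repeat(np.arange(3), 10)
latents += labels[:, None, None] * 2.0

# 1) Aggregate each clip by mean pooling along the temporal axis.
pooled = latents.mean(axis=1)          # -> (30, 32)

# 2) Project the pooled features into 2D with t-SNE.
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(pooled)
```

Each row of `emb` is one clip's 2D coordinate; coloring the points by `labels` would reproduce a plot of the kind described in the caption.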

Generated Samples



TTA Model Comparison