SemanticVocoder

Bridging Audio Generation and Audio Understanding via Semantic Latents

| Paper | Code |

Zeyu Xie1,2, Chenxing Li2, Qiao Jin1, Xuenan Xu3, Guanrou Yang2,4, Wenfu Wang2, Mengyue Wu4, Dong Yu2, Yuexian Zou1,*

1ADSP Lab, Peking University 2Tencent 3Shanghai AI Lab 4X-LANCE Lab, SJTU

zeyuxie25@stu.pku.edu.cn

Abstract

Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, which entangles event semantics and complicates the training of generative models. To address this, we discard VAE acoustic latents in favor of semantic encoder latents and propose SemanticVocoder, a generative vocoder that synthesizes waveforms directly from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improving generation performance, SemanticVocoder also represents a promising step toward unifying audio understanding and generation within a shared semantic space.



SemanticVocoder Construction


Pipeline

An overview of SemanticVocoder training, downstream TTA training, and downstream task inference. Blue arrows (SemanticVocoder training): the input audio is fed into a semantic encoder to extract semantic latents, which serve as conditions for training the flow-matching network to predict waveforms. Red arrows (generative audio DiT training): the input text is processed by a text encoder to obtain textual features, which are used to train the DiT model to generate semantic latents. Black arrows (downstream task inference): equipped with SemanticVocoder, both audio generation and understanding tasks can be performed within the same semantic latent space.
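The conditional flow-matching objective in the blue-arrow path can be sketched at toy scale as follows. This is a minimal illustration, not the paper's implementation: the "network" is a single linear map, and the shapes, the fake semantic latent, and all variable names are hypothetical stand-ins for the real semantic encoder and flow-matching vocoder.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, cdim = 8, 4                       # toy waveform-frame and latent sizes (hypothetical)
W = rng.normal(scale=0.1, size=(dim, dim + 1 + cdim))

def flow_net(x_t, t, c, W):
    # Toy linear "network"; the real model is a large conditional network.
    return W @ np.concatenate([x_t, [t], c])

lr, losses = 1e-2, []
for step in range(3000):
    x1 = rng.normal(size=dim)          # target waveform frame
    c = 0.5 * x1[:cdim]                # fake "semantic latent" correlated with x1
    x0 = rng.normal(size=dim)          # noise sample
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                 # flow-matching target velocity
    err = flow_net(x_t, t, c, W) - v_target
    losses.append(np.sum(err ** 2))
    # SGD step on the squared velocity-prediction error.
    W -= lr * np.outer(err, np.concatenate([x_t, [t], c]))
```

At inference the trained network would be integrated along t (e.g. with an ODE solver) to map noise to a waveform conditioned on the semantic latent.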



Comparison

SemanticVocoder pioneers the direct generation of waveforms from semantic latents, thereby bridging understanding-oriented representations and generation tasks. Left: three sub-tasks from the HEAR benchmark are used to evaluate the latent representations, with linear classifiers trained on frozen latents. The semantic latents exhibit a more discriminative semantic structure than the acoustic VAE latents used in previous work. Right: for the downstream text-to-audio task, a text-to-latent model predicts latents conditioned on the input text; the predicted latents are then fed into SemanticVocoder for audio synthesis, yielding superior performance.
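The linear-probe protocol described above (classifiers trained on frozen latents) can be illustrated with synthetic data. The two feature sets below are hypothetical stand-ins: one clusters by class (mimicking discriminative semantic latents), one barely does (mimicking entangled acoustic latents); the probe itself is plain logistic regression.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_latents(n, dim, class_sep):
    # Synthetic stand-in for frozen audio latents: two Gaussian classes
    # whose means are class_sep apart per dimension (hypothetical data).
    y = rng.integers(0, 2, size=n)
    centers = np.stack([np.zeros(dim), np.full(dim, class_sep)])
    return centers[y] + rng.normal(size=(n, dim)), y

def linear_probe_acc(x, y, steps=500, lr=0.1):
    # Logistic-regression probe; the features x stay fixed throughout.
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        g = p - y
        w -= lr * x.T @ g / len(y)
        b -= lr * g.mean()
    return ((x @ w + b > 0) == (y == 1)).mean()

semantic_x, sy = make_latents(400, 16, class_sep=1.0)   # well-separated classes
acoustic_x, ay = make_latents(400, 16, class_sep=0.1)   # entangled classes
acc_semantic = linear_probe_acc(semantic_x, sy)
acc_acoustic = linear_probe_acc(acoustic_x, ay)
```

Because the probe is linear and the features are frozen, its accuracy directly reflects how linearly separable, i.e. how discriminative, each latent space is.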



Visualization


Visualization of different latents on HEAR-ESC50; the 10 most frequent categories are shown. Each audio feature is aggregated by mean pooling along the temporal axis and projected into 2D via t-SNE. Compared with the VAE acoustic latents used in baseline models, the semantic latents exhibit a more discriminative structure and superior semantic disentanglement.
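The two-step recipe in the caption (mean pooling over time, then t-SNE to 2D) could be sketched as below. The frame-level latents here are randomly generated stand-ins with hypothetical shapes; scikit-learn's `TSNE` is assumed to be available.

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

rng = np.random.default_rng(2)

# Hypothetical frame-level latents: (n_clips, n_frames, dim).
latents = rng.normal(size=(30, 50, 32))
# Offset each clip by a per-class shift so the toy classes are separable.
labels = np.repeat(np.arange(3), 10)
latents += labels[:, None, None] * 2.0

# 1) Aggregate each clip by mean pooling along the temporal axis.
pooled = latents.mean(axis=1)          # -> (30, 32)

# 2) Project the pooled features into 2D with t-SNE.
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(pooled)
```

Each row of `emb` is one clip's 2D coordinate; coloring the points by `labels` would reproduce a plot of the kind described in the caption.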

Generated Samples



TTA Model Comparison