Abstract.
Recent advances in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature of audio content, are underrepresented in mainstream models because they offer no precise temporal control: users cannot accurately control the timestamps of sound events with free-form text. We identify a significant contributing factor: the absence of high-quality, temporally aligned audio-text datasets, which are essential for training models with temporal control. The more detailed the annotations, the better a model can learn the precise relationship between audio outputs and temporal textual prompts. We therefore present AudioTime, a strongly aligned audio-text dataset. It provides text annotations rich in temporal information, such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. We also provide STEAM, a comprehensive metric for evaluating the temporal control performance of generative models. The data and evaluation scripts are available at AudioTime and STEAMtool, respectively.
Demo Navigation.
Ordering: the audio contains two sequential events, each occurring a random number of times. The alignment signal is the sequence of event occurrences. The metadata records the onset and offset timestamps of each event.
Duration: the audio contains several events, each occurring once. The alignment signal is the duration of each event. The metadata records the duration of each event occurrence.
Frequency: any event can occur any number of times. The alignment signal is the frequency of event occurrences. The metadata records the onset timestamp of each event.
Timestamp: any event can occur any number of times. The alignment signal is the onset and offset timestamps of event occurrences. The metadata records the onset and offset timestamps of each event.
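To make the four annotation types concrete, the sketch below shows a hypothetical AudioTime-style metadata entry and how duration and frequency signals can be derived from onset/offset timestamps. The field names and caption are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical annotation entry (field names are illustrative assumptions,
# not the dataset's actual schema).
annotation = {
    "caption": "A dog barks two times, followed by a man speaking.",
    "events": [
        {"label": "dog barking", "onset": 0.50, "offset": 1.20},
        {"label": "dog barking", "onset": 1.80, "offset": 2.40},
        {"label": "man speaking", "onset": 3.00, "offset": 5.10},
    ],
}

def event_durations(entry):
    """Derive per-occurrence durations (seconds) from onset/offset timestamps."""
    return [round(e["offset"] - e["onset"], 2) for e in entry["events"]]

def event_frequency(entry, label):
    """Count how many times a given sound event occurs."""
    return sum(1 for e in entry["events"] if e["label"] == label)

print(event_durations(annotation))                 # [0.7, 0.6, 2.1]
print(event_frequency(annotation, "dog barking"))  # 2
```

Because every occurrence carries its own onset and offset, the duration, frequency, and ordering signals all fall out of the same timestamp records.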
To better test and compare the temporal controllability of generative models, we propose STEAM (Strongly TEmporally-Aligned evaluation Metric).
STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text.
An audio-text grounding model is employed to detect the onsets and offsets of events in the generated audio.
STEAM then assesses control performance by comparing the detected timestamps against the control signal specified in the input free text.
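The comparison between detected and requested timestamps can be sketched as a tolerance-based matching score. The greedy matching rule and 0.5 s tolerance below are assumptions for illustration, not STEAM's official definition.

```python
# Illustrative timestamp-based scoring in the spirit of STEAM; the matching
# rule and tolerance are assumptions, not the metric's official definition.
def match_events(targets, detections, tol=0.5):
    """Greedily match detected (onset, offset) pairs to target pairs
    whose onset and offset both fall within `tol` seconds."""
    unmatched = list(detections)
    hits = 0
    for t_on, t_off in targets:
        for d in unmatched:
            d_on, d_off = d
            if abs(d_on - t_on) <= tol and abs(d_off - t_off) <= tol:
                hits += 1
                unmatched.remove(d)
                break
    return hits

def timestamp_f1(targets, detections, tol=0.5):
    """Segment-level F1 between requested and detected event boundaries."""
    if not targets or not detections:
        return 0.0
    hits = match_events(targets, detections, tol)
    if hits == 0:
        return 0.0
    precision = hits / len(detections)
    recall = hits / len(targets)
    return 2 * precision * recall / (precision + recall)

targets = [(0.5, 1.2), (3.0, 5.1)]      # timestamps requested by the text
detections = [(0.6, 1.1), (3.4, 5.0)]   # timestamps found by the grounding model
print(timestamp_f1(targets, detections))  # 1.0
```

A higher score indicates that the generated audio places events closer to the timestamps the input text requested.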