AudioTime

A Temporally-aligned Audio-text Benchmark Dataset

|Paper|Code|

Zeyu Xie¹,², Xuenan Xu¹, Zhizheng Wu²,³, Mengyue Wu¹

¹X-LANCE Lab, Shanghai Jiao Tong University
²Shanghai AI Lab, ³Chinese University of Hong Kong, Shenzhen

Abstract.

Recent advancements in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature of audio content, are currently underrepresented in mainstream models because they offer no precise temporal control. Specifically, users cannot accurately control the timestamps of sound events using free-form text. A significant factor is the absence of high-quality, temporally-aligned audio-text datasets, which are essential for training models with temporal control. The more detailed the annotations, the better the models can understand the precise relationship between audio outputs and temporal textual prompts. Therefore, we present a strongly aligned audio-text dataset, AudioTime. It provides text annotations rich in temporal information such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. We also provide a comprehensive metric, STEAM, to evaluate the temporal control performance of models. The data and evaluation scripts are available at AudioTime and STEAMtool, respectively.


AudioTime Dataset Construction Pipeline


Dataset construction pipeline: (1) data curation and filtration; (2) simulating audio and organizing metadata; (3) generating captions using LLM agents.
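
Step (2) can be pictured as laying event clips on a fixed-length timeline and recording where each one lands. The sketch below is only an illustration under stated assumptions: the 10-second clip length is inferred from the examples on this page, and the function name `place_events` and the random-gap placement strategy are hypothetical, not the authors' simulation code.

```python
import random


def place_events(durations, clip_length=10.0, seed=0):
    """Place clips one after another with random gaps (hypothetical strategy).

    Returns a list of (onset, offset) pairs, in seconds, in playback order.
    """
    rng = random.Random(seed)
    total = sum(durations)
    if total > clip_length:
        raise ValueError("events do not fit in the clip")
    slack = clip_length - total
    # Sorted draws act as cumulative silence inserted before each event,
    # so consecutive events can never overlap.
    cumulative_gaps = sorted(rng.uniform(0.0, slack) for _ in durations)
    segments = []
    elapsed = 0.0  # total duration of the events placed so far
    for gap, duration in zip(cumulative_gaps, durations):
        onset = gap + elapsed
        segments.append((round(onset, 3), round(onset + duration, 3)))
        elapsed += duration
    return segments


if __name__ == "__main__":
    # One short event followed by three repeats of a second event.
    print(place_events([0.766, 0.977, 0.977, 0.977]))
```

The returned pairs have the same shape as the onset and offset lists stored in the metadata, which is why the simulation step can write its annotations without any manual labeling.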

(a) Ordering

The audio contains two sequential events, with each event occurring a random number of times. The alignment signal is the order in which the events occur. The metadata records the onset and offset timestamps of each event occurrence.

| Caption | Metadata |
|---|---|
| A buzzer sounds initially, followed by a busy signal repeating three times. | "Buzzer": [[0.871, 1.637]], "Busy signal": [[4.32, 5.297], [5.654, 6.631], [7.951, 8.928]] |
| A mosquito buzzes twice, followed by the sound of a drill. | "Mosquito": [[1.358, 2.799], [4.164, 5.573]], "Drill": [[7.159, 10.0]] |
| An air horn blasts three times, followed by two knocks. | "Air horn, truck horn": [[0.925, 1.468], [2.053, 2.596], [2.963, 3.506]], "Knock": [[5.795, 5.998], [6.773, 6.976]] |
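
The ordering signal can be read straight off the metadata. The following is a minimal sketch, assuming the metadata fragments shown above are JSON object bodies without the surrounding braces; the helper names `parse_metadata`, `event_order`, and `is_sequential` are hypothetical and are not part of the released tools.

```python
import json


def parse_metadata(fragment):
    """Wrap a brace-less fragment as shown on this page so json can parse it."""
    return json.loads("{" + fragment + "}")


def event_order(metadata):
    """Return event labels sorted by the onset of their first occurrence."""
    return sorted(metadata, key=lambda label: metadata[label][0][0])


def is_sequential(metadata):
    """True if every occurrence of one event ends before the next event begins."""
    order = event_order(metadata)
    for earlier, later in zip(order, order[1:]):
        last_offset = metadata[earlier][-1][1]
        first_onset = metadata[later][0][0]
        if last_offset > first_onset:
            return False
    return True


if __name__ == "__main__":
    fragment = (
        '"Buzzer": [[0.871,1.637]],'
        '"Busy signal": [[4.32,5.297],[5.654,6.631],[7.951,8.928]]'
    )
    meta = parse_metadata(fragment)
    print(event_order(meta), is_sequential(meta))
```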




(b) Duration

The audio contains one or more events, each occurring once. The alignment signal is the duration of each event. The metadata records the duration of each event in seconds.

| Caption | Metadata |
|---|---|
| A neigh and whinny occurred for 1.811 seconds, followed by a train whistle lasting 3.677 seconds. | "Neigh, whinny": [1.811], "Train whistle": [3.677] |
| A rumble occurred for 7.606 seconds. | "Rumble": [7.606] |
| A squeak lasted for 0.302 seconds, followed by a quack for 0.314 seconds. | "Squeak": [0.302], "Quack": [0.314] |
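
A duration is simply an offset minus an onset, so duration-style metadata can be derived from timestamp-style metadata. The sketch below illustrates that relationship only; whether the dataset computes its duration annotations this way is an assumption, and `durations_from_timestamps` is a hypothetical helper.

```python
def durations_from_timestamps(metadata):
    """Map each label to its per-occurrence durations, rounded to milliseconds."""
    return {
        label: [round(offset - onset, 3) for onset, offset in segments]
        for label, segments in metadata.items()
    }


if __name__ == "__main__":
    # Timestamp-style metadata in the format used on this page.
    timestamps = {
        "Electric shaver, electric razor": [[1.056, 5.158]],
        "Jackhammer": [[7.66, 10.0]],
    }
    print(durations_from_timestamps(timestamps))
```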




(c) Frequency

Any event can occur any number of times. The alignment signal is the number of times each event occurs. The metadata records the onset timestamp of each occurrence.

| Caption | Metadata |
|---|---|
| Sanding occurs once, followed by throat clearing twice. | "Sanding": [1.735], "Throat clearing": [6.347, 8.794] |
| A whip cracks twice, followed by a whoosh sound once, and ends with a slam occurring once. | "Whip": [0.973, 2.783], "Whoosh, swoosh, swish": [5.758], "Slam": [9.068] |
| The spray occurred twice, followed by one instance of barking. | "Spray": [1.393, 3.159], "Bark": [6.451] |
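
Because the metadata stores one onset per occurrence, the frequency of an event is the length of its onset list. A minimal sketch, with hypothetical helper names, of checking the counts stated in a caption against the metadata:

```python
def occurrence_counts(onsets_by_label):
    """Count occurrences per event label from lists of onset timestamps."""
    return {label: len(onsets) for label, onsets in onsets_by_label.items()}


def matches_expected_counts(onsets_by_label, expected):
    """True if every label occurs exactly as many times as expected."""
    return occurrence_counts(onsets_by_label) == expected


if __name__ == "__main__":
    metadata = {
        "Whip": [0.973, 2.783],
        "Whoosh, swoosh, swish": [5.758],
        "Slam": [9.068],
    }
    # Counts read by hand from the caption: twice, once, once.
    expected = {"Whip": 2, "Whoosh, swoosh, swish": 1, "Slam": 1}
    print(matches_expected_counts(metadata, expected))
```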




(d) Timestamp

Any event can occur any number of times. The alignment signal is the onset and offset timestamps of each event occurrence. The metadata records these onset and offset timestamps.

| Caption | Metadata |
|---|---|
| An electric shaver buzzes from 1.056 to 5.158 seconds, followed by a jackhammer pounding from 7.66 to 10 seconds. | "Electric shaver, electric razor": [[1.056, 5.158]], "Jackhammer": [[7.66, 10.0]] |
| A cap gun fires from 1.045 to 1.438 seconds and again from 2.46 to 2.775 seconds, followed by a door sound from 4.91 to 5.288 seconds. | "Cap gun": [[1.045, 1.438], [2.46, 2.775]], "Door": [[4.91, 5.288]] |
| A vehicle horn honks from 1.004 to 1.996 seconds, followed by baby laughter twice, from 4.261 to 5.26 and 5.799 to 7.796 seconds. | "Vehicle horn, car horn, honking, toot": [[1.004, 1.996]], "Baby laughter": [[4.261, 5.26], [5.799, 7.796]] |
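
When loading timestamp metadata, it can help to confirm that each segment is well-formed before using it. The check below is a sketch under assumptions: the 10-second clip length is inferred from the examples on this page, and `validate_segments` is a hypothetical helper, not part of the released scripts.

```python
def validate_segments(metadata, clip_length=10.0):
    """Return a list of problems found; an empty list means the metadata looks sane."""
    problems = []
    for label, segments in metadata.items():
        previous_offset = 0.0
        for onset, offset in segments:
            # Each segment must lie inside the clip and have positive length.
            if not (0.0 <= onset < offset <= clip_length):
                problems.append(f"{label}: bad segment [{onset}, {offset}]")
            # Occurrences of the same event must be listed in time order.
            if onset < previous_offset:
                problems.append(f"{label}: segment [{onset}, {offset}] overlaps the previous one")
            previous_offset = offset
    return problems


if __name__ == "__main__":
    metadata = {
        "Cap gun": [[1.045, 1.438], [2.46, 2.775]],
        "Door": [[4.91, 5.288]],
    }
    print(validate_segments(metadata))
```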




STEAM

To better test and compare the temporal controllability of generative models, we propose STEAM (Strongly TEmporally-Aligned evaluation Metric). STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text. An audio-text grounding model is employed to detect the onset and offset of events in generated audio. STEAM assesses control performance based on detected timestamps and the control signal provided by the input free text.
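
To make the idea concrete, the sketch below compares detected segments with the control signal for two of the four tasks. It is not the STEAMtool implementation: the scoring rules, the function names, and the assumption that the grounding model returns a list of [onset, offset] pairs per event are all illustrative.

```python
def frequency_error(detected, target_count):
    """Absolute difference between detected and requested occurrence counts."""
    return abs(len(detected) - target_count)


def timestamp_error(detected, target):
    """Mean absolute onset/offset error in seconds, pairing segments in time order.

    Returns None when the number of detected segments differs from the target,
    since the pairing is then undefined in this simplified sketch.
    """
    if len(detected) != len(target) or not target:
        return None
    total = 0.0
    for (d_on, d_off), (t_on, t_off) in zip(sorted(detected), sorted(target)):
        total += abs(d_on - t_on) + abs(d_off - t_off)
    return total / (2 * len(target))


if __name__ == "__main__":
    # Hypothetical grounding-model output for one event in a generated clip.
    detected = [[1.5, 2.0]]
    # Control signal taken from the input text.
    target = [[1.0, 2.5]]
    print(frequency_error(detected, target_count=1))
    print(timestamp_error(detected, target))
```

A real evaluation would also need a rule for unmatched or spurious detections; this sketch sidesteps that by returning `None` when the counts differ.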

The testing script is available at STEAMtool. We use it to evaluate several influential text-to-audio (TTA) generation models.

Acknowledgement

The demo website template was obtained from Make-An-Audio-2, and we appreciate their open-source contribution!