Abstract.
Recent advances in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature of audio content, are underrepresented in mainstream models because they offer no precise temporal control: users cannot accurately control the timestamps of sound events with free-form text. We identify a significant contributing factor: the absence of high-quality, temporally aligned audio-text datasets, which are essential for training models with temporal control. The more detailed the annotations, the better a model can learn the precise relationship between audio outputs and temporal textual prompts. We therefore present AudioTime, a strongly aligned audio-text dataset. It provides text annotations rich in temporal information, such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. We also provide STEAM, a comprehensive metric for evaluating the temporal control performance of generative models. The data and evaluation scripts are available at AudioTime and STEAMtool, respectively.
Demo Navigation.
Ordering: the audio contains two sequential events, each occurring a random number of times. The alignment signal is the sequence of event occurrences. The metadata records the onset and offset timestamps of each event.
Duration: the audio contains several events, each occurring once. The alignment signal is the duration of each event. The metadata records the duration of each event occurrence.
Frequency: any event can occur any number of times. The alignment signal is the frequency of event occurrences. The metadata records the onset timestamp of each event.
Timestamp: any event can occur any number of times. The alignment signal is the onset and offset timestamps of event occurrences. The metadata records the onset and offset timestamps of each event.
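To make the four annotation types concrete, the sketch below shows a hypothetical AudioTime-style metadata entry and how duration and frequency signals can be derived from onset/offset timestamps. The field names and caption are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical annotation entry (field names are illustrative assumptions,
# not the dataset's actual schema).
annotation = {
    "caption": "A dog barks two times, followed by a man speaking.",
    "events": [
        {"label": "dog barking", "onset": 0.50, "offset": 1.20},
        {"label": "dog barking", "onset": 1.80, "offset": 2.40},
        {"label": "man speaking", "onset": 3.00, "offset": 5.10},
    ],
}

def event_durations(entry):
    """Derive per-occurrence durations (seconds) from onset/offset timestamps."""
    return [round(e["offset"] - e["onset"], 2) for e in entry["events"]]

def event_frequency(entry, label):
    """Count how many times a given sound event occurs."""
    return sum(1 for e in entry["events"] if e["label"] == label)

print(event_durations(annotation))                 # [0.7, 0.6, 2.1]
print(event_frequency(annotation, "dog barking"))  # 2
```

Because every occurrence carries its own onset and offset, the duration, frequency, and ordering signals all fall out of the same timestamp records.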
To better test and compare the temporal controllability of generative models, we propose STEAM (Strongly TEmporally-Aligned evaluation Metric).
STEAM is a text-based metric that evaluates whether the generated audio segments meet the control requirements specified by the input text.
An audio-text grounding model is employed to detect the onsets and offsets of events in the generated audio.
STEAM then assesses control performance by comparing the detected timestamps against the control signal specified in the input free text.
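The comparison between detected and requested timestamps can be sketched as a tolerance-based matching score. The greedy matching rule and 0.5 s tolerance below are assumptions for illustration, not STEAM's official definition.

```python
# Illustrative timestamp-based scoring in the spirit of STEAM; the matching
# rule and tolerance are assumptions, not the metric's official definition.
def match_events(targets, detections, tol=0.5):
    """Greedily match detected (onset, offset) pairs to target pairs
    whose onset and offset both fall within `tol` seconds."""
    unmatched = list(detections)
    hits = 0
    for t_on, t_off in targets:
        for d in unmatched:
            d_on, d_off = d
            if abs(d_on - t_on) <= tol and abs(d_off - t_off) <= tol:
                hits += 1
                unmatched.remove(d)
                break
    return hits

def timestamp_f1(targets, detections, tol=0.5):
    """Segment-level F1 between requested and detected event boundaries."""
    if not targets or not detections:
        return 0.0
    hits = match_events(targets, detections, tol)
    if hits == 0:
        return 0.0
    precision = hits / len(detections)
    recall = hits / len(targets)
    return 2 * precision * recall / (precision + recall)

targets = [(0.5, 1.2), (3.0, 5.1)]      # timestamps requested by the text
detections = [(0.6, 1.1), (3.4, 5.0)]   # timestamps found by the grounding model
print(timestamp_f1(targets, detections))  # 1.0
```

A higher score indicates that the generated audio places events closer to the timestamps the input text requested.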