
Welcome to PicoAudio: enabling Precise tIming and frequency COntrollability of audio events

This page showcases the timing-control performance of PicoAudio on four test sets.

Enjoy your browsing!

HuggingFace online inference | Paper | Code

(a). Onset-controllable single-event. One sound event is specified in the generated audio, with its timing controlled by onset and offset timestamps.

Text input:  dog barking at 0.562-2.562, 4.25-6.25

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  cow mooing at 0.958-3.582, 5.272-7.896

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  door knocking at 3.219-5.346, 7.058-9.185

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  door slamming at 2.809-4.22, 6.263-8.502

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  duck quacking at 0.089-2.089, 4.166-5.984

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  explosion at 2.521-5.273

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  gunshot at 1.931-3.931, 4.716-6.716, 7.891-9.891

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  sheep goat bleating at 0.26-2.26, 3.592-5.592, 7.325-9.325

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  car horn honking at 1.566-4.25, 6.473-9.434

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  burping belching at 0.323-3.28, 4.07-6.229, 7.049-9.208

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2
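The text inputs in this section share the pattern `<event> at <onset>-<offset>, <onset>-<offset>, ...`. As a minimal sketch of how such an input could be split into an event name and a list of time segments, the snippet below uses a hypothetical helper, `parse_onset_caption`, which is not part of PicoAudio. It assumes the timestamps are in seconds and that the event name never contains the word "at".

```python
import re

# Matches one "<onset>-<offset>" pair such as "0.562-2.562".
SPAN = re.compile(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)")


def parse_onset_caption(caption):
    """Split '<event> at <on>-<off>, ...' into (event, [(on, off), ...]).

    Hypothetical helper for illustration only; timestamps are assumed
    to be seconds.
    """
    event, _, spans = caption.strip().partition(" at ")
    segments = []
    for on, off in SPAN.findall(spans):
        on_s, off_s = float(on), float(off)
        if off_s <= on_s:
            raise ValueError(f"offset must follow onset: {on}-{off}")
        segments.append((on_s, off_s))
    if not segments:
        raise ValueError(f"no time segments found in: {caption!r}")
    return event, segments


event, segments = parse_onset_caption("dog barking at 0.562-2.562, 4.25-6.25")
print(event, segments)
# dog barking [(0.562, 2.562), (4.25, 6.25)]
```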

(b). Frequency-controllable single-event. One sound event is specified in the generated audio, with its timing controlled by occurrence frequency, i.e. how many times the event occurs.

For practicality and generality, we also define 'a burst of consecutive short events' as 'one occurrence', such as 'a burst of dog barking' and 'a burst of knocking sounds'.

Text input:  door slamming three times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  sheep goat bleating two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  thump thud two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  burping belching two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  car horn honking three times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2
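The "burst counts as one occurrence" convention used in this section can be made concrete with a small sketch. The function below counts occurrences in a list of (onset, offset) segments and merges segments separated by a short gap into a single burst. The function name, the 0.5-second gap threshold, and the example segments are all assumptions made for illustration; the page does not state how bursts are delimited.

```python
def count_occurrences(segments, max_gap=0.5):
    """Count occurrences, treating segments separated by at most
    `max_gap` seconds as one burst.

    Hypothetical helper; `max_gap=0.5` is an assumed threshold, not a
    value taken from PicoAudio.
    """
    count = 0
    last_off = None
    for on, off in sorted(segments):
        if last_off is None or on - last_off > max_gap:
            # Gap is large enough: this segment starts a new occurrence.
            count += 1
            last_off = off
        else:
            # Segment continues the current burst.
            last_off = max(last_off, off)
    return count


# Made-up example: three quick barks, a long pause, then two more barks.
barks = [(0.0, 0.3), (0.4, 0.7), (0.8, 1.1), (4.0, 4.3), (4.4, 4.7)]
print(count_occurrences(barks))
# 2
```

With this convention, an input such as "dog barking two times" would describe two bursts rather than two individual barks.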

(c). Onset-controllable multi-event. Multiple sound events are specified in the generated audio, with their timing controlled by onset and offset timestamps.

Text input:  cow mooing at 1.573-4.373 and gunshot at 7.482-9.482

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  duck quacking at 3.246-5.246 and cat meowing at 7.245-8.635

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  gunshot at 0.007-2.007 and spraying at 4.251-5.047

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  thump thud at 0.593-2.893 and cow mooing at 4.617-7.241

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  whistling at 0.12-4.714 and tapping clicking clanking at 0.731-4.171

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2
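In the multi-event inputs of this section, events are joined with "and", and their segments may overlap in time, as in the whistling and tapping example. The sketch below parses such an input and reports which pairs of different events overlap and for how long. Both helper names are hypothetical, and the code assumes timestamps are in seconds and that event names contain neither "and" nor "at" as separate words.

```python
import re
from itertools import combinations

# Matches one "<onset>-<offset>" pair such as "0.731-4.171".
SPAN = re.compile(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)")


def parse_multi_onset_caption(caption):
    """Return a list of (event, onset, offset) tuples, one per segment."""
    events = []
    for clause in caption.strip().split(" and "):
        name, _, spans = clause.partition(" at ")
        for on, off in SPAN.findall(spans):
            events.append((name.strip(), float(on), float(off)))
    return events


def find_overlaps(events):
    """Return (event_a, event_b, shared_duration) for overlapping
    segments that belong to different events."""
    found = []
    for (a, a_on, a_off), (b, b_on, b_off) in combinations(events, 2):
        shared = min(a_off, b_off) - max(a_on, b_on)
        if a != b and shared > 0:
            found.append((a, b, round(shared, 3)))
    return found


caption = "whistling at 0.12-4.714 and tapping clicking clanking at 0.731-4.171"
events = parse_multi_onset_caption(caption)
print(events)
print(find_overlaps(events))
```

For the input above, the tapping segment lies entirely inside the whistling segment, so the reported shared duration is the full length of the tapping segment.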

(d). Frequency-controllable multi-event. Multiple sound events are specified in the generated audio, with their timing controlled by occurrence frequency, i.e. how many times each event occurs.

For practicality and generality, we also define 'a burst of consecutive short events' as 'one occurrence', such as 'a burst of dog barking' and 'a burst of knocking sounds'.

Text input:  cat meowing two times and whistling one times and explosion two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  door slamming three times and tapping clicking clanking one times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  explosion two times and duck quacking two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  gunshot one times and duck quacking one times and tapping clicking clanking one times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2

Text input:  sheep goat bleating one times and door knocking two times

Simulated audio (Ground truth)

PicoAudio (Ours)

Amphion

AudioLDM2
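The frequency-style inputs in this section follow the pattern `<event> <number word> times`, with multiple events joined by "and". A minimal sketch of turning such an input into per-event target counts is shown below. The helper name and the number-word table are assumptions for illustration; only "one", "two", and "three" appear in the inputs on this page.

```python
# Number words accepted by this sketch; "four" and "five" are included
# as an assumption and do not appear in the inputs on this page.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_frequency_caption(caption):
    """Map each event in '<event> <n> times and ...' to its count.

    Hypothetical helper; assumes event names do not contain the
    separate word "and".
    """
    counts = {}
    for clause in caption.strip().split(" and "):
        words = clause.split()
        if (len(words) < 3 or words[-1] != "times"
                or words[-2] not in NUMBER_WORDS):
            raise ValueError(f"unrecognized clause: {clause!r}")
        counts[" ".join(words[:-2])] = NUMBER_WORDS[words[-2]]
    return counts


caption = "cat meowing two times and whistling one times and explosion two times"
print(parse_frequency_caption(caption))
# {'cat meowing': 2, 'whistling': 1, 'explosion': 2}
```

The same function also handles single-event inputs such as "door slamming three times", which contain only one clause.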