Abstract.
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing which source produced them, or generalizing to unseen sources, which limits the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
1. Clip-Generation
Original Caption | A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle. |
Target Caption | A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle. |
Inserted event | None |
Fake Onset-Offset | Entire Clip (0-10s) |
Original Caption | A man is speaking while typing. |
Target Caption | A man is speaking while typing. |
Inserted event | None |
Fake Onset-Offset | Entire Clip (0-10s) |
2. Clip-Editing
Original Caption | Someone burps and then laughs. |
Target Caption | Laughs while bleating. |
Inserted event | Bleating |
Fake Onset-Offset | Entire Clip (0-10s) |
Original Caption | A woman talks and a baby whispers. |
Target Caption | A baby whispers while a man speaks. |
Inserted event | A man speaks |
Fake Onset-Offset | Entire Clip (0-10s) |
3. Frame-Inpainting
Original Caption | A group of people laughing followed by farting. |
Target Caption | A group of people laughing followed by farting. |
Inserted event | None |
Fake Onset-Offset | 3.943-7.56s |
Original Caption | A man speaking as rain lightly falls followed by thunder. |
Target Caption | A man speaking as rain lightly falls followed by thunder. |
Inserted event | None |
Fake Onset-Offset | 0.244-6.872s |
4. Frame-Separation
Original Caption | A man talking followed by a series of belches. |
Target Caption | A series of belches. |
Inserted event | None |
Fake Onset-Offset | 0.12-0.92s |
Original Caption | Birds chirping and bees buzzing. |
Target Caption | Bees buzzing. |
Inserted event | None |
Fake Onset-Offset | 0.0-1.68s, 2.16-10.0s |
5. Frame-Splice
Original Caption | A man speaking as rain lightly falls followed by thunder. |
Target Caption | A man speaking as rain lightly falls followed by thunder and a series of gunshots fire. |
Inserted event | A series of gunshots fire |
Fake Onset-Offset | 4.12-5.92s |
Original Caption | A man speaking followed by another man speaking with some rustling. |
Target Caption | A man speaking followed by another man speaking with some rustling as a baby laughing. |
Inserted event | A baby laughing |
Fake Onset-Offset | 2.6-5.56s |
Original Caption | Birds chirp and a duck quacks followed by a dog barking. |
Target Caption | Birds chirp and a duck quacks followed by a dog barking and a woman talking. |
Inserted event | A woman talking |
Fake Onset-Offset | 0.0-1.52s |
6. Frame-Addition
Original Caption | A man and woman laughing followed by a man shouting then a woman laughing as a child laughs. |
Target Caption | A man and woman laughing followed by a man shouting then a woman laughing as a child laughs and a baby cries. |
Inserted event | A baby cries |
Fake Onset-Offset | 2.85-4.37s |
Original Caption | A man speaks with some clinking and clanking. |
Target Caption | A man speaks with some clinking and clanking while music plays in the foreground. |
Inserted event | Music plays in the foreground |
Fake Onset-Offset | 5.78-8.38s |
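For the localization dimension, the Fake Onset-Offset fields shown in the examples can be turned into frame-level labels. The following is a minimal sketch, not the benchmark's evaluation code: the function names `parse_segments` and `frame_labels`, the 10-second clip length, and the 10 frames-per-second resolution are all assumptions chosen for illustration. It accepts whole-clip entries such as `Entire Clip (0-10s)`, single segments such as `3.943-7.56s`, and multi-segment entries such as `0.0-1.68_2.16-10.0s`.

```python
import math
import re

# Matches one "onset-offset" pair such as "3.943-7.56" or "0-10".
_SEGMENT = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")


def parse_segments(field):
    """Return every (onset, offset) pair, in seconds, found in the field."""
    return [(float(a), float(b)) for a, b in _SEGMENT.findall(field)]


def frame_labels(field, clip_seconds=10.0, frames_per_second=10):
    """Mark each frame 1 if it overlaps a manipulated segment, else 0.

    clip_seconds and frames_per_second are illustrative assumptions,
    not values taken from the benchmark.
    """
    n_frames = int(round(clip_seconds * frames_per_second))
    labels = [0] * n_frames
    eps = 1e-9  # guards against floating-point noise at frame boundaries
    for onset, offset in parse_segments(field):
        first = max(0, math.floor(onset * frames_per_second + eps))
        last = min(n_frames, math.ceil(offset * frames_per_second - eps))
        for i in range(first, last):
            labels[i] = 1
    return labels


if __name__ == "__main__":
    print(parse_segments("Entire Clip (0-10s)"))
    print(sum(frame_labels("0.0-1.68_2.16-10.0s")))
```

A frame is counted as manipulated if any part of it overlaps a segment, so partial frames at the boundaries are labeled fake. A stricter rule, such as requiring majority overlap, would be an equally plausible choice.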