
FakeSound2

A Benchmark for Explainable and Generalizable Deepfake Sound Detection

| Paper | Evaluation Code | Data |

Zeyu Xie¹, Xuenan Xu², Yongkang Yin¹, Chenxing Li³, Mengyue Wu², Yuexian Zou¹

¹ ADSP Lab, Peking University; ² X-LANCE Lab, Shanghai Jiao Tong University; ³ Tencent AI Lab

zeyuxie25@stu.pku.edu.cn

Abstract.

The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources—thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.



FakeSound2 Construction


1. Task Description


Explainable deepfake detection needs to identify the manipulation method (how), localize the temporal positions of the forgery (when), and trace the accountable sources (where).
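The three questions can be pictured as one structured target per clip. The sketch below is only illustrative: the class name `ExplainableLabel`, its field names, the source name `source_a`, and the 0.5 s frame resolution are assumptions made for this example, not the benchmark's actual label format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExplainableLabel:
    """Hypothetical per-clip target covering how, when, and where."""
    manipulation: str  # how: e.g. "splice", or "none" for genuine audio
    segments: List[Tuple[float, float]] = field(default_factory=list)  # when: (onset, offset) in seconds
    source: str = "genuine"  # where: name of the generating source


def frame_labels(label: ExplainableLabel,
                 clip_seconds: float = 10.0,
                 frame_seconds: float = 0.5) -> List[int]:
    """Convert forged segments into a 0/1 label per frame (1 = forged)."""
    n_frames = round(clip_seconds / frame_seconds)
    labels = [0] * n_frames
    for onset, offset in label.segments:
        # Snap segment boundaries to the nearest frame index and clamp to the clip.
        start = max(0, round(onset / frame_seconds))
        end = min(n_frames, round(offset / frame_seconds))
        for i in range(start, end):
            labels[i] = 1
    return labels


# Assumed example: a splice between 2.0 s and 4.0 s from a made-up source.
example = ExplainableLabel("splice", [(2.0, 4.0)], "source_a")
print(frame_labels(example))
```

A clip-wise forgery would simply carry one segment spanning the whole clip, while genuine audio would carry an empty segment list.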



2. Manipulation Pipeline


(I) A text-to-audio grounding model temporally localizes sound events and masks the corresponding segments; (II) based on the task type, an LLM generates the target caption and the inserted-event description; (III) depending on the task requirements, the relevant models or scripts are invoked to generate synthetic audio segments, and frame-wise forgeries are seamlessly spliced into the masked regions.
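The final splicing step of (III) can be sketched roughly as follows. This is a simplified illustration, not the pipeline's actual code: the function name `splice_segments` is hypothetical, waveforms are plain Python lists of samples, and the generated audio is assumed to be time-aligned with the original. A hard sample replacement is used here; a real pipeline would likely smooth the boundaries (for example with a short cross-fade) to make the splice seamless.

```python
from typing import List, Sequence, Tuple


def splice_segments(original: Sequence[float],
                    generated: Sequence[float],
                    regions: Sequence[Tuple[float, float]],
                    sample_rate: int) -> List[float]:
    """Copy `generated` samples into `original` inside each masked region.

    `regions` holds (onset, offset) pairs in seconds. Both waveforms are
    assumed to have the same length and sample rate.
    """
    if len(original) != len(generated):
        raise ValueError("original and generated must have the same length")
    out = list(original)
    for onset, offset in regions:
        # Convert seconds to sample indices and clamp to the clip length.
        start = max(0, round(onset * sample_rate))
        end = min(len(out), round(offset * sample_rate))
        if start < end:
            out[start:end] = list(generated[start:end])
    return out


# Toy usage: 1 second of audio at 10 Hz, masked region 0.2-0.5 s.
mixed = splice_segments([0.0] * 10, [1.0] * 10, [(0.2, 0.5)], sample_rate=10)
print(mixed)
```

Everything outside the masked regions stays bit-identical to the genuine recording, which is what makes frame-wise forgeries a localization problem rather than a clip-level one.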



3. Data Statistics


Statistics

FakeSound2 is constructed through an automated pipeline built on the AudioCaps dataset. It covers 6 distinct manipulation types derived from 12 audio sources (11 synthetic and 1 genuine), and comprises a training set of 369,929 samples and a test set of 5,553 samples. Researchers can flexibly partition the test set to evaluate generalization, provided that out-of-domain (OOD) sources remain completely unseen during training.
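One way to set up such a partition is sketched below. It assumes each sample is a dictionary with a `source` field; that field name, the helper names, and the source names `source_a`, `source_b`, and `source_c` are placeholders for this example and do not reflect the released metadata schema.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Dict[str, str]


def partition_by_source(samples: Iterable[Sample],
                        ood_sources: Iterable[str]) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (in-domain, out-of-domain) by their source name."""
    ood = set(ood_sources)
    in_domain: List[Sample] = []
    out_of_domain: List[Sample] = []
    for sample in samples:
        (out_of_domain if sample["source"] in ood else in_domain).append(sample)
    return in_domain, out_of_domain


def check_ood_unseen(train_samples: Iterable[Sample],
                     ood_sources: Iterable[str]) -> None:
    """Raise if any OOD source also appears in the training samples."""
    leaked = {s["source"] for s in train_samples} & set(ood_sources)
    if leaked:
        raise ValueError(f"OOD sources present in training data: {sorted(leaked)}")


# Toy usage with made-up source names.
train = [{"id": "t1", "source": "source_a"}, {"id": "t2", "source": "source_b"}]
test = [{"id": "e1", "source": "source_a"}, {"id": "e2", "source": "source_c"}]
check_ood_unseen(train, ["source_c"])
test_id, test_ood = partition_by_source(test, ["source_c"])
print(len(test_id), len(test_ood))
```

Running the leakage check before training keeps the OOD constraint explicit rather than relying on the split being constructed correctly by hand.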

Metadata

Analysis


4. Feature Visualization


Overall, the current model exhibits reliable separability between authentic and manipulated audio, enabling effective binary classification. However, certain forged audio categories demonstrate substantial overlap in the feature space, which diminishes explainability for source attribution.



5. Limitations and Challenges of Current Models


Although current models perform well in binary classification, they exhibit the following shortcomings:

Deepfake Samples


1. Clip-Generation

Original Caption: A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle.
Target Caption: A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle.
Inserted event: None
Fake Onset-Offset: Entire Clip (0-10s)
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: A man is speaking while typing.
Target Caption: A man is speaking while typing.
Inserted event: None
Fake Onset-Offset: Entire Clip (0-10s)
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

2. Clip-Editing

Original Caption: Someone burps and then laughs.
Target Caption: Laughs while bleating.
Inserted event: Bleating
Fake Onset-Offset: Entire Clip (0-10s)
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: A woman talks and a baby whispers.
Target Caption: A baby whispers while a man speaks.
Inserted event: A man speaks
Fake Onset-Offset: Entire Clip (0-10s)
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

3. Frame-Inpainting

Original Caption: A group of people laughing followed by farting.
Target Caption: A group of people laughing followed by farting.
Inserted event: None
Fake Onset-Offset: 3.943-7.56s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: A man speaking as rain lightly falls followed by thunder.
Target Caption: A man speaking as rain lightly falls followed by thunder.
Inserted event: None
Fake Onset-Offset: 0.244-6.872s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

4. Frame-Separation

Original Caption: A man talking followed by a series of belches.
Target Caption: A series of belches.
Inserted event: None
Fake Onset-Offset: 0.12-0.92s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: Birds chirping and bees buzzing.
Target Caption: Bees buzzing.
Inserted event: None
Fake Onset-Offset: 0.0-1.68_2.16-10.0s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra
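
The onset-offset annotations in these samples appear in several textual forms, such as `0.12-0.92s`, `0.0-1.68_2.16-10.0s`, and `Entire Clip (0-10s)`. A minimal parsing sketch is given below. It assumes that an underscore separates multiple forged segments within one clip; this is an interpretation of the strings shown on this page rather than a documented format, and the function name `parse_onset_offset` is hypothetical.

```python
import re
from typing import List, Tuple

# One "onset-offset" pair of non-negative decimal numbers, e.g. "2.16-10.0".
_PAIR = re.compile(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)")


def parse_onset_offset(text: str) -> List[Tuple[float, float]]:
    """Extract every (onset, offset) pair, in seconds, from an annotation string."""
    return [(float(onset), float(offset)) for onset, offset in _PAIR.findall(text)]


def forged_duration(segments: List[Tuple[float, float]]) -> float:
    """Total forged time in seconds, assuming the segments do not overlap."""
    return sum(offset - onset for onset, offset in segments)


segments = parse_onset_offset("0.0-1.68_2.16-10.0s")
print(segments, round(forged_duration(segments), 2))
```

Under this reading, the second Frame-Separation sample is forged everywhere except a short window between 1.68 s and 2.16 s.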

5. Frame-Splice

Original Caption: A man speaking as rain lightly falls followed by thunder
Target Caption: A man speaking as rain lightly falls followed by thunder and a series of gunshots fire.
Inserted event: A series of gunshots fire
Fake Onset-Offset: 4.12-5.92s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: A man speaking followed by another man speaking with some rustling
Target Caption: A man speaking followed by another man speaking with some rustling as a baby laughing.
Inserted event: A baby laughing
Fake Onset-Offset: 2.6-5.56s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: Birds chirp and a duck quacks followed by a dog barking.
Target Caption: Birds chirp and a duck quacks followed by a dog barking and a woman talking.
Inserted event: A woman talking
Fake Onset-Offset: 0.0-1.52s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

6. Frame-Addition

Original Caption: A man and woman laughing followed by a man shouting then a woman laughing as a child laughs.
Target Caption: A man and woman laughing followed by a man shouting then a woman laughing as a child laughs and a baby cries.
Inserted event: A baby cries
Fake Onset-Offset: 2.85-4.37s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra

Original Caption: A man speaks with some clinking and clanking.
Target Caption: A man speaks with some clinking and clanking while music plays in the foreground.
Inserted event: Music plays in the foreground
Fake Onset-Offset: 5.78-8.38s
Original Audio | Manipulated Audio | Original Spectra | Manipulated Spectra




Acknowledgement

This dataset leverages multiple audio generation models, and we appreciate their open-source contributions.