FakeSound2

A Benchmark for Explainable and Generalizable Deepfake Sound Detection

Zeyu Xie ¹, Xuenan Xu ², Yongkang Yin ¹, Chenxing Li ³, Mengyue Wu ², Yuexian Zou ¹

¹ ADSP Lab, Peking University ² X-LANCE Lab, Shanghai Jiao Tong University ³ Tencent AI Lab

zeyuxie25@stu.pku.edu.cn

Abstract.

The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources—thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.

Data Construction⇒ 1. Task Description 2. Pipeline 3. Data Statistics | Analysis⇒ 4. Visualization 5. Model Limitations

Deepfake Samples.

Clip-wise⇒ 1. Generation 2. Editing | Frame-wise⇒ 3. Inpainting 4. Separation 5. Splice 6. Addition

FakeSound2 Construction

1. Task Description

Explainable deepfake detection needs to identify the manipulation method (how), localize the temporal positions of the forgery (when), and trace the responsible source (where).

2. Manipulation Pipeline

(I) A text-to-audio grounding model temporally localizes sound events and masks the corresponding segments.
(II) Based on the task type, an LLM generates the target captions and the descriptions of inserted events.
(III) Depending on the task requirements, the relevant models or scripts are invoked to generate synthetic audio segments; frame-wise forgeries are then seamlessly spliced into the original audio according to the masked regions.
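The final splicing step for frame-wise forgeries can be illustrated with a minimal sketch. It assumes the original and generated waveforms are equal-length sequences of samples at the same sampling rate, and it performs a hard replacement inside each masked region. The function name `splice_segments` and its parameters are hypothetical, and the actual pipeline may apply additional smoothing at segment boundaries to make the splice seamless.

```python
def splice_segments(original, generated, segments, sample_rate):
    """Copy `original`, replacing samples inside each masked segment
    with the samples of `generated` at the same positions.

    `segments` is a list of (onset, offset) pairs in seconds.
    This is an illustrative hard splice, not the benchmark's own script.
    """
    if len(original) != len(generated):
        raise ValueError("original and generated must have the same length")
    spliced = list(original)
    for onset, offset in segments:
        # Convert seconds to sample indices and clamp to the clip boundaries.
        start = max(0, int(round(onset * sample_rate)))
        end = min(len(spliced), int(round(offset * sample_rate)))
        spliced[start:end] = generated[start:end]
    return spliced
```

Samples outside the masked regions stay bit-identical to the original, which is why frame-wise forgeries call for temporal localization rather than clip-level classification alone.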

3. Data Statistics

Statistics

FakeSound2 is constructed through an automated pipeline built on the AudioCaps dataset. It covers 6 distinct manipulation types derived from 12 audio sources (11 synthetic and 1 genuine). The dataset comprises a training set of 369,929 samples and a test set of 5,553 samples. Researchers can flexibly partition the test subset to evaluate generalization, provided that out-of-domain (OOD) sources remain completely unseen during training.
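One way to respect the OOD constraint is to partition records by source and check that no held-out source leaks into the training data. The sketch below assumes each annotation record is a dictionary whose "model" field names the source, as in the annotation format of this benchmark. The helper names `split_by_source` and `assert_ood_unseen` are hypothetical, and the choice of which sources to hold out is left to the researcher.

```python
def split_by_source(records, ood_sources):
    """Split records into (in_domain, held_out) by their "model" field."""
    ood = set(ood_sources)
    in_domain = [r for r in records if r["model"] not in ood]
    held_out = [r for r in records if r["model"] in ood]
    return in_domain, held_out


def assert_ood_unseen(train_records, ood_sources):
    """Raise ValueError if any OOD source appears in the training records."""
    leaked = {r["model"] for r in train_records} & set(ood_sources)
    if leaked:
        raise ValueError(f"OOD sources present in training data: {sorted(leaked)}")


# Illustrative usage with made-up records; the source names are examples only.
example_records = [
    {"model": "AudioLDM2", "audio_id": "a"},
    {"model": "Tango2", "audio_id": "b"},
]
example_in_domain, example_held_out = split_by_source(example_records, ["Tango2"])
assert_ood_unseen(example_in_domain, ["Tango2"])
```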

Metadata

The annotation is organized as:

// Dictionary Structure
{
    "fake_type": "Inpainting", // Type of manipulation (e.g., inpainting, editing, ...)
    "model": "AudioLDM2", // Source of the manipulated audio (e.g., AudioLDM2, Tango2, ...)
    "audio_id": "xxxxxxxxx", // File ID
    "onset_offset": "2-3_5.5-6", // Temporal timestamps of the forged segments
    "filepath": "data/test/audio/inpainting_ldm2/xxxxxxxxx.wav", // Filepath to the audio file
    "target_caption": "a man talks, a dog barks", // Target caption generated by LLMs or rule-based scripts
    "add_event / splice_event": "a dog barks", // Selected event to be inserted (optional; in addition, splice, editing)
    "delete_event": "a sprayer sprays" // Deleted event in original caption (optional; in separation, addition, splice, editing)
}
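The "onset_offset" field packs the forged segments into a single string. The sketch below shows one possible way to decode it, assuming segments are separated by "_" and each segment is written as "onset-offset" in seconds, as in the example value "2-3_5.5-6". The function names, the 10-second clip length, and the frame rate are illustrative assumptions rather than part of the benchmark's tooling.

```python
import math


def parse_onset_offset(field):
    """Parse a string such as "2-3_5.5-6" into [(2.0, 3.0), (5.5, 6.0)]."""
    segments = []
    for chunk in field.split("_"):
        onset, offset = chunk.split("-")
        segments.append((float(onset), float(offset)))
    return segments


def frame_labels(segments, clip_seconds=10, frames_per_second=50):
    """Build per-frame labels: 1 for frames overlapping a forged segment, else 0.

    The clip length and frame rate are assumed defaults for illustration.
    """
    n_frames = int(round(clip_seconds * frames_per_second))
    labels = [0] * n_frames
    for onset, offset in segments:
        # A frame counts as forged if any part of it lies inside the segment.
        start = max(0, math.floor(onset * frames_per_second))
        end = min(n_frames, math.ceil(offset * frames_per_second))
        for i in range(start, end):
            labels[i] = 1
    return labels
```

Frame-level labels of this kind are a common target for temporal localization models; the resolution should match the frame rate of whatever detector is being evaluated.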

Analysis

4. Feature Visualization

Overall, the current model exhibits reliable separability between authentic and manipulated audio, enabling effective binary classification. However, certain forged audio categories demonstrate substantial overlap in the feature space, which diminishes explainability for source attribution.

5. Limitations and Challenges of Current Models

Although current models perform well in binary classification, they exhibit the following shortcomings:

Limitation⇒ Difficulty in distinguishing between deepfake audio generated by similar manipulation frameworks.
Challenge⇒ Enabling models to capture fine-grained distributional discrepancies.

Limitation⇒ Limited generalization capability.
Challenge⇒ Enabling models to characterize the true data distribution of authentic audio rather than memorizing artificial forgery patterns.

Deepfake Samples

1. Clip-Generation
Sample 1
Original Caption: A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle.
Target Caption: A man talking while wood clanks on a metal pan followed by gravel crunching as food and oil sizzle.
Inserted Event: None
Fake Onset-Offset: Entire Clip (0-10s)
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: A man is speaking while typing.
Target Caption: A man is speaking while typing.
Inserted Event: None
Fake Onset-Offset: Entire Clip (0-10s)
[Audio and spectra: original vs. manipulated]

2. Clip-Editing
Sample 1
Original Caption: Someone burps and then laughs.
Target Caption: Laughs while bleating.
Inserted Event: Bleating
Fake Onset-Offset: Entire Clip (0-10s)
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: A woman talks and a baby whispers.
Target Caption: A baby whispers while a man speaks.
Inserted Event: A man speaks
Fake Onset-Offset: Entire Clip (0-10s)
[Audio and spectra: original vs. manipulated]

3. Frame-Inpainting
Sample 1
Original Caption: A group of people laughing followed by farting.
Target Caption: A group of people laughing followed by farting.
Inserted Event: None
Fake Onset-Offset: 3.943-7.56s
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: A man speaking as rain lightly falls followed by thunder.
Target Caption: A man speaking as rain lightly falls followed by thunder.
Inserted Event: None
Fake Onset-Offset: 0.244-6.872s
[Audio and spectra: original vs. manipulated]

4. Frame-Separation
Sample 1
Original Caption: A man talking followed by a series of belches.
Target Caption: A series of belches.
Inserted Event: None
Fake Onset-Offset: 0.12-0.92s
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: Birds chirping and bees buzzing.
Target Caption: Bees buzzing.
Inserted Event: None
Fake Onset-Offset: 0.0-1.68_2.16-10.0s
[Audio and spectra: original vs. manipulated]

5. Frame-Splice
Sample 1
Original Caption: A man speaking as rain lightly falls followed by thunder
Target Caption: a man speaking as rain lightly falls followed by thunder and a series of gunshots fire.
Inserted Event: A series of gunshots fire
Fake Onset-Offset: 4.12-5.92s
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: A man speaking followed by another man speaking with some rustling
Target Caption: A man speaking followed by another man speaking with some rustling as a baby laughing.
Inserted Event: A baby laughing
Fake Onset-Offset: 2.6-5.56s
[Audio and spectra: original vs. manipulated]

Sample 3
Original Caption: Birds chirp and a duck quacks followed by a dog barking.
Target Caption: Birds chirp and a duck quacks followed by a dog barking and a woman talking.
Inserted Event: A woman talking
Fake Onset-Offset: 0.0-1.52s
[Audio and spectra: original vs. manipulated]

6. Frame-Addition
Sample 1
Original Caption: A man and woman laughing followed by a man shouting then a woman laughing as a child laughs.
Target Caption: a man and woman laughing followed by a man shouting then a woman laughing as a child laughs and a baby cries.
Inserted Event: A baby cries
Fake Onset-Offset: 2.85-4.37s
[Audio and spectra: original vs. manipulated]

Sample 2
Original Caption: A man speaks with some clinking and clanking.
Target Caption: A man speaks with some clinking and clanking while music plays in the foreground.
Inserted Event: Music plays in the foreground
Fake Onset-Offset: 5.78-8.38s
[Audio and spectra: original vs. manipulated]