FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Mengtian Li1,2, Kunyan Dai1, Yi Ding1, Ruobing Ni1, Ying Zhang1, Wenwu Wang3, Zhifeng Xie1,2
1Shanghai Film Academy, Shanghai University 2Shanghai Engineering Research Center of Motion Picture Special Effects 3University of Surrey, UK
CVPR 2026

Video Presentation

FoleyDesigner Demo Gallery

FilmStereo Sample Gallery

Abstract

Foley art plays a pivotal role in creating immersive auditory experiences in film, yet manually crafting spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows that integrates film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets for film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. The framework supports interactive user control while integrating seamlessly with professional pipelines, including 5.1-channel surround systems compliant with ITU-R BS.775, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, together with full compatibility with professional film production standards.

Method Overview


We propose FoleyDesigner, a novel system for controllable film foley generation that mirrors professional post-production workflows. Given a silent film clip and its script, our pipeline operates in three sequential stages:

1. Fine-Grained Film Decomposition employs Tree-of-Thought reasoning to analyze visual content and narrative context, producing hierarchical Foley scripts that decompose densely overlapping sound events into temporally layered foreground and background elements.
2. Spatio-Temporal Foley Generation synthesizes each event as stereo audio using a DiT-based latent diffusion model conditioned on textual descriptions and spatio-temporal cues extracted from visual tracking. Depth and azimuth coordinates are derived from monocular depth estimation and bounding-box localization, then injected into the diffusion backbone via cross-attention to ensure frame-accurate temporal alignment and precise spatial positioning.
3. Foley Refinement and Professional Mixing applies multi-agent audio processing to diagnose and correct acoustic inconsistencies. Specialist agents perform reverberation matching, equalization for spectral clarity, and dynamics balancing to achieve professional acoustic quality, followed by channel-wise upmixing to 5.1 surround format following ITU-R BS.775 specifications.

This architecture addresses critical limitations of prior work by enabling explicit decomposition of complex soundscapes, grounding audio generation in quantitative visual-spatial cues, and ensuring cinematic-grade acoustic refinement.
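The bounding-box-to-azimuth mapping in stage 2 can be sketched as follows. This is a minimal illustration, not the paper's exact geometry: the 60-degree field of view, the linear pixel-to-angle mapping, and the function name are all assumptions.

```python
def spatial_cue(bbox, depth_value, frame_width, fov_deg=60.0):
    """Derive an (azimuth, depth) cue for one tracked sound source.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    depth_value: relative depth at the box centre from a monocular
    depth estimator, assumed normalised to [0, 1].
    fov_deg: assumed horizontal camera field of view (illustrative).
    """
    x_center = (bbox[0] + bbox[2]) / 2.0
    # Map the horizontal pixel position to [-1, 1] across the frame.
    x_norm = 2.0 * x_center / frame_width - 1.0
    # Linear mapping from normalised frame position to azimuth (degrees).
    azimuth = x_norm * fov_deg / 2.0
    return azimuth, depth_value

# A source centred at x = 1600 in a 1920-wide frame sits right of centre.
az, d = spatial_cue((1500, 400, 1700, 800), 0.35, 1920)
```

Per-frame pairs like `(azimuth, depth)` would then be fed to the diffusion backbone as the conditioning sequence described above.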

FilmStereo Dataset

FilmStereo Dataset Overview

We present FilmStereo, a stereo audio dataset designed for film foley generation that pairs spatially and temporally annotated audio with detailed captions. Covering eight common foley categories, the dataset is created by collecting and preprocessing audio clips to ensure clarity, followed by spatial audio simulation using gpuRIR and reverberation effects. Semantic alignment is verified with CLAP, while spatially aware captions and timestamped sound events are generated through chain-of-thought prompting and amplitude analysis, with manual quality control to ensure dataset reliability.
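The amplitude-based event timestamping used during annotation can be sketched as a simple short-time RMS threshold. The frame length, threshold, and function name below are illustrative assumptions, not the dataset's actual settings.

```python
import numpy as np

def timestamp_events(samples, sr, frame_ms=20, threshold_db=-35.0):
    """Locate sound-event intervals by short-time amplitude.

    Returns (start_s, end_s) spans whose frame RMS exceeds
    `threshold_db` relative to full scale.
    """
    hop = int(sr * frame_ms / 1000)
    n_frames = len(samples) // hop
    frames = samples[:n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    active = 20 * np.log10(rms) > threshold_db
    events, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            events.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        events.append((start * hop / sr, n_frames * hop / sr))
    return events
```

In practice such automatic spans would be the starting point for the manual quality control pass mentioned above.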

Experiment

Spatial Analysis. Our method demonstrates proper stereo separation with left channel (L) weakening and right channel (R) strengthening, while SpatialSonic (stereo output) shows limited spatial variation despite having two-channel capability.
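The channel-energy comparison behind this analysis can be reproduced with per-window RMS curves for each stereo channel. This is a generic measurement sketch with an illustrative window length, not the evaluation script used in the paper.

```python
import numpy as np

def channel_energy_curves(stereo, sr, win_s=0.25):
    """Per-window RMS for each channel of a stereo signal.

    stereo: array of shape (2, n_samples).
    A widening gap between the two curves indicates a left-to-right
    pan; near-identical flat curves indicate little spatial variation.
    """
    hop = int(sr * win_s)
    n = stereo.shape[1] // hop
    trimmed = stereo[:, :n * hop].reshape(2, n, hop)
    return np.sqrt(np.mean(trimmed ** 2, axis=2))  # shape (2, n)
```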

Temporal Analysis. Each video frame corresponds to a temporal segment in the spectrogram below. Yellow checkmarks indicate successful audio-visual synchronization. Our method achieves consistent temporal alignment across key events, while baseline methods show varying degrees of synchronization failure regardless of their output channel configuration.

Experimental Setup

Our user study was conducted through both offline and online evaluations to comprehensively assess the perceived quality of generated foley audio.

Offline Evaluation

We recruited 12 participants with normal hearing to conduct perceptual evaluation in a professional audio mixing studio with controlled acoustic conditions. The evaluation environment featured a standard 5.1 surround sound system configured according to ITU-R BS.775-3 specifications, as illustrated below.

5.1 Surround Sound Setup

User study setup. (a) Standard 5.1 surround sound speaker configuration showing Front Left (FL), Center (C), Front Right (FR), Low Frequency Effects (LFE), Left Surround (LS), and Right Surround (RS) positions. (b) Professional mixing studio environment.

The listening room was acoustically optimized with controlled reverberation time and minimal background noise. Participants were positioned at the sweet spot, maintaining equal distance from all speakers. The audio playback system utilized monitors with flat frequency response to ensure accurate sound reproduction.
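The nominal loudspeaker azimuths of this layout follow the 3/2 configuration of ITU-R BS.775: front pair at plus or minus 30 degrees, centre at 0, surrounds within the recommended 100 to 120 degree range (110 is a common choice, assumed here). The LFE channel has no prescribed direction. A small sketch for placing speakers around the sweet spot:

```python
import math

# Nominal azimuths in degrees (0 = straight ahead, positive = right).
# Surround angles are the common 110-degree choice within BS.775's
# recommended 100-120 degree range.
ITU_775_AZIMUTHS = {
    "C": 0.0, "FL": -30.0, "FR": 30.0, "LS": -110.0, "RS": 110.0,
}

def speaker_position(channel, radius_m=2.0):
    """Cartesian (x, y) of a speaker on a circle around the sweet spot,
    with +y pointing toward the screen and +x to the listener's right."""
    az = math.radians(ITU_775_AZIMUTHS[channel])
    return radius_m * math.sin(az), radius_m * math.cos(az)
```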

Participants in Studio

Participants conducting the perceptual evaluation in the professional mixing studio environment.

Online Evaluation

We conducted an online questionnaire-based evaluation with 53 participants, categorized into two groups: 23 film audio professionals (43.4%) and 30 non-professionals (56.6%). Participants evaluated stereo audio samples through a web-based interface. For baseline comparisons, we evaluated our FoleyDesigner against three state-of-the-art methods: See2Sound, Stable-Audio-Open, and SpatialSonic.

Participants rated each method across five dimensions: (1) timbre consistency with film content; (2) spatial alignment relative to visual sources; (3) temporal alignment with visual events; (4) emotional alignment with scene atmosphere; (5) immersion.
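Aggregating such ratings into per-method, per-dimension mean opinion scores is straightforward; the tuple schema below is an illustrative assumption, not the study's actual questionnaire export format.

```python
from collections import defaultdict

def mean_scores(ratings):
    """Average 1-5 ratings per (method, dimension) pair.

    `ratings` is an iterable of (method, dimension, score) tuples,
    one per participant response.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for method, dim, score in ratings:
        sums[(method, dim)] += score
        counts[(method, dim)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```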

Conclusion

We introduce FoleyDesigner, a novel framework for generating stereo audio with precise spatio-temporal alignment for silent film clips. Drawing inspiration from the established workflow of professional foley artists, FoleyDesigner integrates multi-stage modules including film clip perception, spatio-temporal cue extraction, stereo audio generation, and multi-agent mixing. To support stereo foley generation, we introduce the FilmStereo dataset with detailed spatial and temporal annotations. Experimental results on multiple benchmarks demonstrate that FoleyDesigner achieves superior performance compared to existing methods. Beyond academic benchmarks, FoleyDesigner can be applied to real-world scenarios, including post-production for films and immersive sound design for virtual reality applications. Its modular design adapts to professional audio workflows, enabling efficient and controllable foley generation.