Click on each video to unmute or mute its generated audio.
Text input: Wave crashes against rocky shoreline with loud splashing and foamy sounds.
Text input: Airplane flyby.
Text input: Train wheels squealing.
Text input: The skateboard wheels scraping and grinding on the ground.
Text input: Typing on typewriter.
Text input: Swimmer's heavy breathing and grunting with each stroke.
Text input: Playing badminton.
Text input: People sobbing.
Text input: Gentle water splashing and trickling over mossy rocks.
Text input: Air horn.
Figure 1: (a) The proposed Flowley framework consists of two core modules. (b) First, visual, textual, and audio latent representations are processed together through the multi-stream block. (c) Latent features are then passed into the single-stream block, where they undergo weighted cross-attention with the visual and textual streams to estimate the flow field. At inference time, we integrate this learned flow using standard ODE solvers to generate the compressed mel-spectrogram, which is subsequently decoded and vocoded into the final audio waveform.