Image and Video Completion – Computer Vision and Robotics Laboratory

Sound2Sight: Generating Visual Dynamics from Sound and Context

Posted on August 25, 2021 by 1qs9y

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis – a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames.

Visual Scene Graphs for Audio Source Separation

Posted on August 25, 2021 by 1qs9y

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to characterize the sources better, especially when the same object class may produce varied sounds from distinct interactions.

A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Posted on August 25, 2021 by 1qs9y

Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to this task typically estimate a latent prior characterizing this stochasticity, but do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high.
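As a rough illustration of the MSE training signal mentioned above, the sketch below computes the per-pixel MSE between a generated frame and a ground-truth frame. The function name `mse` and the toy 1x2 "frames" are assumptions made for illustration and are not taken from the paper. The toy case shows one commonly noted weakness of MSE under stochastic futures: when two futures are equally likely, the averaged (blurry) frame scores a lower expected MSE than either sharp outcome. This is offered as intuition only, not as the paper's own argument.

```python
def mse(pred, target):
    """Mean-squared error between two frames given as nested lists of pixel values."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("frames must have the same number of pixels")
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)


# Hypothetical toy setup: two equally likely future frames.
future_a = [[0.0, 1.0]]
future_b = [[1.0, 0.0]]

# The pixel-wise average of the two futures (a "blurry" prediction).
blurry = [[0.5, 0.5]]

# Expected MSE when always predicting the blurry average.
expected_blurry = 0.5 * mse(blurry, future_a) + 0.5 * mse(blurry, future_b)

# Expected MSE when always committing to one sharp future (here, future_a).
expected_sharp = 0.5 * mse(future_a, future_a) + 0.5 * mse(future_a, future_b)
```

Here `expected_blurry` evaluates to 0.25 and `expected_sharp` to 0.5, so a model trained purely on MSE is rewarded for the averaged frame.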

Block-based motion estimation for missing video frame interpolation, and spatially scalable (multi-resolution) video coding

Posted on October 7, 2001 (updated July 28, 2021) by 1qs9y

Video frames are often dropped during compression at very low bit rates. At the decoder, a missing-frame interpolation method synthesizes the dropped frames. We propose a two-step motion estimation method for the interpolation. More specifically, the coarse motion vector field is refined at the decoder using mesh-based motion estimation instead of computationally intensive dense motion estimation.
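For readers unfamiliar with block-based motion estimation, the following minimal sketch shows how a coarse motion vector field can be obtained by exhaustive block matching with a sum-of-absolute-differences (SAD) cost. It is a generic textbook technique under stated assumptions, not the method proposed in the post: the function names `sad` and `block_motion`, the block size, the search range, and the toy frames are all hypothetical, and the mesh-based refinement step is not shown.

```python
def sad(prev, curr, by, bx, dy, dx, size):
    """SAD between the block at (by, bx) in curr and the block displaced by (dy, dx) in prev."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(curr[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total


def block_motion(prev, curr, size=2, search=1):
    """Return {(by, bx): (dy, dx)}: for each block of curr, the offset of its best match in prev."""
    h, w = len(curr), len(curr[0])
    field = {}
    for by in range(0, h - size + 1, size):
        for bx in range(0, w - size + 1, size):
            # Start from zero displacement so ties favor a static block.
            best = (0, 0)
            best_cost = sad(prev, curr, by, bx, 0, 0, size)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidates whose displaced block falls outside prev.
                    inside = (0 <= by + dy and by + dy + size <= h
                              and 0 <= bx + dx and bx + dx + size <= w)
                    if not inside:
                        continue
                    cost = sad(prev, curr, by, bx, dy, dx, size)
                    if cost < best_cost:
                        best, best_cost = (dy, dx), cost
            field[(by, bx)] = best
    return field


# Toy 4x4 frames (assumed data): a 2x2 pattern moves one pixel to the right.
prev = [
    [0, 5, 9, 0],
    [0, 7, 3, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
curr = [
    [0, 0, 5, 9],
    [0, 0, 7, 3],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

field = block_motion(prev, curr)
```

In this toy case the block at row 0, column 2 of `curr` is matched one pixel to the left in `prev`, so `field[(0, 2)]` is `(0, -1)`. A decoder could use such a coarse field as the starting point for a finer refinement stage.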