Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically
different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations.

Read More “Tracking Persons-of-Interest via Unsupervised Representation Adaptation”

Sound2Sight: Generating Visual Dynamics from Sound and Context

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis – a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames.

Read More “Sound2Sight: Generating Visual Dynamics from Sound and Context”

Remove to Improve

The workhorses of CNNs are its filters, located at different layers and tuned to different features. Their responses are combined using weights obtained via network training. Training is aimed at optimal results for the entire training data, e.g., highest average classification accuracy. In this paper, we are interested in extending the current understanding of the roles played by the filters, their mutual interactions, and their relationship to classification accuracy.

Read More “Remove to Improve”

Unsupervised 3D Pose Estimation for Hierarchical Dance Video

Dance experts often view dance as a hierarchy of information, spanning low-level (raw images, image sequences), mid-levels (human poses and bodypart movements), and high-level (dance genre). We propose a Hierarchical Dance Video Recognition framework (HDVR). HDVR estimates 2D pose sequences, tracks dancers, and then simultaneously estimates corresponding 3D poses and 3D-to-2D imaging parameters, without requiring ground truth for 3D poses.

Read More “Unsupervised 3D Pose Estimation for Hierarchical Dance Video”

Visual Scene Graphs for Audio Source Separation

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to characterize the sources better, especially when the same object class may produce varied sounds from distinct interactions.
Read More “Visual Scene Graphs for Audio Source Separation”

A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to solve this task typically estimate a latent prior characterizing this stochasticity, however do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high.

Read More “A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction”