Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis – a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames.
Category: Learning
Remove to Improve
The workhorses of a CNN are its filters, located at different layers and tuned to different features. Their responses are combined using weights obtained via network training. Training aims for optimal results over the entire training set, e.g., the highest average classification accuracy. In this paper, we are interested in extending the current understanding of the roles played by the filters, their mutual interactions, and their relationship to classification accuracy.
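As a rough, hypothetical illustration of this kind of filter-level analysis (not the experimental protocol of the paper), the following Python sketch zeroes out a single convolutional filter and measures the resulting change in classification accuracy; the model, layer index, and data loader are placeholders.

```python
# Hypothetical filter-ablation sketch (PyTorch). Model, layer, and data loader
# are placeholders; this is not the paper's experimental protocol.
import torch
import torch.nn as nn

def zero_filter(conv: nn.Conv2d, filter_idx: int) -> None:
    """Ablate one filter by zeroing its weights (and bias term, if present)."""
    with torch.no_grad():
        conv.weight[filter_idx].zero_()
        if conv.bias is not None:
            conv.bias[filter_idx].zero_()

@torch.no_grad()
def accuracy(model: nn.Module, loader) -> float:
    """Top-1 accuracy over a labelled data loader."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Hypothetical usage:
#   base = accuracy(model, val_loader)
#   zero_filter(model.features[0], filter_idx=3)   # layer and index are placeholders
#   print("accuracy change:", accuracy(model, val_loader) - base)
```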
Visual Scene Graphs for Audio Source Separation
A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction
Predicting the future frames of a video is a challenging task, in part due to underlying stochastic real-world phenomena. Prior approaches to this task typically estimate a latent prior characterizing this stochasticity; however, they do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high.
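For context, the sketch below contrasts the plain MSE objective with a generic Gaussian negative log-likelihood in which the network also predicts a per-pixel log-variance; it only illustrates what an uncertainty-aware training signal looks like and is not the hierarchical model proposed in the paper.

```python
# Generic illustration only (not the model proposed in the paper): an
# uncertainty-aware reconstruction loss where the network predicts a per-pixel
# log-variance, compared with the uncertainty-blind MSE baseline.
import torch

def gaussian_nll(pred_mean: torch.Tensor, pred_log_var: torch.Tensor,
                 target: torch.Tensor) -> torch.Tensor:
    """Per-pixel Gaussian negative log-likelihood. The predicted log-variance
    lets the model down-weight pixels it is uncertain about, at the cost of a
    log penalty that discourages inflating the variance everywhere."""
    inv_var = torch.exp(-pred_log_var)
    return 0.5 * (inv_var * (target - pred_mean) ** 2 + pred_log_var).mean()

def mse(pred_mean: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """The baseline training signal the abstract criticises."""
    return ((target - pred_mean) ** 2).mean()
```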
Scene Classification
We use segmentation-based features for semantic classification of real images. We model the image in terms of a probability density function of its region features, specifically a Gaussian mixture model (GMM). This GMM is fit to the image by adapting a universal GMM, which is estimated so that it fits all images.
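The sketch below illustrates the general idea, assuming region features are already extracted: a universal GMM is fit to features pooled over all images, and its means are then adapted to a single image with a simple MAP-style update. The exact adaptation rule used in this work may differ.

```python
# Illustrative sketch, assuming region features are already extracted: fit a
# universal GMM over all images, then adapt its means to one image with a
# simple MAP-style update. The exact adaptation rule in this work may differ.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_universal_gmm(all_region_features: np.ndarray,
                      n_components: int = 16) -> GaussianMixture:
    """One GMM fit to region features pooled over the whole image collection."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(all_region_features)

def adapt_means(ugmm: GaussianMixture, image_features: np.ndarray,
                relevance: float = 16.0) -> np.ndarray:
    """Shift each component mean toward the image's own region features,
    weighted by the soft count of features assigned to that component."""
    resp = ugmm.predict_proba(image_features)        # (n_regions, n_components)
    counts = resp.sum(axis=0)                        # soft counts per component
    emp_means = resp.T @ image_features / np.maximum(counts, 1e-8)[:, None]
    alpha = (counts / (counts + relevance))[:, None] # adaptation strength
    return alpha * emp_means + (1.0 - alpha) * ugmm.means_
```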
Texture Recognition
Given an arbitrary image, our goal is to segment all distinct texture subimages. This is done by discovering distinct, cohesive groups of spatially repeating patterns, called texels, in the image, where each group defines the corresponding texture. Texels occupy image regions whose photometric, geometric, structural, and spatial-layout properties are samples from an unknown probability density function (pdf).
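A loose illustration of the grouping step, assuming patch or region descriptors are already available, is sketched below: clustering the descriptors yields candidate groups of repeating elements. The actual texel-discovery algorithm in this work is region-based and considerably more elaborate.

```python
# Loose illustration of grouping repeating elements (the texel-discovery
# algorithm in this work is region-based and more elaborate). `descriptors`
# is an (n_patches, dim) array of precomputed photometric/geometric features.
import numpy as np
from sklearn.cluster import MeanShift

def candidate_texel_groups(descriptors: np.ndarray, bandwidth=None):
    """Cluster descriptors; each cluster is a candidate group of texels
    belonging to one texture."""
    labels = MeanShift(bandwidth=bandwidth).fit_predict(descriptors)
    return [np.flatnonzero(labels == k) for k in np.unique(labels)]
```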
Region Based Image Matching
We propose novel approaches to region-based hierarchical image matching where, given two images, the goal is to identify the largest part in image 1 and its match in image 2 having the maximum similarity measure, defined in terms of geometric and photometric properties of regions (e.g., area, boundary shape, and color) as well as region topology.
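A hypothetical similarity measure of this kind is sketched below; the specific terms and weights are illustrative only and are not the measure defined in the work.

```python
# Hypothetical region similarity combining geometric and photometric cues;
# the terms and weights are illustrative, not the measure defined in the work.
import numpy as np

def region_similarity(r1: dict, r2: dict,
                      w_area: float = 1.0, w_color: float = 1.0,
                      w_shape: float = 1.0) -> float:
    """Each region is a dict with 'area' (scalar), 'mean_color' (3-vector in
    [0, 255]), and 'shape' (a fixed-length boundary-shape descriptor)."""
    area_term = abs(r1["area"] - r2["area"]) / max(r1["area"], r2["area"])
    color_term = np.linalg.norm(np.asarray(r1["mean_color"], dtype=float)
                                - np.asarray(r2["mean_color"], dtype=float)) / 255.0
    shape_term = np.linalg.norm(np.asarray(r1["shape"], dtype=float)
                                - np.asarray(r2["shape"], dtype=float))
    cost = w_area * area_term + w_color * color_term + w_shape * shape_term
    return float(np.exp(-cost))   # in (0, 1]; higher means more similar
```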
Connected Segmentation Tree For Object Modeling
We propose a new object representation, called the connected segmentation tree (CST), which captures canonical characteristics of the object in terms of the photometric, geometric, and spatial adjacency and containment properties of its constituent image regions. The CST is obtained by augmenting the object's segmentation tree (ST) with inter-region neighbor links, in addition to the regions' recursive embedding structure already present in the ST.
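A minimal sketch of the CST as a data structure is given below: the usual parent/child (containment) links of a segmentation tree, plus the added neighbor links between spatially adjacent regions. Field names are illustrative.

```python
# Minimal sketch of the CST as a data structure: the parent/child (containment)
# links of a segmentation tree plus neighbor links between spatially adjacent
# regions. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class CSTNode:
    region_id: int
    properties: dict = field(default_factory=dict)   # photometric/geometric attributes
    parent: "CSTNode" = None
    children: list = field(default_factory=list)     # recursive embedding (the ST part)
    neighbors: list = field(default_factory=list)    # added adjacency links (the CST part)

    def add_child(self, child: "CSTNode") -> None:
        child.parent = self
        self.children.append(child)

    def link_neighbor(self, other: "CSTNode") -> None:
        if other not in self.neighbors:
            self.neighbors.append(other)
            other.neighbors.append(self)
```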
Object Category Recognition
Low-level, segmentation-based image features are used for the problem of object categorization. In general, object categorization comprises two main research areas: (1) classification or clustering of images containing objects that belong to an object category, and (2) detection, localization, and segmentation of individual object-category instances in images. The first thrust of research is typically concerned with exemplar-based methods, where the main focus is to develop an efficient distance measure between two images.
Segmentation Based Object Discovery
Given a set of images, possibly containing objects from an unknown category, determine whether a category is present. If it is, learn a spatial and photometric model of the category. Given an unseen image, segment all occurrences of the category.
- S. Todorovic and N. Ahuja, Extracting Subimages of an Unknown Category from a Set of Images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY.
Face Detection
Faces and Gestures
The aforementioned work on representation and learning has contributed to two types of human-computer interfaces we have developed. First, learning and classification techniques, including standard statistical classifiers, neural networks, support vector machines, and artificial-intelligence approaches, have been used to develop new methods for human face detection and hand gesture recognition.
Face Recognition
Learning to Recognize 3D Objects
3D Object Recognition
Recognition is achieved either by explicitly coding the recognition criteria in terms of low-level structure, or through learning from examples. Learning algorithms incorporate subspace projections of higher-dimensional data, either symbolically or using neural approaches.
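To make the subspace-projection idea concrete, the sketch below shows a generic eigenspace-style pipeline (PCA followed by nearest-neighbor matching); it illustrates the principle rather than the specific algorithms developed here.

```python
# Generic eigenspace-style illustration of subspace projection for recognition
# (PCA followed by nearest-neighbor matching); not the specific algorithms
# developed in this work.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def train_subspace_recognizer(views: np.ndarray, labels: np.ndarray, dim: int = 20):
    """`views` is (n_views, n_pixels): vectorised object views. Returns the
    learned projector and a 1-NN classifier operating in the subspace."""
    pca = PCA(n_components=dim).fit(views)
    knn = KNeighborsClassifier(n_neighbors=1).fit(pca.transform(views), labels)
    return pca, knn

def recognize(pca: PCA, knn: KNeighborsClassifier, view: np.ndarray):
    """Project a new view into the subspace and return the predicted label."""
    return knn.predict(pca.transform(view.reshape(1, -1)))[0]
```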
A learning-based account of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability.
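For reference, the standard textbook sample-complexity bound for a finite hypothesis class in the PAC model is reproduced below; it is included only to make the "probably approximately correct" criterion concrete and is not a bound taken from this work.

```latex
% Textbook PAC sample-complexity bound for a finite hypothesis class $H$
% (quoted for reference only; it is not a result from this work).
A hypothesis consistent with
\[
  m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert H\rvert + \ln\tfrac{1}{\delta}\right)
\]
labelled examples has true error at most $\epsilon$ with probability at least $1-\delta$.
```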
Learning for Object Recognition
A learning algorithm for object recognition is developed within the PAC (Probably Approximately Correct) model of learnability.
Learning of Low-level Spatiotemporal Structural Patterns
Given an image or a video sequence, a prespecified set of low-level spatial and/or temporal descriptors of the image/video structure, and a higher-level interpretation of the structure, use computational learning methods to derive a succinct relationship between the interpretation and the low-level structural description.
Gesture Recognition