This research theme is concerned with the problem of low-level image segmentation: partitioning an image into regions that represent low-level image structure. A region is characterized as possessing a certain degree of interior homogeneity and a contrast with its surround that is large compared to the interior variation. This is a satisfactory characterization from both perceptual and quantitative viewpoints. Homogeneity and contrast may be defined in different ways: a region may be uniform, in which case its contrast with the surround must be large; alternatively, a region may be shaded, in which case the local contrast across a boundary point must be large compared to the interior variation on each side. The sizes, shapes, types of homogeneity, and contrast values of the regions in an image are a priori unknown. The goal is the accurate detection of regions without using rigid geometric and photometric models, together with automatic estimation of all scales associated with an image.
Related Publications:
Unsupervised video segmentation is a challenging problem because it involves a large amount of data, and image segments undergo noisy variations in color, texture, and motion over time. However, there are significant redundancies that can help disambiguate the effects of noise. To exploit these redundancies and obtain the most spatio-temporally consistent video segmentation, we formulate the problem as a consistent labeling problem that exploits higher-order image structure. A label stands for a specific moving segment. Each segment (or region) is treated as a random variable to which a label is to be assigned. Regions assigned the same label comprise a 3D space-time segment, or region tube. Labels can also be created or terminated automatically at any frame in the video sequence, to allow for objects entering or leaving the scene. We formulate this problem using the conditional random field (CRF) model. Unlike a conventional CRF, which has only unary and binary potentials, we also use higher-order potentials to favor label consistency among disconnected spatial and temporal segments; a schematic form of the resulting energy is given below.
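As a hedged illustration of the model (the notation here is ours, not necessarily the paper's), the labeling energy over region labels x takes the schematic form

E(x) = \sum_i \psi_i(x_i) + \sum_{(i,j) \in \mathcal{N}} \psi_{ij}(x_i, x_j) + \sum_{c \in \mathcal{C}} \psi_c(x_c),

where the unary terms \psi_i score the color, texture, and motion evidence for assigning label x_i to region i, the pairwise terms \psi_{ij} encourage agreement between spatially and temporally neighboring regions, and the higher-order terms \psi_c, defined over groups c of possibly disconnected regions, penalize label disagreement within each group. Inference seeks the labeling x that minimizes E(x), with labels allowed to appear or disappear across frames.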
We present a novel scale adaptive, non-parametric approach to clustering point patterns. Clusters are detected by moving all points to their cluster cores using shift vectors. First, we propose a novel scale selection criterion based on local density isotropy which determines the neighborhoods over which the shift vectors are computed. We then construct a directed graph induced by these shift vectors. Clustering is obtained by simulating random walks on this digraph. We also examine the spectral properties of a similarity matrix obtained from the directed graph to obtain a K-way partitioning of the data. Additionally, we use the eigenvector alignment algorithm of [1] to automatically determine the number of clusters in the dataset.
Statistical models of pixel value variations have been developed and analyzed. Some of the work focuses on kernel density estimators to develop such models. Consequently, the statistical theory of density estimators can be used for various tasks, including segmentation of locally/globally parametric image signals, scale estimation, and object registration. The main projects of this sub-theme are "Bandwidth Selection for Kernel Density Estimators" and "Estimation and Segmentation of Images Using Parametric Image Models", detailed below.
We develop a regression-based model that admits a realistic framework for automatically choosing bandwidth parameters that minimize a global error criterion. This is used for automatic segmentation of images at any input resolution scale (e.g., the wavelet decomposition scale); a sketch of one standard bandwidth-selection criterion is given below.
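As a concrete, hedged illustration of data-driven bandwidth selection (least-squares cross-validation is one standard global error criterion, used here as a stand-in for the regression-based criterion above; the function names are ours), a minimal 1D sketch:

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def lscv_score(x, h):
    """Least-squares cross-validation score for a 1D Gaussian kernel density
    estimate with bandwidth h (integrated squared error up to a constant);
    smaller is better."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    # closed-form integral of fhat^2 for Gaussian kernels
    int_f2 = gauss(d / np.sqrt(2)).sum() / (np.sqrt(2) * n**2 * h)
    # leave-one-out estimate of E[fhat(X)]
    k = gauss(d)
    loo = (k.sum(axis=1) - gauss(0.0)) / ((n - 1) * h)
    return int_f2 - 2.0 * loo.mean()

x = np.random.default_rng(0).normal(size=500)
candidates = np.linspace(0.05, 1.0, 40)
h_best = candidates[np.argmin([lscv_score(x, h) for h in candidates])]
print("selected bandwidth:", h_best)
```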
Related Publications:
Models of spatial variation in images are central to a large number of low-level computer vision problems, including segmentation, registration, and 3D structure detection. Often, images are represented using parametric models that characterize the (noise-free) image variation plus additive noise. However, the noise model may be unknown, and the parametric models may be valid only on individual segments of the image. Consequently, we model noise using a nonparametric kernel density estimation framework and use a locally or globally linear parametric model to represent the noise-free image pattern. This results in a novel, robust, redescending M-estimator of the parameters of the above image model, which we call the Kernel Maximum Likelihood (KML) estimator. We also provide a provably convergent, iterative algorithm for the resulting optimization problem; a simplified sketch of such a robust fit is given below. The estimation framework is empirically validated on synthetic data and applied to the task of range image segmentation.
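As a hedged sketch of the flavor of such an estimator (an iteratively reweighted least-squares fit with Gaussian kernel weights on the residuals, which is redescending; this is an illustration, not the exact KML algorithm or its convergence proof):

```python
import numpy as np

def robust_linear_fit(X, y, h=1.0, iters=50):
    """Fit an affine model y ~ X*theta with Gaussian kernel weights on the
    residuals; large residuals receive (near) zero weight, so outliers and
    pixels from other segments have little influence."""
    A = np.hstack([X, np.ones((len(X), 1))])
    theta = np.linalg.lstsq(A, y, rcond=None)[0]          # ordinary LS start
    for _ in range(iters):
        r = y - A @ theta
        w = np.exp(-0.5 * (r / h) ** 2)                   # kernel weights on residuals
        sw = np.sqrt(w)
        theta_new = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)[0]
        if np.allclose(theta_new, theta, atol=1e-8):
            break
        theta = theta_new
    return theta

# toy example: a noisy plane with gross outliers
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=200)
y[:40] += rng.uniform(2, 5, size=40)
print(robust_linear_fit(X, y, h=0.1))                     # close to [3, -2, 0.5]
```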
Related Publications:
Low-level segmentation-based image features are used for the problem of object categorization. In general, object categorization comprises two main research areas: (1) classification or clustering of images containing objects belonging to an object category, and (2) detection, localization, and segmentation of individual object-category instances in images. The first thrust of research is typically concerned with exemplar-based methods, where the main focus is to develop an efficient distance measure between two images. Work in the second research area is primarily concerned with object-category modeling on training images, and with using the category models for object detection, localization, and segmentation in test images. These approaches differ from object recognition methods in that the category instances in the training and test sets are different.
We use features of segmentation for semantic classification of real images. We model an image in terms of a probability density function, specifically a Gaussian mixture model (GMM), of its region features. This GMM is fit to the image by adapting a universal GMM estimated to fit all images. Adaptation is done using a maximum a posteriori criterion. We use kernelized versions of the Bhattacharyya distance to measure the similarity between two GMMs, and support vector machines to perform classification; a sketch of such a kernel is given below.
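A hedged sketch of such a similarity measure (the component-wise weighting and function names are our assumptions; MAP adaptation from a shared universal GMM keeps the component order aligned across images, which is what makes a component-wise comparison meaningful):

```python
import numpy as np

def bhattacharyya_gauss(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    v = 0.5 * (v1 + v2)
    return (0.125 * np.sum((m1 - m2) ** 2 / v)
            + 0.5 * np.sum(np.log(v / np.sqrt(v1 * v2))))

def gmm_kernel(gmm_a, gmm_b, gamma=1.0):
    """Kernel between two adapted GMMs, each given as a dict with 'weights',
    'means', and 'vars' lists of matching length and component order."""
    d = sum(w * bhattacharyya_gauss(ma, va, mb, vb)
            for w, ma, va, mb, vb in zip(gmm_a["weights"],
                                         gmm_a["means"], gmm_a["vars"],
                                         gmm_b["means"], gmm_b["vars"]))
    return np.exp(-gamma * d)
```

The resulting Gram matrix of pairwise kernel values can then be passed to an SVM with a precomputed kernel (e.g., scikit-learn's SVC(kernel='precomputed')) for classification.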
Given an arbitrary image, our goal is to segment all distinct texture subimages. This is done by discovering distinct, cohesive groups of spatially repeating patterns, called texels, in the image, where each group defines the corresponding texture. Texels occupy image regions whose photometric, geometric, structural, and spatial-layout properties are samples from an unknown pdf. If the image contains texture, then, by definition, it also contains a large number of statistically similar texels. This, in turn, gives rise to modes in the pdf of region properties. Texture segmentation can thus be formulated as identifying the modes of this pdf. To this end, we first use a low-level, multiscale segmentation to extract image regions at all scales present. We then use mean shift with a new, variable-bandwidth, hierarchical kernel to identify the modes of the pdf defined over the extracted hierarchy of image regions; a simplified, fixed-bandwidth sketch of such mode seeking is given below. The hierarchical kernel is aimed at capturing texel substructure.
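For intuition, a minimal fixed-bandwidth mean-shift sketch over generic feature vectors (the variable-bandwidth, hierarchical kernel over a region hierarchy is more elaborate; the names here are ours):

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=100, tol=1e-5):
    """Move every feature vector uphill to a mode of the Gaussian-kernel
    density estimate, then group points whose modes (nearly) coincide."""
    modes = points.astype(float).copy()
    for _ in range(iters):
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)
        new = (w[:, :, None] * points[None, :, :]).sum(1) / w.sum(1, keepdims=True)
        shift = np.abs(new - modes).max()
        modes = new
        if shift < tol:
            break
    _, labels = np.unique(np.round(modes / (0.5 * bandwidth)).astype(int),
                          axis=0, return_inverse=True)
    return modes, labels
```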
Given a set of images, possibly containing objects from an unknown category, determine if a category is present. If a category is present, learn spatial and photometric model of the category. Given an unseen image, segment all occurrences of the category.
We propose novel approaches to region-based hierarchical image matching, where, given two images, the goal is to identify the largest part in image 1 and its match in image 2 having the maximum similarity measure defined in terms of geometric and photometric properties of regions (e.g., area, boundary shape, and color), as well as region topology (e.g., recursive embedding of regions).
We propose a new object representation, called connected segmentation tree (CST), which captures canonical characteristics of the object in terms of the photometric, geometric, and spatial adjacency and containment properties of its constituent image regions. CST is obtained by augmenting the object's segmentation tree (ST) with inter-region neighbor links, in addition to their recursive embedding structure already present in ST. This makes CST a hierarchy of region adjacency graphs. A region's neighbors are computed using an extension to regions of the Voronoi diagram for point patterns. Unsupervised learning of the CST model of a category is formulated as matching the CST graph representations of unlabeled training images, and fusing their maximally matching subgraphs.
Recognition is achieved either by explicitly coding the recognition criteria in terms of low level structure, or through learning from examples. Learning algorithms incorporate subspace projections of higher dimensional data symbolically or using neural approaches.
A learning account of the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability. The key assumption underlying this work is that objects can be recognized (or discriminated) using simple representations in terms of "syntactically" simple relations over the raw image. Although the potential number of these simple relations could be huge, only a few of them are actually present in each observed image, and a fairly small number of those observed is relevant to discriminating an object. We show that these properties can be exploited to yield an efficient learning approach, in terms of sample and computational complexity, within the PAC model. No assumptions are needed on the distribution of the observed objects, and the learning performance is quantified relative to its past experience. Most importantly, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image. We evaluate this approach in a large-scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Database (COIL-100). Experimental results exhibit very good generalization and robustness properties of the SNoW-based method relative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views, and it shows similar behaviour in some preliminary experiments with partially occluded objects.
A learning algorithm for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability. We evaluate this approach using the COIL-100 database and exhibit its advantages over conventional methods.
Given an image or a video sequence, a prespecified set of low level, spatial and/or temporal descriptors of the image/video structure, and a higher level interpretation of the structure, use computational learning methods to derive a succinct relationship between the interpretation and the low level structural description.
The aforementioned work on representation and learning has contributed to two types of human-computer interfaces we have developed. First, learning and classification techniques, including standard statistical classifiers, neural networks, support vector machines, and artificial intelligence approaches, have been used to develop new methods for human face detection and hand gesture recognition.
The GIST (Gesture Interpretation using Spatio-Temporal analysis) project is an attempt to recognize and interpret American Sign Language gestures in a video sequence, based on an integrated analysis of motion segmentation, shape, size, and color. A multi-scale motion segmentation based on Ahuja's New Transform is applied to a video sequence to obtain motion regions and their correspondences across frames. Regions of interest, such as the fingertip, palm, and elbow, are extracted from the motion-segmented images by formulating and solving a constraint satisfaction problem. From these joints, pixel trajectories are extracted. A spatio-temporal analysis based on a time-delay neural network is applied to classify these patterns. The ultimate goal of GIST is to allow content-based video retrieval based on video clips and a better understanding of motion segmentation.
We present a probabilistic method to detect human faces using a mixture of factor analyzers. One characteristic of this mixture model is that it concurrently performs clustering and, within each cluster, local dimensionality reduction. A wide range of face images, consisting of faces in different poses, with different expressions, and under different lighting conditions, is used as the training set to capture the variations of human faces. To fit the mixture model to the sample face images, the parameters are estimated using an EM algorithm. Experimental results show that faces in different poses, with facial expressions, and under different lighting conditions are detected by our method.
To develop methods that determine the identity of a person from a frontal image, and to evaluate their performance against state-of-the-art methods.
In this work, we propose an analytical solution to non-frontal camera calibration in a generalized pupil-centric imaging framework. The decentering distortion is explicitly modelled as a sensor rotation with respect to the lens plane. The rotation parameters are then computed analytically along with the other calibration parameters. The centre of radial distortion is then computed given the analytical solution. We also examine the radial alignment constraint (RAC) of Tsai and generalize it to a non-frontal setting, proposing a generalized radial alignment constraint (gRAC). In this new setting, we derive an analytical solution for a subset of the calibration parameters and propose techniques to handle the resulting ambiguities. We also propose a focal-stack calibration which uses a non-frontal image sensor to capture the focal stack and uses the blurring information in the focal stack to improve on traditional camera calibration techniques.
Many computational imaging applications involve manipulating the incoming light beam in the aperture and image planes. However, accessing the aperture, which conventionally lies inside the imaging lens, remains challenging. In this paper, we present an approach that provides access to the aperture plane and enables dynamic control of its transmissivity, position, and orientation. Specifically, we present two kinds of compound imaging systems (CIS), CIS1 and CIS2, which reposition the aperture in front of and behind the imaging lens, respectively. CIS1 repositions the aperture plane in front of the imaging lens and enables dynamic control of the light beam entering the lens; this control is quite useful in single-viewpoint panoramic imaging. CIS2 uses a rear-attached relay system (lens) to relocate the aperture plane behind the imaging lens, and enables dynamic control of the imaging light jointly formed by the imaging lens and the relay lens. In this way, the common imaging beam can be coded or split in the aperture plane to achieve many imaging functions, such as coded aperture imaging, high dynamic range (HDR) imaging, and light field sampling. In addition, CIS2 repositions the aperture behind, instead of inside, the relay lens, which allows the use of an optimized relay lens to preserve high imaging quality. Finally, we present physical implementations of CIS1 and CIS2 to demonstrate (1) their effectiveness in providing access to the aperture and (2) the advantages of aperture manipulation in computational imaging applications.
In developing the new opto-geometric configurations, we have found that certain classical models and approaches cease to be adequate. For example, the long-established Gaussian model of image formation fails to adequately predict the acquired images, and the optical and geometric phenomena ignored in the traditional characterization of the most-focused scene point make the traditional methods of focus analysis unacceptable. We have replaced the old models with new, more rigorous, and satisfactory models. These new models are also useful in contexts other than next-generation camera designs: they improve the performance of currently "acceptable" systems, and extend the applicability of computer vision methods to many scenarios and applications that were otherwise out of reach.
We discuss how to generate omnifocus images from a sequence of images taken at different focal settings. We first show that existing focus measures encounter difficulty in detecting which frame is most focused for pixels in the regions between intensity edges and uniform areas. We then propose a new focus measure that handles this problem. After computing focus measures for every pixel in all images, we construct a three-dimensional (3D) node-capacitated graph and apply a graph-cut-based optimization method to estimate a spatio-focus surface that minimizes the sum of the new focus measure values on this surface. An omnifocus image can be generated directly from this minimal spatio-focus surface. Experimental results with simulated and real scenes are provided; a simple baseline illustrating the overall idea is sketched below.
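For context, a hedged baseline sketch: a standard focus measure (the sum-modified-Laplacian) with an independent per-pixel argmax over the focal stack. The work above replaces both the measure and the per-pixel decision (using its new measure and a graph-cut search for a smooth spatio-focus surface), so this is only an illustration of the pipeline, not the proposed method:

```python
import numpy as np

def modified_laplacian(img):
    """Sum-modified-Laplacian focus measure (a common baseline)."""
    f = img.astype(float)
    dxx = np.abs(2 * f - np.roll(f, 1, axis=1) - np.roll(f, -1, axis=1))
    dyy = np.abs(2 * f - np.roll(f, 1, axis=0) - np.roll(f, -1, axis=0))
    return dxx + dyy

def naive_omnifocus(stack):
    """Pick, for each pixel, the frame of the focal stack with the highest
    focus measure, and assemble the corresponding all-in-focus image."""
    stack = np.asarray(stack, dtype=float)                  # (n_frames, H, W)
    focus = np.stack([modified_laplacian(f) for f in stack])
    best = focus.argmax(axis=0)                             # most-focused frame per pixel
    H, W = best.shape
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    return stack[best, rows, cols], best
```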
We have developed a camera which is capable of acquiring very large field of view (FOV) images at high and uniform resolution, from a single viewpoint, at video rates. The FOV can range from being nearly hemispherical, to being nearly omni-directional, barring some small scene parts being obstructed by image sensors themselves. The camera consists of multiple imaging sensors and a hexagonal prism made of planar mirror faces. Each sensor is paired with a planar face of the prism. The sensors are positioned in such a way that they image different parts of the scene from a single virtual viewpoint, either directly or after reflections off the prism. A panoramic image is constructed by concatenating the images taken by different sensors. The resolution of the panoramic image is proportional to the number of sensors used and therefore a multiple of that of an individual sensor. Further, the resolution is substantially uniform across the entire panoramic image.
1. Introduction
A panoramic camera is an imaging device capable of capturing a very large field of view (FOV). As with any camera, it is desirable that such a camera acquire the entire FOV from a single viewpoint, in real time, at a high resolution which is uniform across the FOV, with a large dynamic range, and over a large depth of field. Such devices find applications in many areas, including tele-conferencing, surveillance, and robot navigation. Many efforts have been made to achieve various subsets of these properties (i.e., wide FOV, high and uniform resolution, large depth of field, high dynamic range, a single viewpoint, and real-time acquisition). These methods of capturing panoramic or omni-directional images fall into two categories: dioptric methods, where only refractive elements (lenses) are employed, and catadioptric methods, where a combination of reflective and refractive components is used.
Typical dioptric systems include camera clusters, panning cameras, and fisheye lenses. Catadioptric methods include curved-mirror systems, where a conventional camera captures the scene reflected off a single non-planar mirror (e.g., a parabolic or hyperbolic mirror), and planar-mirror systems, such as mirror-pyramid systems where multiple conventional cameras image the scene reflected off the faces of a mirror pyramid. The cameras that use a parabolic or hyperbolic mirror to map an omni-directional view onto a single sensor are able to capture a large FOV from a single viewpoint at video rate. However, the FOV shape is a hemisphere minus a central cone blocked by self-occlusion. The overall resolution of the acquired images is limited to that of the sensor used, and it further varies with the viewing direction across the ring-like FOV, e.g., from a maximum just outside the central blind spot to a minimum in the periphery. Cameras using a spherical or conical mirror have properties similar to those using parabolic or hyperbolic mirrors, except that they do not possess a single viewpoint.
Many of the aforementioned systems provide a cylindrical shape FOV which is 360 degrees wide in azimuth, but has limited height in elevation (Fig. 1a). In certain applications such as robot navigation and surveillance, however, a hemispherical shape FOV is highly desirable (Fig. 1b). We have developed a system which is capable of acquiring hemispherical panoramic images in real time, with high and substantially uniform resolution, and from a single viewpoint. By substantially uniform resolution we mean the same level of uniformity as delivered by a conventional, non-panoramic camera.
We describe a new omnidirectional stereo imaging system that uses a concave lens and a convex mirror to produce a stereo pair of images on the sensor of a conventional camera. The light incident from a scene point is split and directed to the camera in two parts. One part reaches the camera directly after reflection from the convex mirror and forms a single-viewpoint omnidirectional image. The second part is formed by passing a sub-beam of the light reflected from the mirror through a concave lens, and forms a displaced single-viewpoint image whose disparity depends on the depth of the scene point. A closed-form expression for depth is derived. Since the optical components used are simple and commercially available, the resulting system is compact and inexpensive. This, and the simplicity of the required image processing algorithms, make the proposed system attractive for real-time applications such as autonomous navigation and object manipulation. The experimental prototype we have built is described.
The concept of the omnifocus nonfrontal imaging camera, OMNICAM or NICAM, initiated a new chapter in imaging and digital cameras. NICAM has introduced hitherto non-existent imaging capabilities, in addition to overcoming some problems with previous methods. NICAM is capable of acquiring seamless panoramic images and range estimates of wide scenes with all objects in focus, regardless of their locations. To understand the impact of NICAM, first consider imaging with conventional cameras. The camera's field of view is generally much smaller than the entire visual field of interest. Consequently, the camera must pan across the scene of interest, focus on one part at a time, and acquire an image of each part. All the resulting images together then capture the complete scene. As a by-product of focusing, the range of the objects in the scene can also be estimated. The usual methods for focusing, as well as for range estimation from focusing, mechanically relocate the sensor plane, thereby varying the focus distance setting in the camera. When a scene point appears in sharp focus, the corresponding depth and focus distance values satisfy the lens law. The depth of the scene point can then be calculated from the focal length and the focus distance.
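For concreteness, the lens law mentioned above is the standard thin-lens relation 1/f = 1/u + 1/v between focal length f, object depth u, and the lens-to-sensor (focus) distance v; this is textbook optics rather than anything NICAM-specific:

```python
def depth_from_focus(f_mm, v_mm):
    """Solve the thin-lens law 1/f = 1/u + 1/v for the object depth u."""
    return f_mm * v_mm / (v_mm - f_mm)

# e.g., a 50 mm lens whose sensor sits 52 mm behind the lens focuses objects at ~1300 mm
print(depth_from_focus(50.0, 52.0))
```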
Most imaging sensors have a limited dynamic range and hence can respond satisfactorily to only a part of the illumination levels present in a scene. This is particularly disadvantageous for omnidirectional and panoramic cameras, since larger fields of view contain larger brightness ranges. We propose a simple modification to existing high-resolution omnidirectional/panoramic cameras in which the process of increasing the dynamic range is coupled with the process of increasing the field of view. This is achieved by placing a graded transparency (mask) in front of the sensor, which allows every scene point to be imaged under multiple exposure settings as the camera pans, a process anyway required to capture large fields of view at high resolution. The sequence of images is then mosaicked to construct a high-resolution, high dynamic range panoramic/omnidirectional image. Our method is robust to alignment errors between the mask and the sensor grid and does not require the mask to be placed on the sensing surface. We have designed a panoramic camera with the proposed modifications and discuss various theoretical and practical issues encountered in obtaining a robust design. We show an example of a high-resolution, high dynamic range panoramic image obtained from the camera we designed.
To acquire panoramic video sequences, we have developed two types of Double-Mirror-Pyramid cameras that capture up to 360-degree fields of view at high-resolution. The first one, A Single View Double-Mirror-Pyramid Panoramic Camera, acquires a single sequence from one viewpoint, whereas the second, A Multiview Double-Mirror-Pyramid Panoramic Camera, provides multiple video sequences each taken from a different viewpoint, e.g. stereo sequences for 3D viewing. Both of these cameras belong to the family of pyramid cameras.
Related projects:
Panoramic images and video are useful in many applications such as special effects, immersive virtual reality environments, and video games. Among the numerous devices proposed for capturing panoramas, mirror pyramid-based camera systems are a promising approach for video rate capture, as they offer single-viewpoint imaging, and use only flat mirrors that are easier to produce than curved mirrors. Past work has focused on capturing panoramas from a single viewpoint.
In this work, we have extended our work on the Double Mirror Pyramid Panoramic Camera, that acquires panoramic images from a single viewpoint, to multiple viewpoints.
A mirror pyramid consists of a set of flat mirror faces arranged around an axis of symmetry, inclined to form a pyramid. By strategically positioning a number of conventional cameras around a mirror pyramid, the viewpoints for the individual cameras' mirror images can be colocated at a single point within the pyramid, effectively forming a virtual camera with a wide field of view. Mirror pyramid-based panoramic cameras have a number of attractive properties, including
Currently existing designs realize a single viewpoint within each mirror pyramid. In order to capture panoramas from multiple viewpoints with these designs, the entire physical setup would need to be relocated or duplicated. The former solution lacks the capability of video rate imaging, and the latter leads to bulky designs due to the multiple mirror pyramids.
Multiview Double Mirror Pyramid Cameras
We have extended the mirror pyramid panoramic camera to a generalized design that accommodates multiple viewpoints. Each viewpoint is the common mirror image of the optical centers of a set of physical cameras located outside the pyramid, and together these cameras yield a seamless panoramic image. Each camera set yields a panoramic image from its associated viewpoint. The result is simultaneous, multiview, panoramic, video-rate imaging with a compact design. Using a double mirror pyramid, i.e., two pyramids back to back with a shared base, doubles the height of the visual field in a manner similar to monocular imaging in our Double Mirror Pyramid Panoramic Camera. The resulting set of panoramic images can be used for stereo analysis or stereo viewing.
Figure 1. Variation in the physical camera position with viewpoint position. (a) Viewpoint is centered within four-sided pyramid, shown with the corresponding eight camera positions. (b) Translated viewpoints marked A, B, and C are shown with correspondingly marked physical camera positions. (c) Same as (b), but for a mirror pyramid having a large number of faces, to show how the shape changes as the viewpoint translates.
Prototype
Figure 2. The experimental two-view panoramic camera shown with only four sensors. (a) A schematic showing the double mirror pyramid with the four physical cameras associated with two faces, two per face, each corresponding to one of the two viewpoints. (b) The physical implementation with four sensors (conventional cameras) whose reflections can be seen in the two associated mirror faces.
Results
Figure 3. (a) The four images captured by the four sensors (conventional cameras). (b) The mosaic of the four images. This mosaic will be 360 degrees wide when all physical cameras are present instead of just the four used here.
High-resolution panoramic capture is highly desirable in many applications such as immersive virtual environments, tele-conferencing, surveillance, and robot navigation. In addition, a single viewpoint for all viewing directions, a large depth of field (omni-focus), and real-time acquisition are desired in some imaging applications (e.g., 3D reconstruction and rendering). The FOV of a conventional camera is limited by the size of its sensor and the focal length of its lens. For example, a typical 16 mm lens with a 2/3-inch CCD sensor has a 30 deg x 23 deg FOV. The number of pixels on the sensor (640 x 480 for an NTSC camera) determines the resolution. The depth of field is limited and is determined by various imaging parameters such as the aperture, the focal length, and the scene location of the object.
Many approaches have been presented to achieve various subsets of these properties: wide FOV, high resolution, large depth-of-field, a single viewpoint, and real-time acquisition. Among these, mirror-pyramid (MP)-based camera systems offer a promising approach to capturing high-resolution, wide-FOV panoramas as they provide single-viewpoint images at video rate. Such systems use planar mirrors assembled in pyramid or prism shapes, and as many cameras as the number of mirror faces, each located and oriented to capture the part of the scene reflected off one of the flat mirror faces. Images from the individual cameras are concatenated to yield a 360-degree wide panoramic image. Compared to designs using parabolic or hyperbolic mirrors, flat mirrors are easier to design and produce, and they introduce minimal optical aberrations.
We have developed a double-mirror-pyramid design that doubles the size of the visual field of single-pyramid-based systems. Using this prototype, we have developed methods for optimally choosing the parameters of MP-based camera systems (e.g., camera placement, pyramid geometry, sensor usage, and uniformity of image resolution) and for evaluating the resulting image quality.
Overview of panoramic imaging
The existing methods of capturing panoramas fall into one of two categories: dioptric methods, where only refractive elements (lenses) are employed, and catadioptric methods, where a combination of reflective and refractive components is used. Typical dioptric systems include: the camera cluster method, where multiple cameras point in different directions to cover a wide FOV; the fisheye method, where a single camera acquires a wide-FOV image through a fisheye lens; and the rotating camera method, where a conventional camera pans to generate mosaics, or a camera with a non-frontal, tilted sensor pans around its viewpoint to acquire panoramic, omni-focused images. The catadioptric methods include sensors in which a single camera captures the scene as reflected off a single curved mirror, and sensors in which multiple cameras image the scene as reflected off planar mirror surfaces.
The dioptric camera clusters are capable of capturing high-resolution panoramas at video rate. However, due to physical constraints the cameras in these clusters typically do not share a unique viewpoint, which makes it difficult or even impossible to mosaic the individual images into a true panoramic view, although apparent continuity across images may be achieved by ad hoc image blending. Sensors with a fisheye lens are able to deliver large-FOV images at video rate, but suffer from low resolution, irreversible distortion for close-by objects, and non-unique viewpoints for different portions of the FOV. Rotating cameras deliver high-resolution, wide-FOV images via panning, as well as omni-focus when used in conjunction with non-frontal imaging, but they have a limited vertical FOV. Furthermore, because they capture different parts of the FOV sequentially, moving objects may be imaged incorrectly.
The catadioptric sensors that use a parabolic- or a hyperbolic-mirror to map an omni-directional view onto a single sensor are able to achieve a single viewpoint at video rate, but the resolution of the acquired image is limited to that of the sensor used and varies significantly with the viewing direction across the visual fields. Analogous to the dioptric case, this resolution problem can be alleviated partially by replacing the simultaneous imaging of the entire FOV with panning and sequential imaging of its parts, followed by mosaicing the images, at the expense of video rate. Another category of the catadioptric sensors employs a number of planar mirrors assembled in the shape of right mirror-pyramids, together with as many cameras as the number of pyramid faces. Each of these cameras, capturing the part of the scene reflected off one of the faces, is located and oriented strategically such that the mirror images of their viewpoints are co-located at a single point inside the pyramid. Effectively, this creates a virtual camera that captures wide-FOV, high-resolution panorama at video rate.
Proposed Double-Mirror-Pyramid Camera
The main challenge in constructing a panoramic camera from multiple sensors is to co-locate the entrance pupils of the multiple cameras so that adjacent cameras cover contiguous FOVs without obstructing the view of other cameras or their own. Nalwa first used a right mirror pyramid (MP) formed from planar mirrors for this purpose. He reported an implementation using a 4-sided right pyramid and 4 cameras. The pyramid stands on its horizontal base. Each triangular face forms a 45-degree angle with the base. The cameras are positioned in the horizontal plane that contains the pyramid's vertex such that the entrance pupil of each camera is equidistant from the vertex and the mirror images of the entrance pupils coincide at a common point, C, on the axis of the pyramid. The cameras are pointed vertically downward at the pyramid faces such that the virtual optical axes of the cameras are all contained in a plane parallel to the pyramid base, effectively viewing the world horizontally outward from the common virtual viewpoint C.
The vertical dimension of the panoramic FOV in each of the aforementioned cases is the same as that of each of the cameras used; only their horizontal FOVs are concatenated to obtain a wider, panoramic view. We have developed a panoramic design that uses a dual mirror-pyramid (DMP), formed by joining two mirror pyramids such that their bases coincide (Fig. 2), together with two layers of camera clusters. This DMP-based design thus doubles the vertical FOV while preserving the ability to acquire high-resolution panoramic images from an apparent single viewpoint at video rate.
Standard imaging sensors have a limited dynamic range and hence are sensitive to only a part of the illumination range present in a natural scene. The dynamic range can be improved by acquiring multiple images of the same scene under different exposure settings and then combining them. We have developed a multi-sensor camera design, called the Split-Aperture Camera, to acquire registered, multiple images of a scene, at different exposures, from a single viewpoint, and at video rate. The resulting multiple-exposure images are then used to construct a high dynamic range image.
There are three main steps to composing the high dynamic range image. First, we transform the intensities recorded by each sensor into the actual sensor irradiance values. This mapping can be obtained using radiometric calibration techniques applicable to normal cameras. Second, since the irradiance at corresponding points on different sensors can differ, we need a correction factor to represent a scene point by a unique value independent of the sensor on which it is imaged. This factor is spatially variant and differs from sensor to sensor. The third and last step is fusing the intensity-transformed images into a single high dynamic range mosaic. For every pixel on a canvas (an empty image of the same dimensions as any of the sensors), we have a set of transformed intensity values, one from each of the images. We discard the values from images in which those locations were either saturated or clipped. Since the values that are not discarded may be noisy, we combine them to obtain the final value; a simplified sketch of this fusion step is given below.
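A hedged sketch of the fusion step for already-registered frames (the radiometric response is assumed linear here, and the weighting scheme is a common choice rather than the exact one used in this work):

```python
import numpy as np

def fuse_exposures(images, transmittances):
    """Combine registered single-channel exposures (values in [0, 255]) taken
    through ND filters with the given transmittances (e.g. 1, 0.5, 0.25) into
    one irradiance map, discarding clipped pixels and trusting mid-range values."""
    acc = np.zeros(images[0].shape, dtype=float)
    wsum = np.zeros_like(acc)
    for img, t in zip(images, transmittances):
        z = img.astype(float)
        valid = (z > 5) & (z < 250)                      # drop clipped / saturated pixels
        irr = z / t                                      # undo the filter (exposure) factor
        w = valid * (1.0 - np.abs(z - 127.5) / 127.5)    # weight mid-range values most
        acc += w * irr
        wsum += w
    return acc / np.maximum(wsum, 1e-8)
```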
To build a prototype, we used a pyramid beam-splitter, which is the corner of a mirror cube (a 3-face pyramid); the three sensors used were Sony CCB-ME37 monochrome board cameras. Glass cube corners are commercially available and marketed as solid retroreflectors. The triangular surfaces were coated with a metallic coating such as aluminum to obtain the three desired reflective surfaces. We designed a special lens whose aperture is located just behind the lens, and aligned the pyramid with the optical axis, with its tip at the center of the aperture. The positions of the sensors were carefully calibrated to ensure that all the sensors were normal to the split optical axes and equidistant from the tip of the pyramid, and that the images from all sensors overlaid exactly on top of each other. This arrangement ensures that the distribution of light across the three sensors is independent of the 3D coordinates of the objects being imaged. We used thin-film neutral density filters with transmittances of 1, 0.5, and 0.25 in front of the sensors to obtain images capturing different parts of the illumination range. The frame grabber used was a Matrox multichannel board capable of synchronizing and capturing three channels simultaneously. Figure 1 below shows the prototype built. Figures 2 and 3 show samples of images acquired by the prototype.
A visual depth sensor composed of a single camera and a transparent plate rotating about the optical axis in front of the camera. Depth is estimated from the disparities of scene points observed in multiple images acquired while viewing through the rotating plate.
We propose a novel depth sensing imaging system composed of a single camera along with a parallel planar plate rotating about the optical axis of the camera. Compared with conventional stereo systems, only one camera is utilized to capture stereo pairs, which can improve the accuracy of correspondence detection as is the case for any single camera stereo systems. The proposed system is able to capture multiple images by simply rotating the plate. With multiple stereo pairs, it is possible to obtain precise depth estimates, without encountering matching ambiguity problems, even for objects with low texture. Given the large number of resulting images, in conjunction with the estimated depth map, we show that the proposed system is also capable of acquiring super-resolution images. Finally, experimental results on reconstructing 3D structures and recovering high-resolution textures are presented.
Stereo is one of the most widely explored sources of scene depth. Stereo usually refers to spatial stereo, wherein two cameras, separated by a baseline, simultaneously capture stereo image pairs. The spatial disparity in the images of the same scene feature then captures the feature’s depth. More than two cameras can also be used to capture the disparity information across multiple views.
An alternative to such spatial stereo is temporal stereo wherein a single camera is relocated to the same set of viewpoints to capture the two or more images sequentially. This loses the parallel imaging capability and therefore the ability to handle fast moving objects, but it reduces the number of cameras used to one as well as eliminates the need for photometric calibration of the camera if needed for feature/intensity matching of stereo images.
We propose a single-camera depth estimation system that captures a large number of images, after the incoming light from a scene point has been deflected in a manner that depends on the object depth. The deflection mechanism is the passage of light through a thick glass plate placed in front of an ordinary camera at a certain orientation to the optical axis. In order to estimate depth, at least two images captured under two different plate poses are necessary. However, a larger number of images, containing redundant depth information, are acquired by changing the plate orientation sequentially, e.g. by rotating and/or reorienting the plate. Rotating the plate at a fixed orientation with respect to the optical axis is a mechanically convenient way of obtaining a large number of depth-coded images, followed, if desired, by more plate orientations and rotations, to acquire more images, as illustrated in Fig 1. An analysis of the correspondences among the set of images yields depth estimates. High quality, dense depth estimation distinguishes this new camera from other single camera stereo systems.
Depth-Dependent Pixel Displacement
It is well known from optics that a light ray passing through a planar plate undergoes a lateral displacement. For a camera-plate system, this phenomenon appears as pixel shifts in the image. For a given object point, we assume that the point is shifted by the plate within a plane parallel to the image plane.
The displacements of a pixel depend differently on changes in the plate tilt angle and rotation angle. Using both changes leads to robust estimation because the lack of sensitivity to one type of change is complemented by higher sensitivity to the other. Depth estimation is done by minimizing a cost function which is overdetermined due to the large number of images available. The cost function includes the error due to the fit of the result to the multiple estimates available at a single pixel, and the local roughness of the surface at the pixel.
The dimensions of the plate and its tilt angle, among other parameters, have to be selected properly to achieve good depth estimates. The plate parameters affect the amount of depth-sensitive displacement in the image. A larger refractive index and a thicker plate yield a larger displacement, which corresponds to a higher depth resolution. However, a larger refractive index and a thicker plate also introduce larger chromatic aberration, which may degrade image quality. The tilt angle of the plate also affects the displacement: the larger the tilt angle, the larger the pixel displacement for the same depth. However, as the tilt angle increases, the required size of the plate increases dramatically; the standard plane-parallel-plate relation sketched below quantifies these trade-offs.
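For intuition, the textbook lateral shift of a ray crossing a plane-parallel plate of thickness t and refractive index n at incidence angle theta is d = t * sin(theta) * (1 - cos(theta) / sqrt(n^2 - sin^2(theta))); this is offered only as a rough guide to the trade-offs above, not as the system's full image-space displacement model:

```python
import numpy as np

def plate_lateral_shift(theta_deg, t_mm, n):
    """Lateral displacement (mm) of a ray crossing a plane-parallel plate of
    thickness t_mm and refractive index n at incidence angle theta_deg."""
    th = np.radians(theta_deg)
    return t_mm * np.sin(th) * (1.0 - np.cos(th) / np.sqrt(n**2 - np.sin(th)**2))

# roughly the prototype plate described below: 13 mm thick, n ~ 1.5, tilted ~45 degrees
print(plate_lateral_shift(45.0, 13.0, 1.5))   # about 4.3 mm
```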
Prototype
We have developed a prototype with the rotating plate oriented at a fixed tilt angle. It requires calibration only once, in the beginning, after which the system acquires images continuously without the need for any further calibration. This is achieved by rotating the plate continuously about the x, y and/or z axes, acquiring images at video rate. The images of a scene point, acquired during rotation, lie along a 4-dimensional manifold in the space defined by the object depth in the viewing direction, and the three rotation angles. To estimate object depth in a given direction, we simply find the best estimate of the intersection of the line in that direction with the manifold samples obtained from the images acquired during rotation. Each pixel in each image thus contributes to the number of manifold samples. Since the locations of pixels in different images define a denser array of directions than possible with a single (orientation) image grid, the system yields depth estimates along a direction array denser than the original images.
We used a Sony DFX900 color camera equipped with a 16 mm lens, and an ordinary glass plate with a thickness of 13 mm and a refractive index of around 1.5. The plate was mounted on a rotary stage and tilted approximately 45 degrees with respect to the image plane. A total of 36 plate poses, evenly distributed over 360 degrees, were calibrated and used to recover depth and synthesize the super-resolution images.
We tested our system with two objects: house and monster head. The experimental results of the house object are shown in Fig 2. Fig. 2a is one of the input images taken through the plate, Fig. 2b shows the recovered depth map, and Fig. 2c shows a new view generated from the reconstructed house model.
Fig 3 shows the experimental results for the monster head object. Figs. 3a-3c show one of the input images, the recovered depth map, and a new view of the reconstructed head model, respectively. This result indicates that the camera performs well for objects having a limited amount of texture.
The objective of this work is to automate human visual inspection and improve the efficiency and effectiveness of required train inspections. Video data is acquired using CCD cameras and outdoor lighting techniques to record the side, front, and/or bottom of trains passing by our custom trackside camera systems. These videos are then decomposed to produce panoramic images that are analyzed to detect defects in specific components and evaluate their compliance to the federal regulations.
Intermodal (IM) trains are typically the fastest freight trains operated in North America. The aerodynamic characteristics of many of these trains are often relatively poor, resulting in high fuel consumption. However, considerable variation in fuel efficiency is possible depending on how the loads are placed on railcars in the train. Consequently, substantial potential fuel savings are possible if more attention is paid to the loading configuration of trains.

A wayside machine vision (MV) system was developed to automatically scan passing IM trains and assess their aerodynamic efficiency. MV algorithms are used to analyse these images and to detect and measure gaps between loads. In order to make use of the data, a scoring system was developed based on two attributes: the aerodynamic coefficient and slot efficiency. The aerodynamic coefficient is calculated using the Aerodynamic Subroutine of the train energy model. Slot efficiency represents the difference between the actual and ideal loading configurations given the particular set of railcars in the train. This system can provide IM terminal managers feedback on loading performance for trains and can be integrated into the software support systems used for loading assignment.
One machine vision system researched by the University of Illinois at Urbana-Champaign (UIUC), under the sponsorship of the AAR's Technology Scanning Strategic Research Initiative, demonstrates that machine vision can be used for the inspection of railcars. The UIUC prototype system inspects wheel, truck, and brake system components using automated, machine vision-based methods. Machine vision-based wheel and brake shoe inspection systems are already, or will soon become, commercially available. Inspection of the other truck components will soon follow. Further work by UIUC will focus on other aspects of car inspection, particularly in the area of safety appliances.
Before North American trains depart a terminal or rail yard, many aspects of the cars and locomotives undergo inspection, including their safety appliances. Safety appliances are handholds, ladders, and other objects that serve as the interface between humans and railcars during transportation. The current inspection process is primarily visual and is labor intensive, redundant, and generally lacks "memory" of the inspection results. The effectiveness and efficiency of safety appliance inspections can be improved by the use of machine vision technology. This paper describes a research project investigating the use of machine vision technology to perform railcar safety appliance inspections. Thus far, algorithms have been developed that can detect deformed ladders, handholds, and brake wheels on open-top gondolas and hoppers. Visual learning is being used to teach the algorithm the differences between safety appliance defects that require immediate repair and other types of deformation that do not. Field experiments under natural and artificial lighting have been conducted to determine the optimal illumination needed for proper functioning of the algorithms. Future work will consist of developing algorithms that can identify deformed safety appliances across the spectrum of North American railcars under varied environmental conditions. The final product will be a wayside inspection system capable of inspecting safety appliance defects on passing railcars.
Locomotive and rolling stock condition is an important element of railway safety, reliability, and service quality. Traditionally, railroads have monitored equipment condition by conducting regular inspections. Over the past several decades, certain inspection tasks have been automated using technologies that have reduced the cost and increased the effectiveness of the inspection. However, the inspection of most aspects of railroad equipment undercarriages is conducted manually. This is a labor-intensive process and often requires that railroad equipment be taken out of service so that the inspection can be conducted on specially equipped pit tracks. Machine vision technology offers the potential to quickly and automatically monitor, assess, record, and transmit detailed information regarding the condition of railroad equipment undercarriages and components. Multi-spectral imaging (e.g., visible and infrared range) allows recording of both physical and thermal condition and correlation between the two. This allows precise comparison of any undercarriage component or element of interest using templates, and/or references to previously recorded inspections, of the same or similar equipment, and enables other analyses such as trending of component wear and detection of progressive increases in component heating over time. Multi-spectrum machine vision algorithms can determine if a component is outside its normal operating range and if anomalies are correlated across different spectra. The inspection tasks of particular interest for this investigation include disc brake condition, bearing performance, and detection of incipient failure of electrical systems such as locomotive traction motors and air conditioning units. We are also investigating detection of damaged or missing components and foreign objects.
Results
A relatively coarse-level analysis could be used for detection of foreign or missing components by comparing the railcar panorama to a car-level template. Car-specific visible and thermal templates were created and stored for each piece of equipment to allow for unique configurations due to repairs and other modifications. To detect anomalies, block-level correlation was performed between the recorded panorama (Figure 1b) and the railcar's template (Figure 1a). An example of anomaly detection is illustrated by the junction box present in the panorama for the car but not in the template. Areas of mismatch have a low correlation and the junction box is clearly evident (dark area in Figure 1c). Detection of thermal anomalies is computed in a similar manner using differences in color values between the thermal panorama and the car’s thermal template (Figures 1d – 1f).
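A hedged sketch of the block-level correlation described above (a grayscale panorama and template are assumed to be pre-aligned and the same size; the block size and scoring are illustrative, not the deployed implementation):

```python
import numpy as np

def block_anomaly_map(panorama, template, block=32):
    """Normalized cross-correlation of corresponding blocks; scores near 1
    indicate a match, while low or negative scores flag anomalies such as
    added, missing, or altered components."""
    H, W = panorama.shape
    score = np.zeros((H // block, W // block))
    for by in range(H // block):
        for bx in range(W // block):
            ys, xs = by * block, bx * block
            p = panorama[ys:ys + block, xs:xs + block].astype(float).ravel()
            t = template[ys:ys + block, xs:xs + block].astype(float).ravel()
            p -= p.mean()
            t -= t.mean()
            denom = np.linalg.norm(p) * np.linalg.norm(t)
            score[by, bx] = (p @ t) / denom if denom > 0 else 0.0
    return score
```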
To ensure the safe and efficient operation of the approximately 1.6 million freight cars (wagons) in the North American railroad network, the United States Department of Transportation (USDOT), Federal Railroad Administration (FRA) requires periodic inspection of railcars to detect structural damage and defects. Railcar structural underframe components, including the centre sill, side sills, and crossbearers, are subject to fatigue cracking due to periodic and/or cyclic loading during service, as well as other forms of damage. The current railcar inspection process is time-consuming and relies heavily on the acuity, knowledge, skill, and endurance of qualified inspection personnel to detect these defects. Consequently, technologies are under development to automate critical inspection tasks to improve their efficiency and effectiveness. Research was conducted to determine the feasibility of inspecting railcar underframe components using machine vision technology. A digital video system was developed to record images of railcar underframes, and computer software was developed to identify components and assess their condition. Tests of the image recording system were conducted at several railroad maintenance facilities. The images collected there were used to develop several types of machine vision algorithms to analyse images of railcar underframes and assess the condition of certain structural components. The results suggest that machine vision technology, in conjunction with other automated systems and preventive maintenance strategies, has the potential to provide comprehensive and objective information pertaining to railcar underframe component condition, thereby improving the utilization of inspection and repair resources and increasing safety and network efficiency.
North American railroads and the United States Department of Transportation (US DOT) Federal Railroad Administration (FRA) require periodic inspection of railway infrastructure to ensure safe railway operation. The primary focus of this research is the inspection of North American Class I railroad mainlines and sidings, as these generally experience the highest traffic densities. Tracks that are subjected to heavy-haul traffic necessitate frequent inspection and have more intensive maintenance requirements, leaving railroads with less time to accomplish these inspections. To improve the current (primarily manual) inspection process in an efficient and cost-effective manner, machine vision technology can be developed and used as a robust alternative. The machine vision system consists of a video acquisition system for recording digital images, a mobile rail platform for allowing video capture in the field, and custom-designed algorithms to identify defects and symptomatic conditions from these images. Results of previously developed inspection algorithms have shown good reliability in identifying cut spikes and rail anchors from field-acquired videos. The focus of this paper is the development of machine vision algorithms designed to recognize turnout components and inspect them for defects. In order to prioritize which turnout components are the most critical for the safe operation of trains, a risk-based analysis of the FRA accident database has been performed. From these prioritized turnout components, those that are best suited for vision-based inspection are being further investigated. Future analysis of the machine vision system results, in conjunction with a comparison of historical data, will enhance the ability for longer-term proactive assessment of the health of the track system and its components.
Results
This detection and measurement algorithm is demonstrated on the point of the switch shown in Figure 1.
Experimental results show an accuracy of 100% for base-of-rail localization using the lateral view, and 76% for the over-the-rail view. In the case of spikes, both views resulted in 71% accuracy for spike head localization. For individual components, 93% of the ties were detected without false positives in the lateral view. For the over-the-rail view, all ties were detected; however, 8% of the detected ties were false positives. Finally, 100% of the anchors were detected (100% recall), but only 80% of the objects detected as "anchors" were in fact anchors (80% precision).
3D Surfaces, Reflectance, and Illumination from Stereo, Appearance, and Shading in multiple views.
In this paper, we propose a new photometric stereo method for estimating diffuse reflection and surface normals from color images. Using the dichromatic reflection model, we introduce surface chromaticity as a matching invariant for photometric stereo, which serves as the foundation of the theory of this paper. An extremely simple and robust method for separating reflection components is proposed based on this invariant. Our separation method differs from most previous methods, which either assume dependencies among pixels or require segmentation. We also show that a linear relationship between the image color and the surface normal can be obtained based on this invariant. This linear relationship turns the surface normal estimation problem into a linear system that can be solved exactly or via least-squares optimization; the classical Lambertian least-squares baseline is sketched below. We present experiments on both synthetic and real images, which demonstrate the effectiveness of our method.
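For reference, the classical Lambertian least-squares baseline that the linear system above generalizes (this is the standard formulation, not the chromaticity-based algorithm of the paper):

```python
import numpy as np

def lambertian_photometric_stereo(I, L):
    """Classical photometric stereo: given intensities I (n_lights x n_pixels)
    under known light directions L (n_lights x 3), solve L @ G = I in the
    least-squares sense, where each column of G is albedo times the unit normal."""
    G = np.linalg.lstsq(L, I, rcond=None)[0]        # 3 x n_pixels
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-12)
    return normals, albedo
```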
Based on a new correspondence matching invariant called Illumination Chromaticity Constancy, we present a new solution for illumination chromaticity estimation, correspondence searching, and specularity removal. Using as few as two images, the core of our method is the computation of a vote distribution for a number of illumination chromaticity hypotheses via correspondence matching. The hypothesis with the highest vote is accepted as correct. The estimated illumination chromaticity is then used together with the new matching invariant to match highlights, which inherently provides solutions for correspondence searching and specularity removal. Our method differs from previous approaches, which treat these vision problems separately and generally require that specular highlights be detected in a pre-processing step. Also, our method uses more images than previous illumination chromaticity estimation methods, which increases its robustness as more inputs/constraints are used. Experimental results on both synthetic and real images demonstrate the effectiveness of the proposed method.
Non-Lambertian surfaces cause difficulties for many stereo systems. We describe methods to recover both 3D surface shape and reflectance models of an object from multiple views. We use an iterative method, based on multi-view shape from shading, to estimate the shape and reflectance models. The estimated models can be used to render the object from new views and under new lighting conditions using computer graphics techniques.
In this paper, we consider the problem of stereo matching using loopy belief propagation. Unlike previous methods, which operate at the original spatial resolution, we hierarchically reduce the disparity search range. With the number of disparity levels fixed at the original resolution, our method solves the message updating problem in time linear in the number of pixels in the image and requires only constant memory space. Specifically, for an 800 × 600 image with 300 disparities, our message updating method is about 30× faster (1.5 seconds) than the standard method, and requires only about 0.6% of the memory (9 MB). Also, our algorithm lends itself to a parallel implementation: our GPU implementation (NVIDIA GeForce 8800 GTX) is about 10× faster than our CPU implementation. Given the trend toward higher-resolution images, the ability to handle a large number of disparity levels as efficiently as a small one makes our method future-proof. In addition to the computational and memory advantages, our method is straightforward to implement.
In this paper, we propose a simple but effective image transform, called the epipolar distance transform, for matching low-texture regions. It converts image intensity values to a relative location inside a planar segment along the epipolar line, such that pixels in low-texture regions become distinguishable. We theoretically prove that the transform is affine invariant, so the transformed images can be used directly for stereo matching. Any existing stereo algorithm can be applied directly to the transformed images to improve reconstruction accuracy for low-texture regions. Results on real indoor and outdoor images demonstrate the effectiveness of the proposed transform for matching low-texture regions, and for keypoint detection and description in low-texture scenes. Our experimental results on Middlebury images also demonstrate the robustness of the transform for highly textured scenes. A key advantage of the proposed transform is its low computational complexity: on a MacBook Air laptop with a 1.8 GHz Core i7 processor, it runs at about 9 frames per second on a VGA-sized image.
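For rectified image pairs the epipolar lines are image rows, so the essence of the transform can be illustrated by replacing each pixel value with its relative position inside the segment it belongs to along its row. The sketch below is a minimal illustration that takes a precomputed binary segment mask as input (how the segments are obtained is an assumption here, not the method described above).

```python
import numpy as np

def epipolar_distance_transform(mask):
    """Map each foreground pixel to its relative position within its horizontal
    run, a proxy for position inside a planar segment along the epipolar line
    of a rectified pair. Output values lie in [0, 1]."""
    H, W = mask.shape
    out = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        x = 0
        while x < W:
            if mask[y, x]:
                start = x
                while x < W and mask[y, x]:
                    x += 1
                length = x - start
                if length > 1:
                    out[y, start:x] = np.arange(length) / (length - 1)
            else:
                x += 1
    return out
```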
A robust stereo matching algorithm that uses kernel representations of the probability density functions (pdfs) of the sources generating the stereoscopic images. Matching is done either in a maximum-likelihood framework or via correlation in the pdf domain, with an MRF prior modeling the disparity function.
Given multiple images of a scene, taken from multiple cameras and different viewpoints, find the 3D depth map and surfaces
Given multiple calibrated pictures of a real world object captured from different viewpoints, reconstruct a three-dimensional model of the object.
3D Surface Orientation from Texture Gradient computed in a single image of a homogeneously textured surface.
In an image containing texture elements at a range of scales, detect all elements, their relative locations and mutual containment relationships.
OBJECTIVE
Given a slanted view of a planar, homogeneously textured surface, estimate the surface slant from the image texture gradient.
APPROACH
(1) Identification of image texture elements (texels) that correspond to surface texture elements is itself a significant problem since the scale at which surface detail is captured varies continuously with the three-dimensional distance, and therefore across the image texture. The image texels may exhibit a systematic variation in a priori unknown properties, e.g., size, density or contrast. All regions are potential texels. Consequently, all regions, of all sizes and contrasts, are detected at each location and treated as candidate texels.
(2) The estimation of surface slope (slant and tilt) is integrated with the process of selecting texels from among the large number of detected regions. For any given slant and tilt, only those regions across the image are interpreted as texels whose properties, e.g., area distribution, match the spatial distribution predicted by the hypothesized slant and tilt, and which occupy the largest fraction of the image. The image area is used as a measure of the extent of support for the particular slant-tilt pair.
(3) All possible slant-tilt values are considered as hypotheses, and a search is conducted to find the hypothesis with the most support. This is the estimated surface orientation.
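A toy sketch of the hypothesis-and-support search in step (3). Candidate texels are represented by their image centroids and areas; the predicted area variation across the image uses a simple first-order foreshortening model chosen here only for illustration, and the support of a slant-tilt hypothesis is the total image area of the candidate texels consistent with the prediction.

```python
import numpy as np

def estimate_slant_tilt(centroids, areas, focal=500.0, tol=0.3):
    """Exhaustive search over slant-tilt hypotheses.

    centroids : (N, 2) candidate texel centers (x, y), relative to image center.
    areas     : (N,) candidate texel image areas.
    Returns the (slant, tilt) pair, in degrees, with the largest support.
    """
    best = (0.0, 0.0, -1.0)                                  # slant, tilt, support
    for slant in np.deg2rad(np.arange(5, 85, 5)):
        for tilt in np.deg2rad(np.arange(0, 360, 10)):
            # Predicted relative texel area under a simple foreshortening model.
            u = centroids[:, 0] * np.cos(tilt) + centroids[:, 1] * np.sin(tilt)
            pred = np.maximum(1.0 - np.tan(slant) * u / focal, 1e-3) ** 3
            ratio = areas / pred
            scale = np.median(ratio)                         # unknown texel size
            inliers = np.abs(ratio / scale - 1.0) < tol
            support = areas[inliers].sum()                   # image area as support
            if support > best[2]:
                best = (slant, tilt, support)
    return np.rad2deg(best[0]), np.rad2deg(best[1])
```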
Our goal is to obtain a noise-free, high resolution (HR) image, from an observed, noisy, low resolution (LR) image. The conventional approach of preprocessing the image with a denoising algorithm, followed by applying a super-resolution (SR) algorithm, has an important limitation: Along with noise, some high frequency content of the image (particularly textural detail) is invariably lost during the denoising step. This "denoising loss" restricts the performance of the subsequent SR step, wherein the challenge is to synthesize such textural details. In this work, we show that high frequency content in the noisy image (which is ordinarily removed by denoising algorithms) can be effectively used to obtain the missing textural details in the HR domain. We show that part-recovery and part-synthesis of textures through our algorithm yields HR images that are visually more pleasing than those obtained using the conventional processing pipeline.
Super-resolution of a single image is a highly ill-posed problem since the number of high resolution pixels to be estimated far exceeds the number of low resolution pixels available. Therefore, appropriate regularization or priors play an important role in the quality of results. In this line of work, we propose a family of methods for learning transform domain priors for the single-image super-resolution problem. Our algorithms are able to better synthesize high frequency textural details as compared to the state-of-the-art.
Classical optical flow objective functions consist of a data term that enforces brightness constancy, and a spatial smoothing term that encourages smooth flow fields. Structural information from images has conventionally been used to design more robust regularizers that prevent oversmoothing of motion discontinuities. In this line of work, we exploit image structure in a more detailed manner than the conventionally used gradient filters. We not only propose better regularization terms using this structural information, but also show how to incorporate it into the data term to improve results.
We propose an image sharpening method that automatically optimizes the perceived sharpness of an image. Image sharpness is defined in terms of the one-dimensional contrast across region boundaries. Regions are automatically extracted at all natural scales present in the image, which are themselves identified automatically. Human judgments are collected and used to learn a function that determines the best sharpening parameter values at an image location as a function of certain local image properties. Experimental results demonstrate the adaptive nature and superior performance of our approach.
In this paper, we propose a new method to construct an edge-preserving filter whose response is very similar to that of the bilateral filter. The bilateral filter is a normalized convolution in which the weighting for each pixel is determined by its spatial distance from the center pixel and its relative difference in intensity range. The spatial and range weighting functions are typically Gaussian in the literature. In this paper, we cast the filtering problem as a vector-mapping approximation and solve it using a support vector machine (SVM). Each pixel is represented as a feature vector comprising the exponentiation of the pixel intensity, the corresponding spatially filtered response, and their products. The mapping function is learned via SVM regression using the feature vectors and the corresponding bilateral filtered values from the training image. The major computation involved is the spatial filtering of the exponentiated original image, which is invariant to the filter size given that an IIR O(1) solution is available for the spatial filtering kernel. To our knowledge, this is the first learning-based O(1) bilateral filtering method. Unlike previous O(1) methods, our method is valid for both low and high range-variance Gaussians, and its computational complexity is independent of the range variance value. Our method is also the fastest O(1) bilateral filtering method yet developed. Besides, our method allows varying range variance values, based on which we propose a new bilateral filtering method that avoids the over-smoothing or under-smoothing artifacts of the traditional bilateral filter.
The analysis of periodic or repetitive motions is useful in many applications, both in the natural and the man-made world. An important example is the recognition of human and animal activities. Existing methods for the analysis of periodic motions first extract motion trajectories, e.g. via correlation, or feature point matching. We present a new approach, which takes advantage of both the frequency and spatial information of the video. The 2D spatial Fourier transform is applied to each frame, and time-frequency distributions are then used to estimate the time-varying object motions. Thus, multiple periodic trajectories are extracted and their periods are estimated. The period information is finally used to segment the periodically moving objects. Unlike existing methods, our approach estimates multiple periodicities simultaneously, it is robust to deviations from strictly periodic motion, and estimates periodicities superposed on translations. Experiments with synthetic and real sequences display the capabilities and limitations of this approach. Supplementary material is provided, showing the video sequences used in the experiments.
We present a new approach for the identification and segmentation of objects undergoing periodic motion. Our method combines maximum-likelihood estimation of the period with segmentation of the moving objects using correlation of image segments over the estimated period of interest. Correlation provides the best locations of the moving objects in each frame, while a segmentation tree provides the image segments at multiple resolutions. We ensure that child regions and their parent regions have the same period estimates. We show results of testing our method on real videos.
We propose a new bilateral filtering algorithm with computational complexity invariant to the filter kernel size, so-called O(1) or constant time in the literature. By showing that a bilateral filter can be decomposed into a number of constant time spatial filters, our method yields a new class of constant time bilateral filters that can have arbitrary spatial and arbitrary range kernels. In contrast, the currently available constant time algorithm requires the use of specific spatial or specific range kernels. Also, our algorithm lends itself to a parallel implementation, leading to the first real-time O(1) algorithm that we know of. Meanwhile, our algorithm yields higher quality results since we effectively quantize the range function instead of quantizing both the range function and the input image. Empirical experiments show that our algorithm not only gives higher PSNR, but is about 10× faster than the state-of-the-art. It also has a small memory footprint, needing only 2% of the memory required by the state-of-the-art to obtain the same quality as the exact filter on 8-bit images. We also show that our algorithm can be easily extended for O(1) median filtering. Our bilateral filtering algorithm was tested in a number of applications, including HD video conferencing, video abstraction, highlight removal, and multi-focus imaging.
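The central idea, that a bilateral filter can be assembled from a small number of constant-time spatial filters evaluated at quantized range levels, can be sketched as follows. The sketch uses a Gaussian spatial kernel via `scipy.ndimage.gaussian_filter` and linear interpolation across range levels; it illustrates the decomposition rather than reproducing the exact construction above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bilateral_o1(img, sigma_s=4.0, sigma_r=0.1, n_levels=8):
    """Approximate bilateral filter built from constant-time spatial filters.

    img : 2-D float array with values in [0, 1]. For each sampled intensity
    level, a spatially filtered image is computed; the output interpolates
    between the two nearest levels at each pixel."""
    levels = np.linspace(img.min(), img.max(), n_levels)
    filtered = []
    for k in levels:
        w = np.exp(-((img - k) ** 2) / (2 * sigma_r ** 2))    # range weights at level k
        num = gaussian_filter(w * img, sigma_s)               # constant-time spatial filter
        den = gaussian_filter(w, sigma_s)
        filtered.append(num / np.maximum(den, 1e-8))
    filtered = np.stack(filtered)                             # (n_levels, H, W)

    # Linear interpolation between the two nearest range levels per pixel.
    idx = np.clip(np.searchsorted(levels, img) - 1, 0, n_levels - 2)
    lo, hi = levels[idx], levels[idx + 1]
    t = (img - lo) / np.maximum(hi - lo, 1e-8)
    rows, cols = np.indices(img.shape)
    return (1 - t) * filtered[idx, rows, cols] + t * filtered[idx + 1, rows, cols]
```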
We present a new upsampling method to enhance the spatial resolution of depth images. Given a low-resolution depth image from an active depth sensor and a potentially high-resolution color image from a passive RGB camera, we formulate the upsampling as an adaptive cost aggregation problem and solve it using the bilateral filter. The formulation synergistically combines the median and bilateral filters, thus better preserving depth edges and being more robust to noise. Numerical and visual evaluations on a total of 37 Middlebury data sets demonstrate the effectiveness of our method. A real-time high-resolution depth capturing system has also been developed using a commercial active depth sensor, based on the proposed upsampling method.
In this paper, we propose a simple but effective shadow removal method using a single input image. We first derive a 2-D intrinsic image from a single RGB camera image based solely on colors, particularly chromaticity. We next present a method to recover a 3-D intrinsic image based on bilateral filtering and the 2-D intrinsic image. The luminance contrast in regions with similar surface reflectance, which is due to geometry and illumination variations, is effectively reduced in the derived 3-D intrinsic image, while the contrast in regions with different surface reflectance is preserved. However, the intrinsic image contains incorrect luminance values. To obtain the correct luminance, we decompose both the input RGB image and the intrinsic image into a base layer and a detail layer. We obtain a shadow-free image by combining the base layer from the input RGB image and the detail layer from the intrinsic image, such that the details of the intrinsic image are transferred to the input RGB image, from which the correct luminance values can be obtained. Unlike previous methods, the presented technique is fully automatic and does not require shadow detection.
In this paper, we propose a simple but effective specular highlight removal method using a single input image. Our method is based on a key observation: the maximum fraction of the diffuse color component (the so-called maximum diffuse chromaticity in the literature) changes smoothly within local patches of color images. Using this property, we can estimate the maximum diffuse chromaticity values of the specular pixels by directly applying a low-pass filter to the maximum fraction of the color components of the original image, such that the maximum diffuse chromaticity values are propagated from the diffuse pixels to the specular pixels. The diffuse color at each pixel can then be computed as a nonlinear function of the estimated maximum diffuse chromaticity. Our method can be directly extended to multi-colored surfaces if edge-preserving filters (e.g., the bilateral filter) are used, so that the smoothing is guided by the maximum diffuse chromaticity; however, the maximum diffuse chromaticity is itself the quantity to be estimated, so we present an approximation and demonstrate its effectiveness. Recent developments in fast bilateral filtering enable our method to run over 200× faster than the state-of-the-art on a standard CPU, which differentiates our method from previous work.
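Under the dichromatic model with white-balanced illumination, once the maximum diffuse chromaticity is known the specular component follows in closed form. Below is a minimal sketch of that pipeline in which a plain Gaussian low-pass stands in for the edge-preserving filtering of the maximum chromaticity; the closed-form step is a standard dichromatic-model identity, and the white-illumination assumption and the handling of near-achromatic pixels are simplifications made here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def remove_highlights(img, sigma=10.0):
    """Single-image specular highlight removal sketch.

    img : (H, W, 3) float RGB image in [0, 1], assumed white-balanced so the
    specular (illumination) chromaticity is (1/3, 1/3, 1/3). Near-achromatic
    pixels (max chromaticity close to 1/3) are a degenerate case for this model.
    """
    i_sum = img.sum(axis=2) + 1e-8
    i_max = img.max(axis=2)
    sigma_max = i_max / i_sum                      # max chromaticity per pixel

    # Estimate the max *diffuse* chromaticity by smoothing the max chromaticity:
    # diffuse pixels satisfy sigma_max == lambda, specular pixels fall below it.
    lam = gaussian_filter(sigma_max, sigma)
    lam = np.maximum(lam, sigma_max)               # lambda must not be below sigma_max
    lam = np.clip(lam, 1.0 / 3.0 + 1e-2, 1.0)

    # Dichromatic identity: m_s = (I_max - lambda * I_sum) / (1/3 - lambda).
    m_s = np.clip((i_max - lam * i_sum) / (1.0 / 3.0 - lam), 0.0, None)
    diffuse = img - (m_s / 3.0)[..., None]         # subtract the specular part
    return np.clip(diffuse, 0.0, 1.0)
```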
The second type of human-computer interface is a free-hand-sketch based interface for image editing (e.g., moving, size-scaling, or color-transforming parts of an image). The sketches drawn by the user on top of the image serve as a natural way of specifying an image part and the editing operation (e.g., move, deletion) to be performed.
To assist humans in referring to specific parts of an image and performing desired operations on these parts through natural, interpersonal-like communication, e.g., by freely drawing sketches over the image that denote specific editing operations such as move, expand and delete.
Compressive sampling (CS) is aimed at acquiring a signal or image from data deemed insufficient by the Nyquist/Shannon sampling theorem. Its main idea is to recover a signal from limited measurements by exploiting the prior knowledge that the signal is sparse or compressible in some domain. In this paper, we propose a CS approach using a new total-variation measure, TVL1, which enforces sparsity and directional continuity in the gradient domain. Our TVL1-based CS is characterized by the following attributes. First, by minimizing the ℓ1-norm of partial gradients, it can achieve greater accuracy than the widely used TVL1L2-based CS. Second, named hybrid CS, it combines low-resolution sampling (LRS) and random sampling (RS), motivated by our observation that these two sampling methods are complementary.
We explore new algorithms for computer vision based on multilinear algebra. Firstly, we learn the expression subspace and person subspace from a corpus of images based on Higher-Order Singular Value Decomposition (HOSVD), and investigate their applications in facial expression synthesis, face recognition and facial expression recognition. Secondly, we explore new algorithms for image ensembles/video representation and recognition using tensor rank-one decomposition and tensor rank-R approximation.
The goal of this project is to explore new algorithms based on multilinear algebra for representation of multidimensional data in computer vision.
New algorithms for facial image analysis based on multilinear algebra. We learn the expression subspace and person subspace from a corpus of images based on Higher-Order Singular Value Decomposition, and investigate their applications in facial expression synthesis, face recognition and facial expression recognition.
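A minimal sketch of the HOSVD factorization, for a hypothetical data tensor organized as persons × expressions × pixels: truncated SVDs of the mode unfoldings give the person, expression and pixel subspaces, and multiplying the tensor by the transposed bases gives the core tensor. The tensor sizes and ranks below are placeholders.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the other modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated higher-order SVD: T ~= core x_1 U1 x_2 U2 ... x_N UN."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                   # mode-n subspace basis
    core = T
    for mode, U in enumerate(factors):
        core = np.tensordot(core, U.conj(), axes=([mode], [0]))
        core = np.moveaxis(core, -1, mode)         # restore the mode ordering
    return core, factors

# Example: a persons x expressions x pixels tensor (hypothetical sizes).
T = np.random.rand(10, 7, 64 * 64)
core, (U_person, U_expression, U_pixel) = hosvd(T, ranks=[10, 7, 50])
```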
Video frames are often dropped during compression at very low bit rates. At the decoder, a missing-frame interpolation method synthesizes the dropped frames. We propose a two-step motion estimation method for the interpolation. More specifically, the coarse motion vector field is refined at the decoder using mesh-based motion estimation instead of computationally intensive dense motion estimation. We propose a framework for detecting and utilizing local motion boundaries in terms of an explicit model. Motion boundaries are modeled using edge detection and the Hough transform. The motion of the occluding side is represented by an affine mapping, and newly appearing regions are also detected. Each pixel is interpolated adaptively according to its type: moving, static, or disoccluded. The resulting quality of the interpolated images is consistently better than block-based interpolation and is comparable to optical flow methods.
Results
We implemented the proposed algorithm using various QCIF (176×144) video sequences. Fig. 1(a) shows a typical motion vector field resulting from the block matching algorithm (BMA). The block size for the BMA is set to 16×16. The number of triangles in the uniform mesh was chosen to match the number of blocks used by the BMA. After fixing the locations of the node points, the algorithm obtains one motion vector per node in order to estimate affine parameters per triangle. Fig. 1(b) shows the improved motion vector field. Fig. 2 compares the block-based method, the pel-recursive method, and the proposed method. The frame numbers in Fig. 2 refer to the interpolated frames; that is, the PSNRs of the transmitted frames are not included in the graphs. The proposed algorithm shows performance close to the pel-recursive approach in terms of PSNR. Fig. 3 compares two interpolated images obtained using the block-based method and the proposed method, respectively.
In order to apply a multidimensional linear transform over an arbitrarily shaped support, the usual practice is to fill out the support to a hypercube by zero padding. The problem that we tackle is: how do we redefine the transform over an arbitrarily shaped region, suited to a given application? We present a novel iterative approach to define any multidimensional linear transform over an arbitrary shape, given that we know its definition over a hypercube. The proposed solution is extensible to all possible shapes of support and adaptable to the needs of a particular application. Applications of this method include segmentation-based image compression, shape-based video encoding, and region merging.
Discrete linear transforms in two (or more) dimensions are in most cases defined over a rectangular (hypercubic) support. The usual practice when we want to apply the transform over an arbitrarily shaped support is to fill out the rest of the support with zeros to make up the rectangle (hypercube) and then use the natural definition of the transform over a rectangle (hypercube). This is an extension of the 1-D case, where we fill out an arbitrary-length data set with zeros to form a data set of length 2^n, either to increase the computational speed (through FFTs for Fourier transforms) or to satisfy the definition of the transform (in the case of dyadic wavelets). This, however, does not lead to a satisfactory definition of the linear transform in two or more dimensions for many applications. An example can be used to illustrate this point: the Fourier transform of a function that is constant on a circular support in 2-D is a jinc. As can be seen from the figure, the magnitude of the Fourier coefficients does not have any relation to the smoothness of the function, which is constant within its support.
The above discussion leads us to the following question: What should be the values attributed to the sample points which lie within the rectangle but not within the support of the function? The answer is evidently not unique and depends upon the application. With each possible choice of the values for the pixels which lie outside the support but within the rectangular (hypercubic) region, we can associate a possible function-transform pair. The aim of the proposed scheme is to algorithmically constrain the choice of the possible function-transform pairs in such a way as to lead to the optimal choice of the function-transform pair for the particular application under consideration.
The applications that we consider assume that we have a smooth 2-D function defined on an arbitrarily shaped, connected support; we would like to define the free pixels (the pixels within the rectangular region but outside the support) so as to minimize the high frequency content in the Fourier domain. The problem is formulated in terms of a Projection onto Convex Sets formalism and incorporates a few more constraints, such as the bounded variation of the free pixels. The variation in the free pixels is controlled by a parameter Z.
Results for the Fourier transform as Z is varied are presented. The example function is a smooth region taken out of a natural image (Lena). As can be seen, an increase in Z allows more variation in the free pixels in the spatial domain, thus destroying the ad hoc shape information (the pixels within the support are unchanged in all images). Similar results may be obtained if we replace the Fourier transform with either the DCT or wavelets.
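A minimal sketch of the alternating-projection idea described above, for the case where the desired property is low high-frequency content in the Fourier domain: one projection enforces the known pixel values on the support, the other attenuates frequencies outside a low-pass radius. The bounded-variation constraint controlled by Z is omitted here for brevity, and the cutoff radius is an illustrative parameter.

```python
import numpy as np

def pocs_fill(img, support, cutoff=0.25, n_iter=50):
    """Assign values to the free pixels (outside `support`) so that the padded
    image has low high-frequency content, via alternating projections.

    img     : 2-D array whose values are valid on the support.
    support : boolean mask, True where the original function is defined.
    cutoff  : low-pass radius as a fraction of the Nyquist frequency.
    """
    H, W = img.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    lowpass = (fy ** 2 + fx ** 2) <= (0.5 * cutoff) ** 2

    x = img * support                           # free pixels start at zero
    for _ in range(n_iter):
        X = np.fft.fft2(x)
        x = np.fft.ifft2(X * lowpass).real      # projection onto the band-limited set
        x[support] = img[support]               # projection onto the data-consistency set
    return x
```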
Related Publications:
A new method for digital image watermarking which does not require the original image for watermark detection is presented. Assuming a transform-domain spread spectrum watermarking scheme, it is important to add the watermark to select coefficients with significant image energy in the transform domain in order to ensure non-erasability of the watermark. Previous methods that did not use the original image in the detection process could not selectively add the watermark to the significant coefficients, since the locations of such coefficients can change due to image manipulations. Because watermark verification typically consists of a correlation process that is extremely sensitive to the relative order in which the watermark coefficients are placed within the image, such changes in the locations of the watermarked coefficients were unacceptable. We present a scheme which overcomes this problem of "order sensitivity". Advantages of the proposed method include (i) improved resistance to attacks on the watermark, (ii) implicit visual masking utilizing the time-frequency localization property of the wavelet transform, and (iii) a robust definition for the threshold which validates the watermark. We present results comparing our method with previous techniques, which clearly validate our claims.
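For context, here is a minimal sketch of a baseline additive spread-spectrum embed/detect loop in the DCT domain with blind, correlation-based detection. It illustrates why such detection is sensitive to the positions of the marked coefficients (the positions re-identified at detection time may shift after image manipulations), which is precisely the problem the scheme above addresses; it is not the order-insensitive wavelet method itself, and the parameter values are placeholders.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed(img, key, n_coeffs=1000, alpha=0.05):
    """Embed a pseudo-random watermark in the largest-magnitude DCT coefficients."""
    rng = np.random.default_rng(key)
    w = rng.standard_normal(n_coeffs)
    C = dctn(img.astype(np.float64), norm='ortho')
    flat = C.ravel()
    idx = np.argsort(np.abs(flat))[::-1][1:n_coeffs + 1]   # skip the largest (DC-like) term
    flat[idx] += alpha * np.abs(flat[idx]) * w              # multiplicative-style embedding
    return idctn(C, norm='ortho')

def detect(img, key, n_coeffs=1000, threshold=3.0):
    """Blind detection: correlate the watermark with the re-identified coefficients."""
    rng = np.random.default_rng(key)
    w = rng.standard_normal(n_coeffs)
    C = dctn(img.astype(np.float64), norm='ortho')
    flat = C.ravel()
    idx = np.argsort(np.abs(flat))[::-1][1:n_coeffs + 1]   # positions may shift after attacks
    v = flat[idx]
    score = (v * w).sum() / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-12)
    return score * np.sqrt(n_coeffs) > threshold
```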
Related Publications:
In order to apply a multi-dimensional linear transform, over an arbitrarily shaped support, the usual practice is to fill out the support to a hypercube by zero padding. This does not however yield a satisfactory definition for transforms in two or more dimensions. The problem that we tackle is: how do we redefine the transform over an arbitrary shaped region suited to a given application? We present a novel iterative approach to define any multi-dimensional linear transform over an arbitrary shape given that we know its definition over a hypercube. The proposed solution is (1) extensible to all possible shapes of support (whether connected or unconnected) (2) adaptable to the needs of a particular application. We also present results for the Fourier transform, for a specific adaptation of the general definition of the transform which is suitable for compression or segmentation algorithms.
Related Publications:
We present a novel improvement to existing schemes for abrupt shot change detection. Existing schemes declare a shot change whenever the frame-to-frame histogram difference (FFD) exceeds a particular threshold. In such an approach, a high threshold results in few false alarms but many missed detections, while a low threshold decreases the missed detections at the expense of increasing the false alarms. We attribute this situation to the fact that the FFD cannot be reliably used as the sole indicator of a shot change. In the proposed method, a two-step shot detection strategy is used which selectively uses a likelihood ratio (computed directly from the frames and not from the histograms) to confirm the presence of a shot change. This two-step check increases the probability of detection without increasing the probability of false alarm.
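A minimal sketch of the two-step strategy: a loose FFD threshold proposes candidate shot changes, and a likelihood-ratio test computed directly from frame statistics confirms them. The particular likelihood-ratio form below is one standard choice from the shot-detection literature and is used here only for illustration, as are the thresholds.

```python
import numpy as np

def ffd(f1, f2, bins=64):
    """Normalized frame-to-frame histogram difference (frames assumed in [0, 255])."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 255))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 255))
    return np.abs(h1 - h2).sum() / f1.size

def likelihood_ratio(f1, f2):
    """A common likelihood-ratio statistic computed from frame means and variances."""
    m1, m2 = f1.mean(), f2.mean()
    s1, s2 = f1.var() + 1e-8, f2.var() + 1e-8
    return ((s1 + s2) / 2 + ((m1 - m2) / 2) ** 2) ** 2 / (s1 * s2)

def detect_cuts(frames, t_ffd=0.3, t_lr=1.5):
    """frames: list of 2-D grayscale arrays. Returns indices of detected cuts."""
    cuts = []
    for i in range(1, len(frames)):
        if ffd(frames[i - 1], frames[i]) > t_ffd:                   # cheap candidate test
            if likelihood_ratio(frames[i - 1], frames[i]) > t_lr:   # confirmation step
                cuts.append(i)
    return cuts
```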
Related Publications:
This work proposes a computationally fast scheme for denoising a video sequence. Temporal processing is done separately from spatial processing, and the two are then combined to produce the denoised frame. The temporal redundancy is exploited using a scalar-state 1D Kalman filter, and a novel way of estimating the variance of the state noise from the noisy frames is proposed. The spatial redundancy is exploited using an adaptive edge-preserving Wiener filter. The two estimates are then combined by simple averaging to obtain the final denoised frame. Simulation results for the foreman, trevor and susie sequences show an improvement of 6 to 8 dB in PSNR over the noisy frames at input PSNRs of 28 and 24 dB.
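A minimal sketch of the pipeline, assuming grayscale frames and a known measurement-noise variance: a per-pixel scalar Kalman filter exploits temporal redundancy, `scipy.signal.wiener` provides the adaptive spatial estimate, and the two are averaged. The state-noise variance is fixed here rather than estimated from the noisy frames as proposed above.

```python
import numpy as np
from scipy.signal import wiener

def denoise_video(frames, meas_var=100.0, state_var=25.0):
    """frames: (T, H, W) noisy grayscale video. Returns denoised frames."""
    T, H, W = frames.shape
    x = frames[0].astype(np.float64)           # per-pixel Kalman state
    P = np.full((H, W), meas_var)              # state error variance
    out = np.empty((T, H, W), dtype=np.float64)
    for t in range(T):
        z = frames[t].astype(np.float64)
        P = P + state_var                       # predict
        K = P / (P + meas_var)                  # Kalman gain
        x = x + K * (z - x)                     # temporal estimate
        P = (1.0 - K) * P
        spatial = wiener(z, mysize=5)           # adaptive spatial Wiener estimate
        out[t] = 0.5 * (x + spatial)            # combine by simple averaging
    return out
```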
Related Publications:
The application fields are (i) representations of appropriate granularity for best image compression, (ii) appropriately rescaled representations for image magnification or super-resolution, and (iii) smoothing for image quality restoration through structure-preserving denoising.
This work addresses the problem of denoising images corrupted by additive white Gaussian noise (AWGN). The Wiener filter is optimal in minimizing the mean-square error under suitable assumptions of stationarity of the signal statistics. Locally, such assumptions are reasonable, as in the adaptive realization of the Wiener filter, whose performance is among the best known to date. Over the last few years, there has been much interest in threshold-based denoising schemes. In this paper we present a novel framework for denoising signals from their compact representations in multiple domains. Each domain captures certain signal characteristics better than the others. We define confidence sets around the data in each domain and find sparse estimates that lie in the intersection of these sets, using a POCS algorithm. Simulations demonstrate the superior nature of the reconstruction (both in terms of mean-square error and perceptual quality) in comparison to the adaptive Wiener filter.
Results
The following images compare the performance of our scheme to that of Donoho and Johnstone’s.
Related Publications:
Resolution enhancement involves the problem of magnifying a small image to several times its size while avoiding blurring, ringing and other artifacts. We tackle the problem of magnifying an image without incurring the edge enhancement effects and other structural distortions characteristic of classical image magnification techniques. We propose an iterative algorithm based on a Projections onto Convex Sets (POCS) formalization.
Application fields are Image Magnification and Phase Retrieval. Classical image magnification methods include bilinear, bicubic and FIR interpolation schemes followed by a sharpening method like unsharp masking. Such interpolation schemes tend to blur the images when applied indiscriminately. Unsharp masking, which involves subtracting a properly scaled Laplacian of the image from itself, produces artifacts and increases noise. More sophisticated schemes involving wavelet- or fractal-based techniques have also been proposed. Such methods perform extrapolation of the signal in either the wavelet or fractal domain, which leads to objectionable artifacts when the assumptions behind such extrapolation are violated. It may also be noted that such extrapolatory assumptions predict and actively enhance the high-frequency content within the image, thus increasing any noise present in the unmagnified image.
The proposed method starts with an initial magnified image obtained through selective interpolation followed by an iterative procedure which aims to avoid edge-related artifacts while retaining and enhancing sharpness. The initial image is a composite image formed from a base interpolation scheme in the smooth areas of the image and from a selective interpolation mechanism in the non-smooth (or edge) areas. The proposed iterative algorithm aims to find a magnified image satisfying two constraints: one of the constraints is derived from sampling theory while the other constraint reflects the confidence that we place on the initial iterate. Both the constraints are convex sets; thus, we seek a solution which is at the intersection of these two convex sets and can be obtained using the projection on convex sets (POCS) method. Starting with the initial iterate, we project alternately on the two constraints. Convergence is guaranteed since we operate within the POCS formalism.
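A minimal sketch of the two-projection iteration, assuming an integer magnification factor and simple box-average decimation as the sampling model; the selective initial interpolation and the exact constraint sets described above are not reproduced. One projection enforces consistency of the decimated estimate with the low-resolution observation, the other keeps the estimate within a tolerance of the initial iterate.

```python
import numpy as np

def pocs_magnify(lr, factor=2, delta=10.0, n_iter=30):
    """Magnify `lr` by `factor` using alternating projections.

    Constraint 1: box-average decimation of the HR estimate reproduces `lr`.
    Constraint 2: the HR estimate stays within `delta` of the initial iterate.
    """
    lr = lr.astype(np.float64)
    init = np.kron(lr, np.ones((factor, factor)))   # simple initial HR iterate
    x = init.copy()
    for _ in range(n_iter):
        # Projection onto the sampling-consistency set.
        dec = x.reshape(lr.shape[0], factor, lr.shape[1], factor).mean(axis=(1, 3))
        x = x + np.kron(lr - dec, np.ones((factor, factor)))
        # Projection onto the proximity-to-initial-iterate set.
        x = np.clip(x, init - delta, init + delta)
    return x
```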
Related Publications:
Our novel reversible image compression method employs multiscale segmentation within a computationally efficient optimization framework to obtain consistently good performance over a wide variety of images. We present new edge models that deal effectively with two issues that make such models normally unsuitable for compression applications: local applicability and large number of parameters needed for representation. Segmentation information is provided by a recent transform (1993), which we found to possess qualities making it especially suitable for compression. The final residual image is obtained using autocorrelation-based 2-D linear prediction. Different implementations providing lossless compression are presented along with results over a number of common test images. Results show that the proposed approach can be used to yield robust lossless compression, while providing consistently and significantly better results than the best possible JPEG lossless coder.
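To make the prediction step concrete, here is a toy sketch of causal 2-D linear prediction: coefficients for the west, north and north-west neighbors are fitted by least squares (equivalently, from the normal equations built on image autocorrelations), and the empirical entropy of the rounded residual indicates the attainable lossless rate. It is an illustration only, not the multiscale segmentation-based pipeline described above.

```python
import numpy as np

def prediction_residual(img):
    """Causal 2-D linear prediction residual and its empirical entropy (bits/pixel)."""
    img = img.astype(np.float64)
    target = img[1:, 1:]
    preds = np.stack([img[1:, :-1],       # west neighbor
                      img[:-1, 1:],       # north neighbor
                      img[:-1, :-1]])     # north-west neighbor
    A = preds.reshape(3, -1).T
    b = target.ravel()
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)     # least-squares predictor
    residual = np.rint(b - A @ coeffs).astype(np.int64)

    _, counts = np.unique(residual, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()
    return residual.reshape(target.shape), coeffs, entropy
```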
Results
Results show a consistent 15-20% improvement over the best possible JPEG lossless standard (see the table below). The results are invariant to the amount of detail and noise in the image. We also find that the probability distribution of the residual values is not Laplacian, as is typical of methods that do not use explicit edge modelling; it is closer to Gaussian in shape, suggesting that the residual is mostly random noise. In conclusion, we have proposed a theoretically sound lossless compression method which makes no crude approximations to the structure in the image, and we have proposed ways of representing edge models that make coding them a viable proposition in terms of compression.
The following tables give the results of applying the various implementations. All results are in bits per pixel. In (a)-(c), the interior (of a region) residual entropy and the edge residual entropy are provided in addition to the total entropy. The total entropy in all cases includes the overhead storage space.
Related Publications:
We develop a very low bit rate video compression algorithm using multiscale image segmentation for hierarchical motion compensation and residual coding. The proposed algorithm outperforms an H.261-like coder by 3 dB and H.263 version 2 by 1 dB. These gains come from the use of image segmentation and reversed motion prediction. The proposed region-based reversed motion compensation strategy regulates the size and number of regions used by pruning the multiscale segmentation of the video frames. Since the regions used for motion compensation are obtained by segmenting the previously decoded frame, the shapes of the regions need not be transmitted to the decoder. Furthermore, the hierarchical motion compensation strategy involves two stages: it refines an initial, region-level, coarse motion field to obtain a dense motion field which provides pixel-level motion vectors. The refinement procedure does not require any additional information to be transmitted. We also developed a residual coding technique for coding the displaced frame difference after segmentation-based motion compensation. Residual coding is performed using a method which exploits the fact that the energy of the residual resulting from motion compensation is concentrated in a priori predictable positions. This residual coding technique can also be used to improve the performance of coders using a block-based motion compensation strategy.
Results
We compare our coder with a generic block-based coder as used in the H.261 or H.263 standards. All performance comparisons are performed on the luminance (Y) component of the video frames. In order to make an objective comparison, we used the same quantization strategies to quantize the DCT coefficients for both coders, and the Huffman codes for motion vectors and DCT coefficients were also the same. The per-frame bit budget was held approximately fixed at 1280 bits for both coders. This corresponds to a bit rate of 9.6 kbps if every fourth frame is coded, and 38.4 kbps if all frames are coded.
We also present results comparing our residual coding scheme with the usual block-DCT based coding scheme. An overhead of 1 bit per coded block is transmitted. Such a coder always performs better than the baseline block-DCT scheme. The following figure shows the improvement (in dB PSNR) over the generic coder when the quantization step size of the AC coefficients is 16 and 32.
Related Publications:
Predictive coding is posed as a variant of the Wyner-Ziv coding, and problems in source and channel coding of video are addressed in this framework.
This project deals with scalable coding and robust Internet streaming of predictively encoded media. We frame the problem of predictive coding as a variant of the Wyner-Ziv problem in Information theory. Subsequently, LDPC based coset code constructions are used to compress the media in a scalable, error-resilient manner. In particular, we propose a video encoding algorithm that prevents the indefinite propagation of errors in predictively encoded video—a problem that has received considerable attention over the last decade. This is accomplished by periodically transmitting a small amount of additional information, termed coset information, to the decoder, as opposed to the popular approach of periodic insertion of intra-coded frames. Perhaps surprisingly, the coset information is capable of correcting for errors, without the encoder having a precise knowledge of the lost packets that resulted in the errors. In the context of real-time transmission, the proposed approach entails a minimal loss in performance over conventional encoding in the absence of channel losses, while simultaneously allowing error recovery in the event of channel losses.
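A toy scalar illustration of the coset (binning) idea underlying the LDPC-based construction: the encoder transmits only the coset index of a quantized sample, and the decoder resolves the ambiguity using whatever side information it has, such as a prediction that has drifted due to packet loss. This is why the encoder needs no precise knowledge of which packets were lost; the step size, coset count and search window below are arbitrary illustrative values.

```python
import numpy as np

def coset_encode(x, step=4.0, n_cosets=8):
    """Quantize x and transmit only its coset index (log2(n_cosets) bits)."""
    q = int(np.round(x / step))
    return q % n_cosets

def coset_decode(coset, side_info, step=4.0, n_cosets=8, search=64):
    """Among all reconstruction levels in the coset, pick the one closest to the
    decoder's side information (its possibly drifted prediction of x)."""
    q_side = int(np.round(side_info / step))
    candidates = [q for q in range(q_side - search, q_side + search + 1)
                  if q % n_cosets == coset]
    q_hat = min(candidates, key=lambda q: abs(q * step - side_info))
    return q_hat * step

# Decoding succeeds as long as the prediction drift stays below half the coset
# spacing (n_cosets * step / 2), even though the encoder never sees the drift.
x, drifted_prediction = 37.2, 31.0
print(coset_decode(coset_encode(x), drifted_prediction))   # -> 36.0
```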
Related Publications:
We consider the design of compression techniques for streaming image-based rendering (IBR) data to remote viewers. The key constraints that a compression algorithm for IBR streaming must satisfy are random access for interactivity and pre-compression. We propose a compression algorithm based on the use of Wyner-Ziv (coset) codes that satisfies these constraints. The proposed algorithm employs H.264 source compression in conjunction with LDPC coset codes to precompress the IBR data, and appropriate coset information is transmitted to the remote viewers to allow interactive view generation. Results indicate that the proposed algorithm provides good compression efficiency while allowing client interactivity and server precompression.
Related Publications:
Two-channel predictive multiple description coding is posed as a variant of the Wyner-Ziv coding problem. Practical code constructions are proposed within this framework, and the performance of the proposed codes is compared with conventional approaches for communication of a first-order Gauss-Markov source over erasure channels with independent failure probabilities. Multiple description (MD) coding of predictively coded sources is of practical interest in several multimedia applications, such as redundant storage of video/audio data and real-time video/audio telephony. A key problem associated with predictive MD coding is the occurrence of predictive mismatch. In the present paper, we pose the problem of predictive MD coding as a variant of the Wyner-Ziv decoder side-information problem. We propose an approach based on the use of coset codes for predictive MD coding, which avoids predictive mismatch without requiring restrictive channel assumptions or high latency. We specifically consider two-channel predictive MD coding of a first-order Gauss-Markov process.
Related Publications:
The Computer Vision and Robotics Lab studies a wide range of problems related to the acquisition, processing and understanding of digital images. Our research addresses fundamental questions in computer vision, image and signal processing, machine learning, as well as applications in real-world problems.