Depth Information for Light Fields

David Redkey
Lucas Pereira
cs348c -- Virtual Reality

Introduction

A light field is a new graphics format that blurs the line between artificial environments and reality. A light field can be generated by computer, or it can be captured from the real world.

In the future, it will be interesting to combine light fields with other objects. However, in order to get proper occlusion, it is necessary to determine the depth of every ray in the field. This is easily computed for computer-generated lightfields that have an internal geometric representation of a scene. However, for light fields captured from the real world, we often do not have the depth information.

Our goal was to use multi-baseline computer vision techniques to generate the depth information for a light field from the images. Once we reconstructed the depth information, we could then recombine the light field with other objects or scenes, and get proper occlusion.


Reconstructing Depth Information



Approach:

The impressive array of camera locations required to compute accurate lightfields provides us with an excellent opportunity to perform multi-baseline stereo reconstruction of depth. The fundamental assumption is that a point on a surface seen from one camera location will appear very similar in color when viewed from other locations, albeit at different pixel coordinates. If we knew with certainty which pixel in one image corresponded to which pixel in any of the other images, we could perform triangulation to determine the exact depth of the corresponding point in space. Of course, several pixels in an image may share the same color, so further constraints are required before matching can be done.

One fundamental advantage of stereo reconstruction is known as the "epi-polar line" constraint. If one traces a ray from the center of projection of one camera location through any pixel in its image plane, the projection of that ray onto the image plane of any other camera will be a line. This is because the viewing transformation is perspective, and perspective transformations map lines to lines. This means we only have to search for pixel matches along a given line in a stereo image.

Another advantage we can rely on comes from having multiple "baselines" at our disposal. "Baseline" is the computer-vision term for the displacement vector between the projection centers of two cameras. Each baseline provides a different epi-polar line through a different image, allowing better rejection of incorrect pixel matches. For example, we may be trying to stereo-match an orange pixel from camera #1 to determine its depth, and find that three orange pixels lie along the epi-polar line for camera #2. Without additional camera positions, we would be unable to choose among three equally valid depth values for the corresponding point in space. If we include a third camera, however, its epi-polar line will cross different points in the scene from the epi-polar line in image #2. Even if the new epi-polar line contains, say, four orange pixels, chances are that only one of the four will correspond to the same depth value as any of the three pixels from image #2. That shared depth value is the correct depth for the point in space.

In determining the depth of a given pixel, our algorithm selects a regular sampling pattern of inverse-depth values from the near clipping plane to infinity. By choosing regular spacing in inverse depth instead of depth, each increment of the inverse-depth term corresponds to a constant pixel increment along the epi-polar line in an image, which is a handy performance benefit. For each depth sample, we accumulate an error quantity corresponding to how different the color of the projected pixel is in each of the other frames. When we are done, we choose the depth sample with the least error as the actual depth. Here is a simplified description of the algorithm:

    For each camera frame
      For each pixel
        For all candidate depths
          For all other camera images
            Compute color difference with projection
        Choose best depth as minimizer of sum of squared error
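
To make the loop structure concrete, the sketch below shows the depth sweep for a single reference pixel. It is only an illustration of the technique, not our actual implementation: it assumes ideal pinhole cameras described by 3x4 projection matrices, nearest-pixel color lookups, and a ray direction scaled so that depth z corresponds to the point rayOrigin + z*rayDir; the names (Camera, estimateDepth, and so on) are invented for the example.

    #include <cfloat>
    #include <vector>

    struct Vec3 { double x, y, z; };

    struct Camera {
        double P[3][4];              // row-major projection matrix K [R | t]
        int width, height;
        const float* rgb;            // width * height * 3 floats

        // Project a world point; false if it falls outside the image.
        bool project(const Vec3& X, int& u, int& v) const {
            double s = P[2][0]*X.x + P[2][1]*X.y + P[2][2]*X.z + P[2][3];
            if (s <= 0.0) return false;
            u = (int)((P[0][0]*X.x + P[0][1]*X.y + P[0][2]*X.z + P[0][3]) / s + 0.5);
            v = (int)((P[1][0]*X.x + P[1][1]*X.y + P[1][2]*X.z + P[1][3]) / s + 0.5);
            return u >= 0 && u < width && v >= 0 && v < height;
        }
        const float* pixel(int u, int v) const { return rgb + 3 * (v * width + u); }
    };

    // Squared difference between two RGB colors.
    static double colorDiff2(const float* a, const float* b) {
        double d = 0.0;
        for (int c = 0; c < 3; ++c) { double e = a[c] - b[c]; d += e * e; }
        return d;
    }

    // Sweep candidate inverse depths from (nearly) zero up to 1/zNear,
    // accumulate squared color differences in the other images, and return
    // the depth whose total error is smallest.
    double estimateDepth(const Camera& ref, int u, int v,
                         const Vec3& rayOrigin, const Vec3& rayDir,
                         const std::vector<Camera>& others,
                         double zNear, int numSamples)
    {
        const float* refColor = ref.pixel(u, v);
        double bestErr = DBL_MAX, bestDepth = zNear;
        for (int i = 1; i <= numSamples; ++i) {
            double invZ = (double)i / (numSamples * zNear);   // uniform in inverse depth
            double z = 1.0 / invZ;
            Vec3 X = { rayOrigin.x + z * rayDir.x,
                       rayOrigin.y + z * rayDir.y,
                       rayOrigin.z + z * rayDir.z };
            double err = 0.0;
            for (const Camera& cam : others) {
                int pu, pv;
                if (cam.project(X, pu, pv))
                    err += colorDiff2(refColor, cam.pixel(pu, pv));
            }
            if (err < bestErr) { bestErr = err; bestDepth = z; }
        }
        return bestDepth;
    }
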
Most previous depth reconstruction efforts have tried to match texture "windows" around pixels, instead of just single pixel values, to provide more confidence in matching. The problem with this approach is that widely spaced cameras will greatly distort the appearance of even small texture windows in other images, making matches more difficult to determine. Our code supports this mode of operation, but our hope was that the vast array of cameras available to us would allow matching to be done by checking only single pixels, thus avoiding the texture warping problem.


Challenges

The biggest challenge to any stereo-matching algorithm is view-dependent shading. If the apparent color of a point in space changes depending on the viewing angle, then it becomes very difficult to determine that the point as seen in one image actually corresponds to the same point seen in another. Specular highlights are the classic example of this. The depressed reconstruction of Buddha's chest is a good example of how specular highlights can fool a stereo-matching algorithm: since the highlights on his chest tended to follow the camera position, equal-color rays from different cameras tended to converge behind him.

Another difficulty faced by stereo-matching is occlusion. If a point in space is visible to one camera, but another camera's view of it is blocked by an obstruction, stereo matching cannot be done with the blocked camera. We avoid corrupting our error data with obstructed cameras by rejecting all input from a given camera if none of the pixels on its epi-polar line are similar to the target pixel. Because of the 2-dimensional array of camera locations available to us, we are able to determine depth fairly well, even for points that are invisible from large regions of space. The possibility of corruption when false matches are present is still a serious problem.
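
A minimal sketch of this rejection rule, assuming the squared color differences for each camera and candidate depth have already been gathered into diffs[camera][depth]; the similarity threshold is an illustrative tuning constant, not a value from our code:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sum per-depth error over cameras, skipping any camera whose best match
    // along its epi-polar line is still poor (it is probably occluded).
    std::vector<double> accumulateErrors(const std::vector< std::vector<double> >& diffs,
                                         double similarityThreshold)
    {
        std::vector<double> error(diffs.empty() ? 0 : diffs[0].size(), 0.0);
        for (std::size_t cam = 0; cam < diffs.size(); ++cam) {
            double best = *std::min_element(diffs[cam].begin(), diffs[cam].end());
            if (best > similarityThreshold)
                continue;                        // no plausible match anywhere: reject this camera
            for (std::size_t d = 0; d < diffs[cam].size(); ++d)
                error[d] += diffs[cam][d];       // otherwise include its squared differences
        }
        return error;
    }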

Extreme texture frequencies are another problem for stereo matching. Low-frequency regions (regions of roughly constant color) are very difficult to reconstruct precisely, because it is likely that a range of depth values will produce color matches. High-frequency textures are subject to aliasing: it is possible that a pixel in one image will have no corresponding pixel in another image. An example of this is our algorithm's difficulty with the wall behind Buddha. This problem is especially apparent in real-world scenes if the cameras are not calibrated perfectly, making the epi-polar lines inexact.

Finally, depth discontinuities are a problem, because pixels along a silhouette are often blends of foreground and background colors. Such a pixel is likely to have no match in any other image, or worse, if the background is uniform, false matches will be created as cameras follow the moving silhouette.



Our algorithm works best on scenes like this, where the object has a texture and few specular highlights.


Our algorithm does fairly well on this object, even though it only has two colors. However, due to sparse spacing in the u-v plane for this particular light field (~15 degrees per image), the lower right face has some aliasing artifacts.


This scene demonstrates some of the things that cause confusion in our algorithm. The background wall texture has a high frequency component, so that there is very little coherence between images. The Buddha and the table have very low frequency textures, which make it difficult to calculate an exact depth. Finally, the Buddha has a specular highlight that moves across his chest, and disrupts our algorithm.

Future Directions:

There are several possible directions of future exploration with our multi-baseline stereo-matcher. One possibility is to vary the conditions of pixel matching in response to local texture properties. Another is a multi-pass algorithm that assigns a confidence level to each pixel's depth estimate. Once depth has been determined with high confidence for some rays, the matching of other rays could avoid using information from those rays when they are known to terminate in front of or behind the candidate intersection point. Another avenue to explore involves "active" approaches that project light onto the scene to aid in the matching of features. While color information may be lost in this way, regions of low-frequency texture or view-dependent shading could be reconstructed with much better accuracy.

Compositing Light Fields With Other Renderings

After we have reconstructed the Z depth, we can use this information to composite the light field with other objects. Our current approach is to transform the light field Z values into the current viewing coordinate system, and then compare the Z values with those already in the Z-buffer. From this we can set the alpha blending values, and copy the lightfield image and Z values onto the screen.


This shows the results of combining a GL-rendered object (the gray cube) with a lightfield object (the colored cube). We are using the Z-values computed by our depth-reconstruction algorithm.

Generating the Light Field Image

For each ray in the light field, we store a depth value measured along the normal of the s-t plane: the distance from the u-v plane to the object surface, perpendicular to the s-t plane. Storing the depth this way makes it efficient to transform the distance into any viewing direction.

For each ray, the lightfield data structure stores the color information. We added the Z data to the same structure. Thus when the lightfield code assembles the rays for a particular view, the Z values get passed in, too. The RGBA colors create a color image, and the Z values create a corresponding Z image. At this point, the Z values are still in world coordinates, perpendicular to the s-t plane.
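
Conceptually, each ray sample in the augmented structure then carries something like the following (an illustrative layout, not the lightfield code's actual packing):

    // One ray sample after adding depth: the RGBA color the light field already
    // stored, plus the distance described above (perpendicular to the s-t plane).
    struct RaySample {
        unsigned char r, g, b, a;   // color, as in the original light field
        float         z;            // distance from the u-v plane to the surface
    };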

Generating the Geometry-Based Image

We don't have to do anything special to a geometry-based image to combine it with a light field. We just render it into the color and Z buffers, as normal.

When we're done rendering the scene, we use lrectread() to copy the Z-buffer data back into memory, so that we can do Z-comparisons in software.
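
The readback amounts to a few IRIS GL calls; a minimal sketch, assuming readsource(SRC_ZBUFFER) selects the Z-buffer as the pixel source and that the zData array is large enough to hold the window:

    #include <gl/gl.h>

    /* Copy the Z-buffer for a width x height window back into zData so that
       the depth comparisons can be done in software. */
    void readZBuffer(long width, long height, unsigned long zData[])
    {
        readsource(SRC_ZBUFFER);                        /* read from the Z-buffer     */
        lrectread(0, 0, width - 1, height - 1, zData);  /* ... into our own array     */
        readsource(SRC_AUTO);                           /* restore the default source */
    }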

Transforming the Light Field Z

In order to composite the two images, we need to map their Z-values into the same coordinate system. We chose to map the light field Z-values into the current viewer coordinate system.

For each ray in the light field image, we compute the vector Vp from the intersection of the ray with the s-t plane to the viewer. We dot this with the normalized viewing direction to get the projection of Vp along the viewing direction. We then multiply the ray's Z-value by the ratio (projection of Vp along the viewing direction) / (distance of the viewer from the s-t plane) to get the correct Z-value for the ray along the current viewing direction.
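
In code, the per-ray transformation is just a dot product and a scale; a sketch of the ratio described above, with illustrative names:

    struct Vec3 { double x, y, z; };

    static double dot(const Vec3& a, const Vec3& b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // zStored:      the ray's depth, measured perpendicular to the s-t plane.
    // stHit:        the point where the ray meets the s-t plane.
    // eye, viewDir: the viewer position and normalized viewing direction.
    // eyeToStPlane: the distance of the viewer from the s-t plane.
    double viewSpaceZ(double zStored, const Vec3& stHit,
                      const Vec3& eye, const Vec3& viewDir,
                      double eyeToStPlane)
    {
        Vec3 Vp = { eye.x - stHit.x, eye.y - stHit.y, eye.z - stHit.z };
        double proj = dot(Vp, viewDir);           // projection of Vp on the viewing direction
        return zStored * (proj / eyeToStPlane);   // Z-value along the current viewing direction
    }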

Note that this essentially involves a matrix multiplication for every ray (pixel) in the light field, which can really slow down display performance. It might be more efficient to map the geometry-based Z-values into the lightfield's coordinate system, rather than our current approach. The advantage is that the matrix multiplication could be folded into the projection matrix, so it wouldn't add any extra work. The major disadvantage is that this requires the geometry renderer to know about the lightfield parameters.

Compositing

Once we have the Z-values in the same coordinate system, we can composite the two images together. We compare the Z-values, and determine which image is in front. We set the alpha blend values in the light field image to do the proper occlusion. If the two surfaces intersect within the pixel, then we linearly interpolate the Z values to determine the most likely intersection location. We use this information to set an in-between alpha blending.
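
For the in-between case, one way to compute the fractional alpha is the Z-interpolation style of Duff's compositing paper, sketched below for a pair of adjacent pixels along a scanline; the names and the pairing of adjacent pixels are illustrative rather than taken from our code.

    // zLight* are the transformed lightfield depths and zGeom* the Z-buffer
    // values at this pixel (0) and the next pixel along the scanline (1);
    // larger values are assumed to be farther away.  The return value is the
    // fraction of the pixel covered by the lightfield surface.
    double lightfieldAlpha(double zLight0, double zGeom0,
                           double zLight1, double zGeom1)
    {
        double d0 = zGeom0 - zLight0;             // positive where the lightfield is in front
        double d1 = zGeom1 - zLight1;
        if (d0 >= 0.0 && d1 >= 0.0) return 1.0;   // lightfield in front across the pixel
        if (d0 <  0.0 && d1 <  0.0) return 0.0;   // geometry in front across the pixel
        double t = d0 / (d0 - d1);                // where the surfaces cross (0..1)
        return (d0 >= 0.0) ? t : 1.0 - t;         // fraction covered by the lightfield
    }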

When we finish generating the alpha values, we copy the image onto the screen. The alpha values control which pixels get modified, and how much. Then we write the new Z-values into the Z-buffer, so that more images can be composited, if desired.


This is another view of the composite image. This view shows an artifact that appears around the outside edges of lightfields. The multi-colored cube was rendered anti-aliased, so that it fades to black at the edges. These "partially black" pixels are matched by our reconstruction algorithm, and so they have the same depth as the object. Thus they partially occlude the cube behind the lightfield, and create a dark "fringe" around the edge of the lightfield object.


Buddha with a cube in his lap, and a cube on his head. Even with Buddha's specular reflections, we were able to reconstruct the depth information reasonably well in the areas where there was enough detail, such as his abdomen and his head.


References

Duff, Tom, Compositing 3-D Rendered Images, Siggraph 1985, Computer Graphics, vol. 19, no. 3, 1985.

Levoy, Marc and Pat Hanrahan, Light Field Rendering, Siggraph 1996.

Okutomi, Masatoshi and Takeo Kanade, A Multiple-Baseline Stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 4, April 1993.


Relevant Projects

Virtualized Reality: Concepts and Early Results, Presented at IEEE Workshop on the Representation of Visual Scenes, Boston, June 24, 1995.

The High Performance Computing Graphics Project, Carnegie Mellon.

The Stanford Light Field Project.