Real-time transport of 3-D from the camera to the screen

3-D stereoscopic signals for TV or digital cinema consist of a Left eye (Le) and Right eye (Re) pair. The signals are very similar, with the subtle differences carrying the depth information. The two signals need to be captured, transported, recorded, synchronized, processed, coded, decoded and displayed as nearly identically as possible. If they are not, the differences induced in the signals will create artificial and distorted depth cues, with disturbing results for the viewer.

Existing standards and procedures for video signal processing do not take into account this important requirement. They deal only with single streams, without regard for relative performance between pairs. Special equipment will be needed to address this issue.

Depth information

Depth information is coded as relatively small horizontal differences between the positions of objects in the Le and Re images. Much of the depth information has a disparity of no more than 1 percent of the picture width. Perceived depth is, therefore, sensitive to small errors in position, from whatever source.
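As a rough illustration of the scale involved, the sketch below converts the 1 percent figure into pixels. The 1920-pixel picture width and the 2-pixel positional error are assumptions chosen for the example, not values from the article.

```python
def disparity_pixels(picture_width_px: int, fraction_of_width: float) -> float:
    """Horizontal Le/Re offset, in pixels, for a disparity given as a fraction of picture width."""
    return picture_width_px * fraction_of_width


# Assumed: a 1920-pixel-wide HD raster
depth_range_px = disparity_pixels(1920, 0.01)

# Assumed: a 2-pixel positional error introduced in one channel only
error_px = 2
error_share = error_px / depth_range_px

print(f"1% of picture width = {depth_range_px:.1f} pixels")
print(f"a {error_px}-pixel error is {error_share:.0%} of that range")
```

On these assumptions, an error of only a couple of pixels in one channel is already a sizeable fraction of the disparity range that carries much of the depth information.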

This is in contrast to the relative insensitivity of the viewer to geometry errors. The widespread acceptance of incorrect aspect-ratio settings, such as 4:3 pictures stretched to fill a 16:9 screen with a geometric error of 33 percent, is an extreme example of this insensitivity. Even within professional environments, the common use until recently of scanning electron beams for capture and display meant that geometric accuracy better than 1 percent was difficult to achieve, and such errors caused no great distress. So geometry, a parameter for which the broadcast chain has long tolerated large absolute errors, becomes highly sensitive to matching errors between the 3-D Le and Re channels.

3-D capture

A 3-D camera typically consists of a pair of identical camera bodies and lenses, mounted with a controllable horizontal offset. The cameras also have a variable relative angle, so that the plane at which their fields of view converge is controllable. For all but the smallest cameras, the horizontal offset required for good stereoscopy is smaller than the width of each camera, so they cannot simply be set side by side on a jig. The smaller interaxial distance is obtained by placing a half-silvered mirror at 45 degrees, with one camera viewing the scene through the mirror and the other viewing the reflected scene. The reflection inverts the image seen by the second camera, so its output is flipped electronically to correct the orientation. (See Figure 1.)

Camera matching

For side-by-side camera setups, the cameras must be accurately matched for geometry. This requires lens matching across all zoom settings. Figures 2 and 3 show the artificial depth information that will be introduced if there are geometric differences between the two cameras, even if they are nominally co-sited (i.e., with the interaxial distance set to zero).
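A simple model shows why lens matching matters even with zero interaxial distance. Assume, hypothetically, that the Re lens magnifies 0.5 percent more than the Le lens and that both images share the same optical centre; every point is then displaced away from the picture centre in proportion to its distance from it. The function name and the 1920x1080 picture are assumptions for the example.

```python
def zoom_mismatch_disparity(x_px, y_px, magnification_ratio):
    """Artificial (horizontal, vertical) Re-minus-Le offset for co-sited cameras whose
    magnifications differ by magnification_ratio. Coordinates are pixels from the
    picture centre; lens distortion and interaxial offset are ignored."""
    error = magnification_ratio - 1.0
    return error * x_px, error * y_px


# Assumed: Re lens magnifies 0.5% more than Le; 1920x1080 picture
for x in (-960, 0, 960):
    dx, dy = zoom_mismatch_disparity(x, 540, 1.005)
    print(f"x = {x:+5d} px: horizontal {dx:+.2f} px, vertical {dy:+.2f} px")
```

The horizontal error changes sign across the picture, so a flat scene at a single depth would appear tilted in depth, and the vertical component is a difference with no natural counterpart in human vision.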

For mirror rig cameras, any asymmetrical distortions, even if they are the same for both cameras, will cause mismatch after one image is flipped. (See Figure 4.)
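A one-dimensional model along the flip axis illustrates why. The polynomial displacement model and its coefficients are hypothetical, and positions are normalized to the range -1 to 1 about the optical centre. The direct camera records a scene point at u + d(u); the mirror camera sees the same point at -u, records it at -u + d(-u), and the electronic flip returns it to u - d(-u). The residual error is therefore d(u) + d(-u): odd (centre-symmetric) terms cancel, while even (asymmetric) terms double.

```python
def displacement(u, terms):
    """Lens displacement at normalized position u (-1..1 about the optical centre).
    `terms` maps polynomial power -> coefficient; a hypothetical distortion model."""
    return sum(coeff * u ** power for power, coeff in terms.items())


def flip_residual(u, terms):
    """Position difference between the direct camera and the flipped mirror camera,
    assuming both lenses have exactly the same distortion."""
    direct = u + displacement(u, terms)        # camera looking through the mirror
    reflected = -u + displacement(-u, terms)   # mirror camera sees the point at -u
    flipped = -reflected                       # electronic flip restores orientation
    return direct - flipped


symmetric_lens = {3: 0.01}   # odd term: distortion symmetric about the centre
asymmetric_lens = {2: 0.01}  # even term: stands in for an asymmetric distortion

for u in (-0.5, 0.5, 1.0):
    print(u, flip_residual(u, symmetric_lens), flip_residual(u, asymmetric_lens))
```

Identical lenses are not enough: only the symmetric part of their distortion survives the flip unchanged.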

There is also the risk of timing disparity caused by a vertical image flip in one camera signal. (See Figure 5.)
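The article does not detail the mechanism behind Figure 5. One way a vertical flip can produce a timing disparity, assuming both sensors capture line by line from top to bottom, spread evenly over a full frame period (as with a scanned or rolling-shutter sensor), is that the inverted camera captures the bottom of the scene first. The sketch estimates the resulting capture-time difference for each scene row; the 1080-line, 60 Hz figures and the function name are assumptions.

```python
def capture_skew_ms(scene_row, lines, frame_period_ms):
    """Capture-time difference (mirror camera minus direct camera) for one scene row.
    Assumes both sensors read out line by line, top to bottom, evenly over a full
    frame period, and that the mirror camera's sensor sees the scene upside down."""
    line_time = frame_period_ms / lines
    t_direct = scene_row * line_time                 # scene row r is sensor row r
    t_mirror = (lines - 1 - scene_row) * line_time   # scene row r is sensor row lines-1-r
    return t_mirror - t_direct


FRAME_MS = 1000 / 60  # assumed 60 Hz capture
for row in (0, 540, 1079):
    print(f"scene row {row:4d}: {capture_skew_ms(row, 1080, FRAME_MS):+.2f} ms")
```

Under these assumptions the top and bottom of the picture are captured almost a full frame apart in the two channels, so moving objects there acquire a spurious disparity, while the centre of the picture is nearly unaffected.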

Horizontal keystoning from converged cameras

As the cameras' convergence (toe-in) is increased to bring close objects to the neutral depth plane, there will be differential keystone distortion between the channels. This will create vertical differences between objects in the Le and Re channels, which are particularly disturbing to viewers as there is no natural mechanism for such differences. (See Figure 6.)
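A pinhole-camera sketch makes the keystone effect concrete. Two cameras separated by a baseline are each rotated inward by a toe-in angle about the vertical axis; a scene point is projected into both, and the vertical image coordinates are compared. The baseline, convergence distance and test points are illustrative assumptions, and the focal length is normalized to 1.

```python
import math


def project(point, cam_x, yaw):
    """Pinhole projection (focal length 1) of scene point (X, Y, Z) into a camera at
    (cam_x, 0, 0) rotated by `yaw` radians about the vertical axis (positive turns toward +X)."""
    X, Y, Z = point
    dx = X - cam_x
    c, s = math.cos(yaw), math.sin(yaw)
    x_cam = dx * c - Z * s   # component along the camera's horizontal axis
    z_cam = dx * s + Z * c   # component along the camera's optical axis
    return x_cam / z_cam, Y / z_cam


def vertical_disparity(point, baseline, toe_in):
    """Le minus Re vertical image position for a symmetrically toed-in pair."""
    half = baseline / 2
    _, v_le = project(point, -half, +toe_in)  # Le camera turns toward +X
    _, v_re = project(point, +half, -toe_in)  # Re camera turns toward -X
    return v_le - v_re


baseline = 0.065                        # metres (assumed)
toe_in = math.atan2(baseline / 2, 3.0)  # converge on the centre line at 3 m
centre = (0.0, 0.6, 3.0)
corner = (1.0, 0.6, 3.0)

print("centre:", vertical_disparity(centre, baseline, toe_in))
print("corner:", vertical_disparity(corner, baseline, toe_in))
print("corner, parallel cameras:", vertical_disparity(corner, baseline, 0.0))
```

The centre point lands at the same height in both images, but a point towards the corner does not, and the difference disappears when the cameras are parallel. Multiplying by a focal length in pixels would express the result in pixels.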

3-D over a 3G link

The 3-D signal consists of a pair of co-timed video signals with audio, metadata, etc. 3G SDI allows the transport of two separate co-timed HD signals over a single link. As SMPTE standard ST 425 describes, all 3G signals consist of a pair of virtual streams. When the 3-D signal consists of a pair of 1.5Gb/s HD signals, it is necessary only to define which stream carries which of the Le and Re channels, and where the audio and metadata should be carried. SMPTE is currently defining just this set of parameters for a 3-D signal, and also a new ST 352 signal identifier code to specifically identify a 3-D pair. (See Figure 7.)
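The sketch below shows, in greatly simplified form, why carrying Le and Re as the two virtual streams of one link keeps them co-timed: both are multiplexed word by word into a single serial stream, so any delay the link adds applies to both equally. The alternating-word mapping is a simplification, not the exact ST 425 word layout, and assigning Le to stream 1 and Re to stream 2 is a hypothetical choice, since that assignment is what SMPTE was still defining.

```python
from itertools import chain


def interleave(stream_1, stream_2):
    """Multiplex two co-timed virtual streams into one link by alternating words
    (a simplified stand-in for the ST 425 mapping, not its exact word layout)."""
    if len(stream_1) != len(stream_2):
        raise ValueError("virtual streams must carry the same number of words")
    return list(chain.from_iterable(zip(stream_1, stream_2)))


def deinterleave(link):
    """Split the link back into its two virtual streams."""
    return link[0::2], link[1::2]


# Hypothetical assignment: virtual stream 1 carries Le, stream 2 carries Re.
le_words = [0x200, 0x040, 0x3AC, 0x1F0]
re_words = [0x201, 0x041, 0x3AD, 0x1F2]

link = interleave(le_words, re_words)
le_out, re_out = deinterleave(link)
print(link)
```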

3-D over multiple 3G links for 1080p production

The arguments for producing and archiving at the highest possible signal quality are as strong for 3-D as they are for 2D. Indeed, the potential for mismatched decisions in the de-interlacers of the two channels makes the argument for 1080p/60 or 1080p/50 even stronger for 3-D. However, each 1080p signal needs a data capacity of 3Gb/s, so a 3-D stereo pair needs a total of 6Gb/s.
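The capacity figures can be checked from the raster totals. The sketch assumes 4:2:2 10-bit sampling, where each sample period carries one luma word and one colour-difference word (20 bits), and uses the nominal integer-frame-rate totals of 2200 or 2640 samples per line and 1125 lines per frame; the 1/1.001 rates scale down by that factor. The helper names are illustrative.

```python
LINK_3G_BPS = 2_970_000_000  # nominal 3G-SDI serial rate, bits per second


def sdi_rate_bps(samples_per_line: int, lines_per_frame: int, frames_per_second: int) -> int:
    """Serial rate of a 4:2:2 10-bit signal: 20 bits (Y word + C word) per sample period,
    counted over the total raster including blanking."""
    return samples_per_line * lines_per_frame * frames_per_second * 20


def links_needed(total_bps: int) -> int:
    """Number of 3G links required, using integer ceiling division."""
    return -(-total_bps // LINK_3G_BPS)


formats = {
    "1080i/60 (30 frames/s)": (2200, 1125, 30),
    "1080p/60": (2200, 1125, 60),
    "1080p/50": (2640, 1125, 50),
}

for name, raster in formats.items():
    single = sdi_rate_bps(*raster)
    pair = 2 * single
    print(f"{name}: {single / 1e9:.3f} Gb/s each, stereo pair {pair / 1e9:.2f} Gb/s "
          f"-> {links_needed(pair)} x 3G link(s)")
```

An interlaced HD pair fits exactly into one 3G link, while a 1080p pair needs two, which is why matched multilink transport is required.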

Standards are being generated for multilink 3G SDI, with new ST 352 signal identifier codes, data transport formats and timings. (See Figure 8.)

Frame synchronization

A frame synchronizer is essentially a flexible buffer: it writes the input data to a store as it arrives, according to the input timing, and reads it back at the correct time for the desired output timing. If the input and output rates differ, the read and write positions gradually drift toward each other, and eventually the synchronizer must drop an input frame or repeat one, depending on whether the buffer is full or empty. The exact timing of this frame drop or repeat is not critically defined.
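A much simplified model shows where drops and repeats come from. It treats input frame i as available from tick i * t_in and always reads the newest available frame at each output tick. Times are integer ticks; the 999-, 1000- and 1001-tick frame periods are arbitrary illustrations of a slightly fast and a slightly slow input. A real synchronizer adds buffering and hysteresis, which this sketch omits, and the function names are hypothetical.

```python
def read_schedule(n_out, t_in, t_out, read_delay=0):
    """Input frame index read at each output frame, treating input frame i as available
    from tick i * t_in (ignoring write latency) and reading the newest available frame."""
    return [(k * t_out - read_delay) // t_in for k in range(n_out)]


def drops_and_repeats(schedule):
    """Output frames at which an input frame was skipped (drop) or reused (repeat)."""
    drops, repeats = [], []
    for k in range(1, len(schedule)):
        step = schedule[k] - schedule[k - 1]
        if step == 0:
            repeats.append(k)
        elif step > 1:
            drops.append(k)
    return drops, repeats


fast_drops, fast_repeats = drops_and_repeats(read_schedule(2000, t_in=999, t_out=1000))
slow_drops, slow_repeats = drops_and_repeats(read_schedule(2000, t_in=1001, t_out=1000))
print(f"fast input: drops at {fast_drops}, repeats at {fast_repeats}")
print(f"slow input: drops at {slow_drops}, repeats at {slow_repeats}")
```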

If two independent frame synchronizers were used for a 3-D stereo signal, it is practically certain that they would make their drop/repeat decisions at different times, leaving a potentially long period with a frame difference in timing between the Le and Re images. A 3-D optimized synchronizer needs to have a communication channel between the control circuits for the two channels. (See Figure 9.)
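Extending the same kind of simplified model to two channels shows the effect. Each independent unit is assumed to have settled on its own read delay when it locked to the input (the 0- and 500-tick values are hypothetical), so the two units cross the drop point at different output frames. In a linked pair, one controller computes the read schedule and both channels follow it, so they cannot diverge.

```python
def read_schedule(n_out, t_in, t_out, read_delay):
    """Input frame index read at each output frame, treating input frame i as available
    from tick i * t_in and reading the newest frame available read_delay ticks earlier."""
    return [(k * t_out - read_delay) // t_in for k in range(n_out)]


def mismatched_frames(le_schedule, re_schedule):
    """Output frames in which Le and Re are taken from different input frames."""
    return [k for k, (a, b) in enumerate(zip(le_schedule, re_schedule)) if a != b]


T_IN, T_OUT, N_OUT = 999, 1000, 2000  # slightly fast input, arbitrary tick units

# Independent units: each settled on its own read delay when it locked (hypothetical)
le_indep = read_schedule(N_OUT, T_IN, T_OUT, read_delay=0)
re_indep = read_schedule(N_OUT, T_IN, T_OUT, read_delay=500)

# Linked units: one controller computes a single schedule and both channels follow it
shared = read_schedule(N_OUT, T_IN, T_OUT, read_delay=0)
le_linked, re_linked = shared, list(shared)

print("independent:", len(mismatched_frames(le_indep, re_indep)), "frames out of step")
print("linked:", len(mismatched_frames(le_linked, re_linked)), "frames out of step")
```

With these values, the independent pair is a frame apart for 1,002 of the 2,000 output frames.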

Standards conversion

A standards converter creates new fields and frames by combining information from several input fields. The quality of the conversion is dependent on proprietary algorithms, which are of considerable value to the individual manufacturers and so are not published.

Even with identical converters, the exact decisions made by the conversion process will not be the same for the Le and Re channels, because the two input images differ slightly. Standards converters depend on a combination of linear and nonlinear decisions as to which input data to use to construct each output pixel. These decisions are dependent on many pixels from many input fields and may have a history going back several frames.

As with frame synchronizers, the decision-making processes for frame rate conversion need to be synchronized between the Le and Re signals.
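The sketch below illustrates one way the decisions could be tied together, using a motion-adaptive decision as the example of a nonlinear choice: a pixel is treated as moving when its frame-to-frame change exceeds a threshold. The threshold, the pixel values and the max() rule for combining channels are assumptions. Made independently, near-threshold decisions differ between the two channels; the shared version makes one decision per position and applies it to both.

```python
THRESHOLD = 10  # hypothetical motion threshold, in code values


def motion_mask(prev, curr, threshold=THRESHOLD):
    """Per-pixel nonlinear decision: True where the pixel is treated as moving."""
    return [abs(c - p) > threshold for p, c in zip(prev, curr)]


def shared_motion_mask(prev_le, curr_le, prev_re, curr_re, threshold=THRESHOLD):
    """One decision per pixel position, driven by both channels and applied to both."""
    return [max(abs(cl - pl), abs(cr - pr)) > threshold
            for pl, cl, pr, cr in zip(prev_le, curr_le, prev_re, curr_re)]


# Two nearly identical channels whose motion sits close to the threshold
prev_le, curr_le = [100, 100, 100, 100], [109, 111, 100, 130]
prev_re, curr_re = [100, 100, 100, 100], [111, 109, 100, 130]

independent_le = motion_mask(prev_le, curr_le)
independent_re = motion_mask(prev_re, curr_re)
shared = shared_motion_mask(prev_le, curr_le, prev_re, curr_re)
print(independent_le, independent_re, shared)
```

Comparing Le and Re at the same pixel position ignores their disparity; a practical design would have to decide how to associate corresponding regions of the two images before sharing decisions.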

If the video format is interlaced, there is an implicit de-interlacer in the standards converter. If the de-interlacer uses any nonlinear decision making, for example to improve the performance with near-horizontal diagonal lines, the control circuits should be matched between the two channels.
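Edge-based line averaging (ELA), a well-known directional de-interlacing method, serves here as a stand-in for the proprietary algorithms the article refers to. For each missing pixel, it compares three candidate directions across the lines above and below and averages along the best match. On an ambiguous diagonal edge, the two channels can pick opposite directions and produce very different values; choosing the direction from the combined Le and Re cost (an assumed rule) keeps them consistent. The pixel values are illustrative.

```python
DIRECTIONS = (-1, 0, 1)  # candidate diagonals: above[x + d] paired with below[x - d]


def ela_costs(above, below, x):
    """Absolute difference along each candidate direction for the missing pixel at x."""
    return {d: abs(above[x + d] - below[x - d]) for d in DIRECTIONS}


def ela_value(above, below, x, d):
    """Interpolated value along direction d."""
    return (above[x + d] + below[x - d]) / 2


def deinterlace_independent(rows, x):
    """Each channel picks its own best (lowest-cost) direction."""
    costs = ela_costs(*rows, x)
    d = min(DIRECTIONS, key=costs.get)
    return ela_value(*rows, x, d), d


def deinterlace_shared(le_rows, re_rows, x):
    """One direction, chosen from the summed Le and Re costs, applied to both channels."""
    le_costs, re_costs = ela_costs(*le_rows, x), ela_costs(*re_rows, x)
    d = min(DIRECTIONS, key=lambda k: le_costs[k] + re_costs[k])
    return ela_value(*le_rows, x, d), ela_value(*re_rows, x, d), d


# (line above, line below) around an ambiguous diagonal edge
le_rows = ([20, 40, 80], [79, 60, 22])
re_rows = ([21, 40, 80], [77, 60, 22])

print("independent Le:", deinterlace_independent(le_rows, 1))
print("independent Re:", deinterlace_independent(re_rows, 1))
print("shared:", deinterlace_shared(le_rows, re_rows, 1))
```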

Even the relatively simple process of aspect-ratio conversion has similar risks, as it includes a de-interlacer function.

Compression

3-D storage and recording are dependent on compression. Video compression works by transforming the video signal to a coding domain where different parts of the coded signal have very different sensitivities to distortion. The less sensitive parts are then heavily quantized and reordered to allow efficient run-length coding. The transform and the run-length coding are essentially transparent, but the quantizing is not. To minimize the visibility of the quantization process, many aspects of the signal are examined and used to drive the coding decisions. As with standards converters, these decision processes are highly proprietary and differ between manufacturers, as well as between equipment models.

In the case of long-GOP compression, the phasing of the GOP sequence is essentially arbitrary, and there is no reason for two long-GOP encoders to use the same GOP sequence. (See Figure 10.) If long-GOP compression is used for 3-D signal pairs, the GOP sequences should be synchronized. (See Figure 11.)
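The effect of arbitrary GOP phasing can be counted directly. The sketch assumes a 12-frame IBBPBBPBBPBB pattern, a common long-GOP structure used here only as an example, and compares picture types frame by frame for an Le encoder and an Re encoder whose GOP is offset by four frames (an arbitrary value). Because I, P and B pictures are typically coded at different quality, each type mismatch is a frame in which the two eyes can receive differently degraded images.

```python
GOP_PATTERN = "IBBPBBPBBPBB"  # example 12-frame long-GOP structure


def frame_types(n_frames, gop_phase, pattern=GOP_PATTERN):
    """Picture type of each frame for an encoder whose GOP is offset by gop_phase frames."""
    return [pattern[(n + gop_phase) % len(pattern)] for n in range(n_frames)]


def type_mismatches(le_types, re_types):
    """Number of frames where the two eyes are coded as different picture types."""
    return sum(a != b for a, b in zip(le_types, re_types))


le_types = frame_types(24, gop_phase=0)
re_free = frame_types(24, gop_phase=4)    # free-running encoder, arbitrary offset
re_locked = frame_types(24, gop_phase=0)  # GOP sequence locked to the Le encoder

print("free-running:", type_mismatches(le_types, re_free), "of 24 frames differ")
print("locked:", type_mismatches(le_types, re_locked), "of 24 frames differ")
```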

Conclusion

There are many processes in the passage of an image from the original scene to the final display that have the potential to alter the position or timing of an object in the image. For 2D images, viewers are relatively insensitive to small alterations, but for 3-D stereoscopic images, any differential distortion between the two channels can cause disturbing depth anomalies.

In some cases, such as optical distortion or timecode-based recording and editing, careful use of high-quality equipment will minimize the risks. In the case of equipment that manipulates the geometry and timing of the signal, such as synchronizers, aspect-ratio converters, format converters and standards converters, significant errors can occur unless the equipment is specifically designed for 3-D signals.

When the function of the equipment is lossy compression, the decision-making circuits that choose which aspects of the image to keep and which to discard need to be tuned to the viewer's need for adequate depth information, as well as the spatial and temporal information that has been considered for 2D images.

Many of these processes operate with a higher level of quality and consistency when the source material is progressive.

A key element of the signal path is transport through the production and post-production environment. A single 3Gb/s link allows matched transport of a pair of 1.5Gb/s signals. Transport of dual 1080p signals requires a method of matching a pair of 3Gb/s links. SMPTE ad-hoc group TC-32NF20 AHG Multi-link 3G is currently defining the standards for this and other uses of multiple 3Gb/s links.

Nigel Seth-Smith is strategic technology manager for Gennum.