Streaming multichannel uncompressed video

Designing video equipment for streaming multiple uncompressed video signals is a new challenge, especially with the demand for 1080p 3G-SDI streams. This article examines a multichannel streaming PCI Express (PCIe) Gen2 DMA controller, which can interface up to four 3G-SDI video streams. Such a solution could be used in nonlinear editors, video servers and video-capture applications.

The proposed solution supports both SD and HD applications and relies on an open-source video packet streaming-format protocol. The protocol manages the video traffic between an SDI block and a PCIe block in an FPGA, where most of the functionality resides in an efficient PCIe DMA controller used to stream live video.

Video streaming background

Streaming video may involve either on-demand or live broadcast of compressed or uncompressed video. Most broadcast studio applications rely on uncompressed SDI to move video within the studio, typically between switchers, servers and cameras. Compressed video streams are used in certain studio applications for video over IP. Uncompressed video streams provide low delay and introduce no compression artifacts. Compressed video is more prone to technical issues such as jitter, latency and loss of quality.

Video streaming consists of three steps. The first step is to convert the incoming SDI stream into the 20-bit parallel domain. Next, the signal is converted to another interface standard, such as PCIe or Ethernet. Finally, the video stream is routed to its final destination.

Broadcast studio trends

The transition to HD, and now to 1080p using 3G-SDI, is one of the latest broadcast technology trends. Another is implementation of tapeless workflows that offer integrated production and content management for multichannel and multiplatform applications. A tapeless workflow can stream video from ingest to playout while relying on the integration of production video servers, central storage and related production management software.

A file-based workflow permits the development of a tapeless work environment. This solution provides flexibility and better management of ingested video as it is edited and finally uploaded to broadcast servers for playout. A file-based system suits applications such as IP-based newsgathering, streaming to the Web, and live and on-demand video streaming.

For now, it appears that triple-rate SDI data will be the typical interface used in tapeless and file-based systems. In this scenario, the SDI video is converted to a file format that can be edited in a workstation.

In Figure 1, a Gen1 or Gen2 PCIe bus is used to receive and transmit the converted video stream to and from the SDI domain. SDI-PCIe bridging is used in video servers and video I/O cards for nonlinear editing. The figure shows the basic building blocks used to implement video streaming over the PCIe bus. The solution also relies on SDI cable equalizers and drivers as the front-end interface. The core of the video streaming and processing resides in the center block. Programmable logic integrated circuits are used in this block to create a custom implementation for specific solution requirements. The PCIe bus interface can also reside in the same programmable chip.

Video-server and video-capture applications

A video server and a video-capture I/O card share the same basic front-end hardware architecture, as shown in Figure 2. The ingest and playout functions can be located on a single card or on separate cards, depending on the manufacturer. Some server applications provide the option of encoding or decoding the raw video using H.264 or MPEG-2 HD in the video processing block.

The PCIe block provides the video-streaming capability for the converted SDI signal to feed the workstation. This block is often the bottleneck, especially if multiple 3G-SDI streams are being processed. Four 3G-SDI full duplex video streams over a PCIe Gen2 card with four lanes would translate into a data rate of 13.5Gb/s. Therefore, this block has to be highly efficient and must provide a high QoS.
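As a rough sanity check on the headroom involved, the sketch below compares nominal rates only: 2.97Gb/s per 3G-SDI link, and 5GT/s per Gen2 lane with 8b/10b line coding. It ignores PCIe packet headers, flow control and SDI blanking, so it is not a derivation of the 13.5Gb/s figure quoted above. The function names are illustrative.

```python
# Nominal figures only; packet and protocol overheads are deliberately ignored.
SDI_3G_BITS_PER_S = 2.97e9        # nominal 3G-SDI serial rate
PCIE_GEN2_TRANSFERS_PER_S = 5e9   # Gen2 signaling rate per lane (5 GT/s)
ENCODING_EFFICIENCY = 8 / 10      # 8b/10b line coding used by Gen1 and Gen2


def sdi_bits_per_s(streams: int) -> float:
    """Aggregate serial rate of `streams` 3G-SDI links in one direction."""
    return streams * SDI_3G_BITS_PER_S


def pcie_gen2_bits_per_s(lanes: int) -> float:
    """Per-direction link rate after 8b/10b coding, before packet overhead."""
    return lanes * PCIE_GEN2_TRANSFERS_PER_S * ENCODING_EFFICIENCY


if __name__ == "__main__":
    video = sdi_bits_per_s(4)
    link = pcie_gen2_bits_per_s(4)
    print(f"4 x 3G-SDI: {video / 1e9:.2f} Gb/s per direction")
    print(f"Gen2 x4   : {link / 1e9:.2f} Gb/s per direction")
    print(f"Link load : {video / link:.0%}")
```

Even on these optimistic numbers the video occupies roughly three quarters of the link in each direction, which leaves little margin for an inefficient DMA engine.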

Implementing a multichannel 1080p SDI-PCIe bridge

The PCIe architecture within a modern PC workstation has more than sufficient bandwidth to simultaneously transfer several 1080p60 video streams. The challenge for the designer is to use this bandwidth without placing excessive demands on either the CPU or the local storage on the PCIe capture card. The choice of an efficient DMA controller is therefore central to the success of the project.

By packing three 20-bit pixels into two 32-bit words, an active video frame of 1920 pixels × 1080 lines requires just over 5.27MB. In an ideal world, the CPU would move this video frame from the capture card into the system memory of the PC with a single DMA transfer. Unfortunately, this usually is not possible due to the way PC operating systems allocate memory. System memory normally is allocated in 4KB blocks, and there is no guarantee that a request for 5.27MB of memory will result in consecutive physical locations being available.
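The frame-size arithmetic can be checked with a short sketch. The bit layout inside the two words is an assumption made for illustration (first pixel in the low bits, four unused bits at the top of the second word); the article does not specify one, and the function names are hypothetical.

```python
from typing import Tuple

PIXEL_BITS = 20
PIXEL_MASK = (1 << PIXEL_BITS) - 1
WORD_MASK = 0xFFFFFFFF
PAGE_BYTES = 4096


def pack3(p0: int, p1: int, p2: int) -> Tuple[int, int]:
    """Pack three 20-bit pixels into two 32-bit words (assumed layout)."""
    for p in (p0, p1, p2):
        if not 0 <= p <= PIXEL_MASK:
            raise ValueError("pixel does not fit in 20 bits")
    combined = p0 | (p1 << PIXEL_BITS) | (p2 << (2 * PIXEL_BITS))
    return combined & WORD_MASK, combined >> 32


def unpack3(w0: int, w1: int) -> Tuple[int, int, int]:
    """Inverse of pack3."""
    combined = w0 | (w1 << 32)
    return (
        combined & PIXEL_MASK,
        (combined >> PIXEL_BITS) & PIXEL_MASK,
        (combined >> (2 * PIXEL_BITS)) & PIXEL_MASK,
    )


def frame_bytes(width: int = 1920, height: int = 1080) -> int:
    """Bytes needed for one active frame: 8 bytes per group of 3 pixels."""
    groups = -(-(width * height) // 3)  # ceiling division
    return groups * 2 * 4


if __name__ == "__main__":
    size = frame_bytes()
    print(size, "bytes")                   # 5529600 bytes
    print(size / 2**20, "MB")              # 5.2734375 MB
    print(size // PAGE_BYTES, "4KB pages")  # 1350 4KB pages
```

A 1080-line frame therefore spans exactly 1350 4KB pages, which is the number of separate transfers the DMA controller has to handle per frame.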

This requires most PC DMA controllers to support a scatter-gather mode, as illustrated in Figure 3. In this mode, the CPU creates a linked list of DMA instructions, each of which transfers just 4KB. The DMA controller processes each segment of this list in turn, automatically fetching the next entry on the list as it is needed. In this way, the controller can be programmed to cope with the PC's fragmented memory allocation without placing excessive demands on the CPU.
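The linked-list idea can be modeled in a few lines of software. This is a behavioral sketch only: the descriptor fields, the dictionary standing in for host memory and the function names are all hypothetical, and a real controller defines its own descriptor layout in hardware.

```python
from typing import Dict, List, Optional

SEGMENT_BYTES = 4096


class Descriptor:
    """One hypothetical linked-list entry describing a single 4KB transfer."""

    def __init__(self, src_offset: int, dst_addr: int, length: int) -> None:
        self.src_offset = src_offset   # offset into the card's frame buffer
        self.dst_addr = dst_addr       # physical address of one host page
        self.length = length           # bytes to move for this segment
        self.next: Optional["Descriptor"] = None


def build_chain(page_addrs: List[int], total_bytes: int) -> Optional[Descriptor]:
    """CPU side: link one descriptor per host page, in frame order."""
    needed = -(-total_bytes // SEGMENT_BYTES)  # ceiling division
    if len(page_addrs) < needed:
        raise ValueError("not enough host pages for this transfer")
    head: Optional[Descriptor] = None
    tail: Optional[Descriptor] = None
    offset = 0
    for addr in page_addrs[:needed]:
        length = min(SEGMENT_BYTES, total_bytes - offset)
        desc = Descriptor(offset, addr, length)
        if tail is None:
            head = desc
        else:
            tail.next = desc
        tail = desc
        offset += length
    return head


def run_dma(head: Optional[Descriptor], frame: bytes,
            host_memory: Dict[int, bytes]) -> int:
    """Controller side: walk the chain, copying each segment in turn."""
    moved = 0
    desc = head
    while desc is not None:
        end = desc.src_offset + desc.length
        host_memory[desc.dst_addr] = frame[desc.src_offset:end]
        moved += desc.length
        desc = desc.next
    return moved


if __name__ == "__main__":
    pages = [0x9000, 0x2000, 0x7000]   # deliberately non-contiguous
    frame = bytes(range(256)) * 40     # 10240 bytes of dummy video
    memory: Dict[int, bytes] = {}
    print(run_dma(build_chain(pages, len(frame)), frame, memory))
```

The CPU's only job is to build the chain once; the controller then follows the `next` links without further intervention, regardless of where the pages sit in physical memory.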

One drawback of PCIe is that its complexity leads to higher latency than previous bus architectures. This is especially a problem for read transactions, where the requester issues a packet to the completer asking for the data, which the completer then returns. In PCIe terms, the requester is the device that initiates a transaction, and the completer is the device that the request addresses; for a read, the completer answers with one or more completion packets carrying the requested data. A read therefore incurs twice the PCIe link delay, once for the request and once for the completion. Several mechanisms are available to mitigate the effect of this latency, such as supporting large packet sizes and multiple outstanding read requests. In any case, the designer must confirm that the chosen solution resolves the issue sufficiently.

If a DMA controller waits for the completer to return the data for the previous read before issuing the next read request, the overall transfer performance from the system memory to the capture card will be poor. To improve downstream efficiency, the DMA controller must support multiple outstanding read requests. This means the DMA controller must always be looking ahead and issuing upcoming read requests before the previous bus transaction has completed. In this case, the effects of the read latency on overall bandwidth utilization will be minimized.
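A simple first-order model shows why this matters. It assumes that throughput is limited either by the link or by how much data can be in flight during one round trip, and it ignores the time the completions themselves occupy on the link. The 512-byte request size, 1µs round trip and 2GB/s link used in the example are illustrative assumptions, not measurements.

```python
import math


def read_throughput(outstanding: int, request_bytes: int,
                    round_trip_s: float, link_bytes_per_s: float) -> float:
    """Bytes/s achievable with `outstanding` read requests in flight."""
    if outstanding < 1:
        raise ValueError("at least one outstanding request is required")
    latency_bound = outstanding * request_bytes / round_trip_s
    return min(latency_bound, link_bytes_per_s)


def requests_to_fill_link(request_bytes: int, round_trip_s: float,
                          link_bytes_per_s: float) -> int:
    """Smallest number of in-flight reads that hides the round trip."""
    return math.ceil(link_bytes_per_s * round_trip_s / request_bytes)


if __name__ == "__main__":
    # Assumed example values, chosen only to illustrate the trend.
    req, rtt, link = 512, 1e-6, 2e9
    for n in (1, 2, 4, 8):
        rate = read_throughput(n, req, rtt, link)
        print(f"{n} outstanding: {rate / 1e6:.0f} MB/s")
    print("needed to fill link:", requests_to_fill_link(req, rtt, link))
```

Under these assumptions a controller that issues one read at a time reaches only about a quarter of the link rate, while four or more requests in flight saturate it.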

The effect is still present, however, for each individual read access. This is a particular issue for scatter-gather DMA controllers. It is no longer acceptable for controllers to wait until the end of a scatter-gather segment before fetching the next set of instructions. To do so would further lower the overall efficiency of the system, and would increase the amount of local storage required to hold the extra video data while the next element of the linked list is retrieved.
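The cost of waiting at each segment boundary can be estimated with another small model. It compares a controller that fetches each descriptor only after the previous segment finishes with one that fetches the next descriptor while the current segment is still moving. The 2µs segment time and 1µs descriptor fetch time are assumed values for illustration, and the function names are hypothetical.

```python
def transfer_time(segments: int, data_s: float, fetch_s: float,
                  prefetch: bool) -> float:
    """Total time to move `segments` scatter-gather segments."""
    if segments < 1:
        raise ValueError("at least one segment is required")
    if not prefetch:
        # Every segment pays the full descriptor fetch latency.
        return segments * (fetch_s + data_s)
    # Only the first fetch is exposed; later fetches overlap data movement.
    period = max(data_s, fetch_s)
    return fetch_s + (segments - 1) * period + data_s


def stall_buffer_bytes(video_bytes_per_s: float, stall_s: float) -> float:
    """Video that arrives, and must be held on the card, during one stall."""
    return video_bytes_per_s * stall_s


if __name__ == "__main__":
    segments = 1350  # 4KB segments in one packed 1080-line frame
    slow = transfer_time(segments, 2e-6, 1e-6, prefetch=False)
    fast = transfer_time(segments, 2e-6, 1e-6, prefetch=True)
    print(f"no prefetch: {slow * 1e3:.3f} ms per frame")
    print(f"prefetch   : {fast * 1e3:.3f} ms per frame")
```

With these figures, prefetching cuts the per-frame transfer time by about a third, and the card no longer has to buffer the video that arrives during each descriptor fetch.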

A DMA controller core that meets these handshake timing requirements efficiently is readily available from chip vendors, so a circuit designer can simply drop the core into a complex design using only a graphical design-entry tool. (See Figure 4.) This tool allows the 1080p 3G-SDI front end and the PCIe DMA controller to be interconnected without actually writing code for this video-streaming application.

The complete PCIe video streaming interface, together with the DMA controller, is included in the “streaming_dma_0” design module of the IP core. The tool environment uses an open-source generic FPGA video interface protocol to allow IP blocks from different vendors to communicate with each other. The design-entry tool is then used to connect the SDI IP core and the DMA streaming controller together in a single chip design, as shown in Figure 5.

Conclusion

The availability of video-specific development boards, IP cores and user-friendly design tools turns a complex video-streaming system design into a much simpler task for the average engineer. Free video reference designs from IC vendors also can be used as starting points to create video-streaming application projects.

Tam Do is the senior technical marketing manager for Altera.