As broadcasters transition to file-based media production, large disk-based storage systems are becoming the fundamental media service of the production architecture. However, media traffic presents much more rigorous throughput requirements than classical IT solutions. Storage components must handle gigabyte-size files, large chunks of data in one I/O (typically up to 4 MB) and continuous streams of traffic bursts over the storage network.
To increase throughput, media storage solutions distribute, or stripe, data over several distinct storage systems. Because every server needs parallel access to every storage system, media storage often relies on storage cluster concepts typically used in high-performance computing (HPC). These clusters employ a large number of devices, leading to complex storage network architectures. However, while HPC clusters typically exchange mostly small messages, media networks are continuously loaded to full capacity, leading to network congestion and sustained oversubscription of the switch ports.
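The striping described above can be sketched as a simple placement function. This is an illustrative model only, not the layout used by any particular media file system; the chunk size follows the 4 MB I/O size cited earlier, and the four-target count is an assumption for the example.

```python
# Hypothetical sketch: striping a large media file over several storage
# targets in fixed-size chunks, so every server reads all targets in parallel.
CHUNK = 4 * 1024 * 1024  # 4 MB per I/O, as cited in the article
N_TARGETS = 4            # number of distinct storage systems (assumed)

def stripe_plan(file_size, chunk=CHUNK, n_targets=N_TARGETS):
    """Map each chunk index to (chunk, target, offset-within-target)."""
    plan = []
    n_chunks = (file_size + chunk - 1) // chunk  # round up to whole chunks
    for i in range(n_chunks):
        target = i % n_targets                   # round-robin placement
        local_offset = (i // n_targets) * chunk  # offset on that target
        plan.append((i, target, local_offset))
    return plan

# A 10 MB file spreads its three 4 MB chunks over targets 0, 1 and 2.
for idx, tgt, off in stripe_plan(10 * 1024 * 1024):
    print(f"chunk {idx} -> target {tgt} at offset {off}")
```

Because consecutive chunks land on different targets, a single sequential read fans out into parallel I/O on all targets — which is exactly why every server needs a path to every storage system.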
These circumstances present a significant challenge: How can network engineers design a scalable storage network that can sustain the continuous throughput required by file-based media production while maintaining high efficiency and network use? As this article describes, the most significant barrier is traffic interference. Previously, VRT-medialab demonstrated that a storage cluster architecture employing Data Center Bridging (DCB) technology on Cisco switches and the PAUSE frame mechanism defined in the IEEE 802.3x standard can achieve higher link bandwidth use and scalability than traditional InfiniBand (IB) solutions. (See Broadcast Engineering, January 2010.) However, the fundamental impediment of traffic interference remains for both 802.3x- and IB-based clusters.
Our laboratory sought to address this with priority flow control (PFC). We performed a series of comparative tests between 802.3x- and PFC-enabled storage clusters. Ultimately, we found that PFC eliminates traffic interference and supports a highly scalable storage network that sustains 100-percent efficiency.
Media storage architectures
Because media file systems stripe data over several storage systems, every server needs parallel access to each of them. Like classical IT storage area networks (SANs), most first-generation media file systems use a single Fibre Channel (FC) storage network to connect every file server node with every storage controller. The result is a complex network topology that is ill-suited for media environments. VRT-medialab demonstrated that under sustained media storage traffic loads, the long traffic bursts interfere with each other in the switch buffers and cause severe efficiency loss. (See Figure 1.)
As shown, when multiple sources deliver long bursts of traffic to the same destination, throughput of the source links is limited by the bandwidth of the aggregating link. However, when a second destination requests data from the same source storage controller (see purple traffic flow in Figure 1), the second destination server does not receive the full bandwidth available at the shared source link. Because the switch port buffers are filled with “blue” traffic, the purple flow can only pass a data frame every time a blue packet is read by the left destination server — a problem exacerbated by the fact that the left destination is reading from four sources simultaneously. Traffic interference occurs, and traffic flow to the second destination slows. Extrapolating this effect to larger media storage network topologies, efficiency severely deteriorates, limiting the scalability of any FC-based media storage environment.
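The arithmetic behind this interference effect can be made concrete. The numbers below assume 10 Gb/s links and four sources feeding one destination, as in the Figure 1 scenario; they are illustrative values, not measurements.

```python
# Illustrative arithmetic for the interference in Figure 1 (assumed
# 10 Gb/s links, four sources bursting to the "blue" destination).
LINK = 10.0      # Gb/s per link (assumption)
N_SOURCES = 4    # sources feeding the first ("blue") destination

# The aggregating link caps each blue flow at a fraction of its source link:
blue_per_source = LINK / N_SOURCES          # 2.5 Gb/s per source

# With shared switch buffering, the second ("purple") reader on one of
# those sources can only interleave a frame each time a blue frame drains,
# so it is throttled to roughly the blue drain rate instead of the spare
# capacity the source link actually has:
purple_actual = blue_per_source             # ~2.5 Gb/s observed
purple_ideal = LINK - blue_per_source       # 7.5 Gb/s the link could spare

print(f"purple gets ~{purple_actual} Gb/s instead of {purple_ideal} Gb/s")
```

The gap between the spare capacity and the delivered rate is the interference penalty, and it compounds as the topology grows.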
DCB-based WARP cluster network
These limitations can be partially overcome by splitting the storage network into two separate networks. (See Figure 2.) This can be accomplished using IBM's General Parallel File System (GPFS) and a Workhorse Application Raw Power (WARP) media storage cluster consisting of storage cluster nodes and network-attached cluster nodes (NAN). This architecture has a much simpler topology.
DCB transport is well-suited as the cluster network for this type of media storage architecture. DCB allows flows to be tightly controlled and load-balanced over the links and uses the 802.3x PAUSE mechanism to provide link-level flow control similar to FC, creating a “lossless” environment. The result is a notable improvement in scalability and link bandwidth use compared with FC or even IB; however, the fundamental effects of traffic interference remain. (See Figure 3.)
As shown, when multiple NAN nodes read traffic from the storage nodes, each storage node responds with large bursts of media traffic toward each requesting NAN node (depicted as different colors). At the converged network adapter (CNA) network interface of the storage node, the bursts are queued in the network interface buffer. These frames are sent to the switch (shown here as a Cisco Nexus 5000), where they end up in a single ingress queue buffer. Because 802.3x PAUSE link flow control is configured, the switch sends a PAUSE frame to the storage node once the high threshold of the buffer is reached, thereby avoiding frame loss.
In this example, three different NAN nodes are reading frames out of this buffer and also from the other storage nodes. This limits the total reading bandwidth on this port to only 75 percent of the incoming traffic throughput. Hence, the buffer fills up, and the PAUSE mechanism kicks in. If, because of the bursty nature of the traffic per flow, the filling of the switch port buffer is not equally distributed over the three different “colors,” one of the colors (or traffic flows) can be depleted by the simultaneously reading NAN nodes before the buffer reaches its low threshold and unpauses the link. When this happens, no frame from the depleted color is available, resulting in a “read-miss” of the NAN node and a drop in efficiency. The issue continues until the link is unpaused and frames of the missing color are again provided out of the network interface queue of the storage node. This efficiency loss can cause significant performance degradation in the network. Fortunately, there is a solution to this dilemma.
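The depletion scenario just described can be reproduced with a toy discrete-step model: three colors share one ingress buffer, the whole link pauses at a high threshold and resumes at a low one, and bursty single-color arrivals let one color run dry while the link is paused. All thresholds, burst sizes and rates here are invented for illustration.

```python
import random
from collections import deque

# Toy single-queue model of the 802.3x behavior described above.
random.seed(1)
HIGH, LOW, CAP = 12, 4, 16          # pause/resume thresholds, buffer cap

buf = deque()                       # one shared ingress buffer
paused = False                      # 802.3x pauses the WHOLE link
read_misses = {c: 0 for c in "RGB"} # per-color read-miss counters

for step in range(2000):
    # While unpaused, the storage node pushes a burst of one random color.
    if not paused:
        color = random.choice("RGB")
        for _ in range(3):          # bursty arrivals, uneven per color
            if len(buf) < CAP:
                buf.append(color)
    if len(buf) >= HIGH:
        paused = True               # PAUSE the entire link
    # Three NAN readers each want one frame of "their" color per step.
    for color in "RGB":
        if color in buf:
            buf.remove(color)
        else:
            read_misses[color] += 1 # color depleted: read-miss
    if paused and len(buf) <= LOW:
        paused = False              # un-PAUSE below the low threshold

print("read-misses per color:", read_misses)
```

Because the bursts fill the shared buffer unevenly, every color suffers read-misses sooner or later — the efficiency loss the text describes.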
Priority flow control
DCB provides another, more advanced flow control mechanism: PFC, standardized as IEEE 802.1Qbb. IEEE 802.1Q defines a tag that contains a three-bit priority field, allowing engineers to assign one of eight priorities to different Layer 2 traffic flows. With PFC, the network can be configured to pause traffic labeled with a specific priority (or “p-value”) independently of the other traffic. The mechanism works the same way as 802.3x PAUSE but selectively, per traffic class, instead of pausing the whole link at once. Effectively, each traffic class gains its own independent buffers and pause mechanism.
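The 802.1Q tag layout the text refers to can be shown in a few lines: the 16-bit Tag Control Information (TCI) field carries the 3-bit priority (PCP), a 1-bit DEI and a 12-bit VLAN ID, preceded by the 0x8100 EtherType. The specific priority and VLAN values below are arbitrary examples.

```python
import struct

# Sketch of the 802.1Q tag: TPID (0x8100) followed by the 16-bit TCI,
# whose top three bits are the priority (p-value) PFC acts on.
TPID = 0x8100  # 802.1Q EtherType

def make_tci(pcp, dei, vlan_id):
    """Pack priority (3 bits), DEI (1 bit) and VLAN ID (12 bits) into TCI."""
    assert 0 <= pcp <= 7 and dei in (0, 1) and 0 <= vlan_id <= 4095
    return (pcp << 13) | (dei << 12) | vlan_id

def vlan_tag_bytes(pcp, vlan_id):
    """Return the 4-byte 802.1Q tag inserted into an Ethernet frame."""
    return struct.pack("!HH", TPID, make_tci(pcp, 0, vlan_id))

tag = vlan_tag_bytes(pcp=5, vlan_id=100)
print(tag.hex())  # prints "8100a064": TPID, then priority 5 and VLAN 100
```

The three PCP bits are what limit the scheme to eight traffic classes per link — the "p-values" that PFC pauses selectively.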
Whereas an 802.3x DCB WARP cluster will have traffic interference at oversubscribed ports, PFC can link different priorities to the traffic flows between two specific nodes of the storage cluster, allowing engineers to implement flow control for each distinctive traffic flow. (See Figure 4.) Ultimately, this solution eliminates traffic interference.
Consider again the situation shown in Figure 3, in which multiple NAN nodes read traffic from the storage nodes and each storage node responds with large bursts of media traffic toward each requesting NAN node (again depicted as different colors). This time, however, each flow between a distinctive source-destination pair is labeled with a different priority value.
With PFC activated on the CNA, each p-value-labeled flow has a separate buffer in the network interface, and the bursts are queued into the dedicated network interface buffer for each respective color. On the other side of the link, the Nexus 5000 DCB switch port also uses dedicated queue buffers for each p-value, providing for separate sending and receiving queue buffers at both ends of the link for each color. Frames are picked in a round-robin fashion out of the different CNA queues and sent over the link, where they fill up their respective ingress queue buffers of the switch port.
In Figure 4, three different NAN nodes read frames out of the buffers for their respective colors, and also from the other storage nodes, once again filling the buffers. This time, the PFC PAUSE mechanism kicks in. Now, because each flow fills its own buffer independently, the switch can send a selective pause-frame to the server when necessary, pausing only one traffic flow without interfering with others. At the same time, the independent flow control mechanisms for each flow keep enough frames available in the independent receiving switch port buffers for each color.
Hence, none of the streams is depleted by the simultaneously reading NAN nodes, and the reading links continuously operate at maximum efficiency. As long as each storage-NAN server pair has an independent priority value and queue, no traffic interference occurs. Throughput scales linearly as the cluster grows, and the storage cluster network achieves 100-percent efficiency.
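The PFC counterpart of the earlier shared-buffer model makes the difference visible: give each color its own queue and its own pause state, and pausing one flow never starves another. As before, all thresholds, burst sizes and rates are invented for illustration.

```python
from collections import deque

# Toy per-priority model of the PFC behavior described above: each color
# (priority) has its own queue and its own selective pause state.
HIGH, LOW = 12, 4                       # per-queue pause/resume thresholds

queues = {c: deque() for c in "RGB"}    # one buffer per priority
paused = {c: False for c in "RGB"}      # one pause state per priority
read_misses = {c: 0 for c in "RGB"}

for step in range(2000):
    for color in "RGB":
        q = queues[color]
        if not paused[color]:
            q.extend(color * 3)         # bursty arrivals for this priority
        if len(q) >= HIGH:
            paused[color] = True        # selective PFC pause: this flow only
        if q:                           # the NAN node reads its own color
            q.popleft()
        else:
            read_misses[color] += 1     # would be a read-miss
        if paused[color] and len(q) <= LOW:
            paused[color] = False       # resume this priority only

print("read-misses per color:", read_misses)
```

Each queue cycles between its thresholds without ever emptying, so no reader misses a frame — the per-flow flow control keeps frames of every color available, in line with the behavior the article measured.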
Our laboratory performed comparative tests between 802.3x- and PFC-enabled WARP clusters for both Linux and Windows NAN nodes. The tests included single-stream throughput to/from one to four NAN nodes and multiple (four) streams to/from one to four NAN nodes (for a more even saturation of the link bandwidth), independently verifying the efficiency for both reading and writing.
Tables 1 and 2 provide the results for the Linux cluster. The “percent single node” column compares the throughput per NAN node with the throughput obtained when using a single NAN node only. Tables 3 and 4 provide the results for the Windows cluster.
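The "percent single node" figures in the tables follow from a simple calculation: aggregate throughput divided by node count, relative to the single-node baseline. The helper below is a sketch of that derivation, fed with two values from Table 1 (Linux 802.3x, reads).

```python
# How the "percent single node" columns in Tables 1-4 are derived:
# per-node throughput relative to the one-NAN-node baseline.
SINGLE_NODE = 10.0  # Gb/s with one NAN node (Table 1 baseline)

def percent_single_node(aggregate_gbps, n_nodes, single=SINGLE_NODE):
    """Aggregate throughput per node, as a percentage of the baseline."""
    return round(aggregate_gbps / n_nodes / single * 100)

# Values from Table 1 (Linux 802.3x cluster, one-stream reads):
print(percent_single_node(16.3, 2))  # two NAN nodes -> 82
print(percent_single_node(20.3, 3))  # three NAN nodes -> 68
```

A value of 100 therefore means perfectly linear scaling; the sub-100 figures in the 802.3x tables quantify the interference penalty directly.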
These results clearly demonstrate both the substantial impact of traffic interference on media storage networks and the extraordinary improvements in scalability and network bandwidth use when using PFC. In the 802.3x clusters, traffic interference causes a performance drop of up to 40 percent when using four NAN nodes simultaneously. The same traffic interference and performance drop have been previously measured in IB-based WARP clusters. When PFC is enabled, however, no traffic interference is observed at all. (The small performance drop when reading from four NAN nodes is caused by the fact that the file system can't launch prefetches for reading requests aggressively enough to overcome the statistical response fluctuation of the storage system when running continuously at full throttle. This effect is not observed when writing.)
The test proved unequivocally that the PFC-enabled cluster network can sustain 100-percent efficiency at continuous full throttle — demonstrating ideal scalability and an optimal storage solution for IP media environments. Windows performance is only marginally less than Linux performance but still displays linear scalability and almost 100-percent use of the available bandwidth. Clearly, the PFC-enabled cluster network outperforms similar IB-based cluster architectures in both throughput (especially for Windows) and linear scalability.
Luc Andries is a senior infrastructure architect and storage and network expert with VRT-medialab, the research and development arm of Flemish public radio and TV broadcaster VRT.
Table 1. Test results for Linux 802.3x-enabled, DCB-based WARP cluster

Linux 802.3x DCB-based cluster   Read (Gb/s)  Percent single node  Write (Gb/s)  Percent single node
One NAN node
    One stream (dd)              10           100                  10            100
    Four streams (dd)            9.9          100                  10            100
Two NAN nodes
    One stream (dd)              16.3         82                   17.1          86
    Four streams (dd)            16.3         82                   17.8          89
Three NAN nodes
    One stream (dd)              20.3         68                   23.3          78
    Four streams (dd)            20.3         68                   21            70
Four NAN nodes
    One stream (dd)              24.2         61                   24.2          61
    Four streams (dd)            24.3         61                   24            60

Table 2. Test results for Linux PFC-enabled, DCB-based WARP cluster

Linux PFC DCB-based cluster      Read (Gb/s)  Percent single node  Write (Gb/s)  Percent single node
One NAN node
    One stream (dd)              10           100                  9.9           100
    Four streams (dd)            10           100                  10            100
Two NAN nodes
    One stream (dd)              19.9         100                  19.9          100
    Four streams (dd)            19.9         100                  19.9          100
Three NAN nodes
    One stream (dd)              29.9         100                  29.8          100
    Four streams (dd)            29.9         100                  29.8          100
Four NAN nodes
    One stream (dd)              36.1         91                   39.5          99
    Four streams (dd)            37.6         94                   39.9          100

Table 3. Test results for Windows 802.3x-enabled, DCB-based WARP cluster

Windows 802.3x DCB-based cluster Read (Gb/s)  Percent single node  Write (Gb/s)  Percent single node
One NAN node
    One stream (dd)              10           100                  9.9           100
    Four streams (dd)            10           100                  9.9           100
Two NAN nodes
    One stream (dd)              16.2         81                   17.3          87
    Four streams (dd)            16.2         81                   16.5          83
Three NAN nodes
    One stream (dd)              19.6         65                   20.8          70
    Four streams (dd)            19.9         66                   20.6          69
Four NAN nodes
    One stream (dd)              22.6         57                   23.5          59
    Four streams (dd)            23.1         58                   23.4          59

Table 4. Test results for Windows PFC-enabled, DCB-based WARP cluster

Windows PFC DCB-based cluster    Read (Gb/s)  Percent single node  Write (Gb/s)  Percent single node
One NAN node
    One stream (dd)              10           100                  9.7           100
    Four streams (dd)            10           100                  9.8           100
Two NAN nodes
    One stream (dd)              19.9         100                  19.4          100
    Four streams (dd)            20           100                  19.5          100
Three NAN nodes
    One stream (dd)              29.9         100                  28.7          99
    Four streams (dd)            29.9         100                  28.8          98
Four NAN nodes
    One stream (dd)              36.3         91                   37.3          96
    Four streams (dd)            37.1         93                   37.7          97