Ethernet for audio networks

As broadcast technology evolves, the demands placed on it become more complex, driving the development of new systems. Networking technology has been available for use in broadcast systems for some years, but it has taken time for its use to become widespread, and with good reason.

Until recently, networked A/V systems were cumbersome and expensive, and the benefits they offered over hard-wired systems may not have justified the investment. Fixed-topology Ethernet broadcast architectures designed around traditional IT infrastructures have continued to dominate.

However, networking technology has reached a stage where the operational flexibility it offers justifies the investment, and its evolution continues apace.

Networking hinges on understanding the basic unit of currency: the frame, also referred to as the etherframe or Media Access Control (MAC) frame. (See Figure 1.) Without the frame, a receiver can't know where the data stops and the noise begins.

The MAC header contains information about where the data is going, where it came from and what kind of data it is. The addresses are six-byte MAC-48 addresses, which allow data dropped into a network made up of many routers and switches to reach the destination address. The data could be text, image fragments or tiny amounts of audio (about 8/1000 of a second). The cyclic redundancy check (CRC) at the end of the frame lets the receiver detect whether the frame was corrupted in transit.
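As a rough illustration of the frame layout described above, the following Python sketch assembles a frame from a destination address, a source address, a type field, a payload and a trailing CRC, then checks it on receipt. The function names, the example addresses and the use of EtherType 0x88B5 (a value set aside for local experimental use) are assumptions made for illustration and are not taken from any audio product. The preamble is omitted, and Python's zlib.crc32 is used on the understanding that it shares the CRC-32 polynomial of the Ethernet frame check sequence.

```python
import struct
import zlib

MIN_PAYLOAD = 46            # shorter payloads are padded up to this size
EXPERIMENTAL_TYPE = 0x88B5  # assumed here: a local experimental EtherType


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble header + payload + CRC (hypothetical helper, no preamble)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC-48 addresses are six bytes long")
    payload = payload.ljust(MIN_PAYLOAD, b"\x00")        # pad short payloads
    header = dst + src + struct.pack("!H", ethertype)    # where to, where from, what kind
    body = header + payload
    crc = struct.pack("<I", zlib.crc32(body))            # 4-byte check value
    return body + crc


def frame_is_intact(frame: bytes) -> bool:
    """Recompute the CRC over everything but the last 4 bytes and compare."""
    body, crc = frame[:-4], frame[-4:]
    return struct.pack("<I", zlib.crc32(body)) == crc


dst = bytes.fromhex("020000000002")   # made-up, locally administered addresses
src = bytes.fromhex("020000000001")
frame = build_frame(dst, src, EXPERIMENTAL_TYPE, b"audio bytes")

corrupted = bytearray(frame)
corrupted[20] ^= 0x01                 # flip one bit "in transit"
print(len(frame), frame_is_intact(frame), frame_is_intact(bytes(corrupted)))
```

A single flipped bit is enough for the receiver's recomputed CRC to disagree with the one carried in the frame, which is how a damaged frame is recognized and discarded.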

This is the essential fabric of an Ethernet network. It is self-learning, adaptable and can be quite efficient. However, there are limitations. Such networks can easily overload because the switches have a finite capacity.
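The "self-learning" behavior can be pictured with a minimal sketch of how a switch typically builds its forwarding table: it notes which port each source address arrived on, and floods frames for destinations it has not yet seen. The class and method names are hypothetical, and real switches add table aging, finite table sizes and buffering that are not modeled here.

```python
class LearningSwitch:
    """Toy model of a switch that learns which port each address lives on."""

    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.table = {}  # maps source address -> port it was last seen on

    def receive(self, in_port: int, src: str, dst: str) -> list:
        """Return the list of ports the frame is forwarded to."""
        self.table[src] = in_port                 # learn where the sender is
        out_port = self.table.get(dst)
        if out_port is None:
            # Unknown destination: flood to every port except the one it came in on.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            return []                             # destination is on the same segment
        return [out_port]                         # known destination: one port only


switch = LearningSwitch(num_ports=4)
first = switch.receive(0, src="A", dst="B")    # B unknown, so the frame is flooded
reply = switch.receive(1, src="B", dst="A")    # A was learned on port 0
second = switch.receive(0, src="A", dst="B")   # B is now known on port 1
print(first, reply, second)
```

Flooding is one reason traffic can reach links that did not need it, which ties in with the overload risk noted above.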

Network protocol levels

The Open Systems Interconnection (OSI) Reference Model describes the levels of protocols on a network. It is useful because it applies to any kind of network scheme, including Ethernet/TCP/IP/LAN.

Figure 2 shows a model of a communication channel between two computers (or audio devices) on a simple Ethernet network. It depicts the different levels of service or operation, each of which has a clearly definable function. Effectively, the data is passed from one layer to the next.

The first three are Media layers, the most important for real-time audio networks:

Physical layer: Wires, circuits, connectors, hardware (bit layer).
Data link layer: In Ethernet, this is the frame. It provides the basic means of getting data around the local physical network (frame layer).
Network layer: Provides a means of addressing beyond just the local neighborhood. The protocol is called IP, or Internet Protocol, and allows addressing across the world. IP addresses are important; they are what allow us to send an e-mail to a colleague on the other side of the world, or access a website (packet layer).

The rest deal with segmentation, reassembly and the provision of higher-level applications such as e-mail, file transfer and HTTP.

Why change?

Why create an audio network like this when solutions such as star-quad, AES3 and MADI are already available?

There are several advantages. IT network infrastructures allow resources to be shared among many users. Ethernet-based audio networks offer similar benefits, making it possible to link geographically diverse audio interfaces and allow shared access and control. The technology is cheap to install and exploits readily available components such as physical layer devices, IP blocks, open-source operating systems and protocol stacks.

That said, audio and IT do have fundamentally different requirements. IT cares about integrity, with elegant coping mechanisms for traffic bursts and intermittent drops in service. Audio networks, on the other hand, require predictability, reliability and determinism. Audio networks derived from IT infrastructures therefore require tight control over the equipment used and how it is managed. The causes of unpredictable behavior must be eliminated, or mitigated with buffering and retransmission strategies.

For this reason, audio network protocols have been built on different OSI layers.

Layer 1 protocols

Layer 1 protocols use Ethernet wiring and signaling components, but they do not use the Ethernet frame structure. They use bespoke hardware designs rather than standard MAC components, together with Ethernet-capable physical-level drivers and receivers. Examples include A-Net by Aviom, SuperMAC and HyperMAC (formalized as AES50, originally developed by Sony Oxford), and MaGIC by Gibson (consumer space).

They can be cost-effective and reliable because they do not use the Ethernet frame structure. However, commercial Ethernet components such as switches, hubs or media converters cannot be used, so topology can be limited.

Layer 2 protocols

Layer 2 protocols encapsulate audio data in standard Ethernet frames. Most can make use of standard Ethernet hubs and can use a variety of topologies, such as stars, rings and daisy chains. Specific examples include:

CobraNet (originally by Peak Audio and now owned by Cirrus Logic) — designed primarily for large-scale audio installations such as convention centers, stadiums, airports, theme parks and concert halls;
EtherSound by Digigram; and
Hydra by Calrec (NB not Hydra2).

Layer 3 protocols

Layer 3 protocols encapsulate audio data in standard IP packets rather than MAC frames. This can be less efficient as the segmentation and reassembly (SAR) is more processor-intensive, which may mean fewer channels and higher latency or more expensive hardware. Examples include:

WheatNet — AoIP by Wheatstone;
Dante by Audinate; and
Q-Sys networking (Q-LAN).

Packing and routing

Limitations with these technologies are due in part to how switches respond to Ethernet frames, as well as the capacity of the links.

Packing efficiency needs to be managed: Each frame carries a fixed overhead, so putting one channel in a frame is inefficient, while putting many in is highly efficient. The worst case is about 5 percent (one sample, one channel), while the best case is about 98 percent. Packing more audio into each frame raises efficiency, but the trade-off is increased latency, since larger frames take longer to fill and to transmit.
Routing schemes: Ethernet communications are designed to be mostly single point to single point. The assumption is that one computer will want to talk to another computer or a printer, but not to the whole network. This is called unicast. It is possible, however, to send a packet to every network address, whether the recipient wants it or not (broadcasting), or to a group of addresses (multicasting). Consider a unicast network with six nodes as shown in Figure 3.

Node A wants to transmit a 5.1 signal to E. It can start sending unicast frames with six channels of data in them. But now B, C and D want the signal too, so A has to send four separate unicast streams. Since packing is inefficient, the network may reach its limit of, say, 40 channels. If a second 5.1 signal is required, the network can't cope.

Suppose A sends out two 5.1 signals in a multicast packet to all destinations. (See Figure 4.) They all get it. But then B wants to send two streams to E, and the segments connecting D and E become dangerously full. If further streams need to be sent, the whole network could be brought to its knees.

Different products have used various strategies to mitigate this, but they are all a compromise in the end, trading efficiency against accessibility against latency.
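The arithmetic behind the unicast example can be sketched in a few lines of Python. This assumes that the "limit of 40" refers to audio channels leaving the sending node, which is an interpretation rather than something the figures confirm, and the function names are hypothetical. The topology of Figures 3 and 4 is not modeled, so only the sender's link is considered.

```python
LINK_LIMIT_CHANNELS = 40   # assumed meaning of the "limit of, say, 40"
CHANNELS_PER_5_1 = 6       # a 5.1 signal carries six channels


def unicast_load(destinations: int, streams: int) -> int:
    """Channels the sender must transmit when every destination gets its own copy."""
    return destinations * streams * CHANNELS_PER_5_1


def multicast_load(streams: int) -> int:
    """Channels the sender must transmit when one copy is delivered to everyone."""
    return streams * CHANNELS_PER_5_1


one_stream = unicast_load(destinations=4, streams=1)    # B, C, D and E each get a copy
two_streams = unicast_load(destinations=4, streams=2)
shared = multicast_load(streams=2)

for label, load in (("unicast, 1 stream", one_stream),
                    ("unicast, 2 streams", two_streams),
                    ("multicast, 2 streams", shared)):
    status = "ok" if load <= LINK_LIMIT_CHANNELS else "over limit"
    print(f"{label}: {load} channels ({status})")
```

Multicast relieves the sender, but every segment then carries the streams whether its node wants them or not, which is how the link between D and E fills up in the second example.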

Audio network needs

These issues, however, can be minimized. For example, packing efficiency can be achieved by using variable packet sizes, between two and 42 channels, on a Layer 2 unicast network. The packing and unpacking of audio streams into MAC frames needs to be done in the host interface hardware.
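To put rough numbers on packing efficiency, the sketch below divides the audio bytes in a frame by the total bytes the frame occupies on the wire. It assumes standard Ethernet overheads (8-byte preamble, 14-byte header, 4-byte CRC, 12-byte interframe gap, 46-byte minimum and 1,500-byte maximum payload) and 4 bytes per sample; the sample size and the function names are assumptions, not details of any particular product.

```python
PREAMBLE = 8         # preamble plus start-of-frame delimiter
HEADER = 14          # destination, source and type fields
CRC = 4              # frame check sequence
INTERFRAME_GAP = 12  # idle time between frames, counted in byte times
MIN_PAYLOAD = 46
MAX_PAYLOAD = 1500


def packing_efficiency(channels: int, samples_per_channel: int = 1,
                       bytes_per_sample: int = 4) -> float:
    """Fraction of the bytes on the wire that are actual audio data."""
    audio = channels * samples_per_channel * bytes_per_sample
    if audio > MAX_PAYLOAD:
        raise ValueError("audio does not fit in one standard frame")
    payload = max(audio, MIN_PAYLOAD)   # short payloads are padded
    on_wire = PREAMBLE + HEADER + payload + CRC + INTERFRAME_GAP
    return audio / on_wire


def buffering_delay_ms(samples_per_channel: int, sample_rate_hz: int = 48_000) -> float:
    """Time spent waiting to collect the samples that go into one frame."""
    return 1000 * samples_per_channel / sample_rate_hz


for channels in (1, 2, 42, 375):
    print(f"{channels:3d} channels x 1 sample: {packing_efficiency(channels):.1%}")
print(f"48 samples per frame adds {buffering_delay_ms(48):.1f} ms of buffering")
```

Under these assumptions a single sample of a single channel comes out at roughly 5 percent and a completely full frame at roughly 98 percent, in line with the worst and best cases quoted earlier. Filling frames with many channels of one sample each, rather than many samples of one channel, is what lets efficiency rise without the buffering delay growing.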

Also, limits can be imposed on the number of consoles and I/O boxes. Much of the bandwidth is then undersubscribed most of the time, but the system can handle large demands, making resources effectively always available.

Third-party equipment must also be chosen carefully, because different switches introduce different amounts of delay: some variable, some fixed and others proportional to packet length. Choosing the right switches gives some control over this.

Continual evolution

To guarantee completely deterministic networks, some manufacturers can now use purely Layer 1 technology with gigabit signaling and physical-level drivers, without incorporating Ethernet frames at all. This allows up to 512 bidirectional 48kHz signals along Cat 6 cable or fiber, with sufficient space to carry non-audio data as well. Opting for 512 channels, with only one sample from each, results in high capacity and low latency, but this can only be done with manufacturers' proprietary hardware. (See the BSkyB case study on page 28.)
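Reading the figure above as 512 channels sampled at 48kHz, a quick calculation shows why a gigabit link leaves room for non-audio data. The bit depths used here (24 and 32 bits per sample) are assumptions, since the text does not state one, and the link is taken at its nominal 1Gb/s in each direction.

```python
GIGABIT_PER_SECOND = 1_000_000_000   # nominal link rate in each direction


def audio_bitrate(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw audio payload in bits per second, ignoring any framing overhead."""
    return channels * sample_rate_hz * bits_per_sample


for bits in (24, 32):   # assumed sample sizes
    rate = audio_bitrate(512, 48_000, bits)
    share = rate / GIGABIT_PER_SECOND
    print(f"512 channels at {bits} bits: {rate / 1e6:.1f} Mb/s ({share:.0%} of the link)")

# Sending one sample per channel in each transfer means a new transfer every sample period.
sample_period_us = 1_000_000 / 48_000
print(f"sample period: {sample_period_us:.1f} microseconds")
```

Even at 32 bits per sample the raw audio stays under 800Mb/s, and because only one sample per channel is gathered at a time, the buffering delay is on the order of a single sample period.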

This technology also offers other key benefits, such as 100-percent redundancy, ease of use (true Plug and Play networking to create ad hoc networks), and the ability to scale from single consoles to large, multi-console facilities with tens of thousands of signals.

Broadcast audio networks do not work if there is any lack of access to the resources needed, even if it is due to someone else's legitimate use of network resources. Bandwidth should not have to be reserved, nor should other users have to stop what they are doing. The network must not storm, flood or block if someone connects up a new piece of gear. It must have the predictability of a true TDM router.

Modern networks do, in fact, do this. Networking evolution has involved a move from the end points to the network, where the only limit, at this stage, is the network designer's imagination.

Henry Goodman is head of sales and marketing at Calrec Audio.