Achieving high availability for video programming

Service providers who have deployed next-generation video services on massive IP networks are striving for 99.999 percent reliability to enhance quality, be competitive and increase profitability. They are investing billions of dollars in delivery infrastructure by purchasing complex, advanced equipment such as routers, QAM modulators, ad insertion servers and splicers, multiplexers, passive optical network (PON) gear, and STBs. (See Figure 1.) While this equipment is expected to handle the demanding and unique nature of video content, the unfortunate reality is that manufacturers often don't test it for the stress of a video-centric network.

The performance of these devices is cumulative; i.e., one device's impairments add to the impairments from other devices in the long, winding path to the subscriber. (See Figure 2.) With the explosion of digital video, the need for video performance testing has reached a critical stage, and the network equipment manufacturers (NEMs) who provide these devices will need to play a key role.

Service providers have already recognized the need to monitor program availability performance after deployment to ensure their systems are operating as expected. Service providers will also need NEMs to provide per video program availability performance reports on all components and systems before deployment to ensure proper component selection and configuration.

This concept, supported by several industry experts, will improve video quality by verifying the quality of video component devices before deployment, and it will lead to increased revenues for all. Without such verification, the current ad-hoc approach to video testing will continue to produce transient operational issues with no accountability and no systematic method for improvement.

Recognizing these needs, the hybrid management sublayer (HMS) subcommittee of the SCTE standards organization has been developing specifications and practices for video system monitoring and equipment testing, with direction from both service providers and equipment manufacturers based on experience from current deployments.

Measuring the performance of video devices

While there are many metrics that are already gathered during network device testing, video payload availability is often ignored. Some video service providers now use measurements of per-program availability in evaluating the operation of their deployed equipment. They expect their systems will have 99.999 percent availability, which requires that programs be available for all but five minutes in a year. If any single device delivers only 99.000 percent availability, the provider's five-nines goal will be unattainable.
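The five-minute figure follows directly from the percentage. As a minimal sketch, assuming a 365-day year (the function name and the list of targets are illustrative, not taken from any standard), the downtime budget for several availability targets can be computed as follows:

```python
# Assumption: a non-leap, 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the unavailable seconds permitted by an availability target."""
    return period_seconds * (1 - availability_percent / 100)


# Illustrative targets: two, three, four and five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% availability allows {minutes:.1f} minutes per year")
```

Under these assumptions, five nines allows a little over five minutes per year, while a single device at 99.000 percent would by itself consume more than 87 hours.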

What's needed is a standardized program availability methodology that would give video service providers a way to measure the expected performance of their delivery networks and NEMs a way to measure the reliability of their devices on these networks. This approach would also provide a common language at a crucial point where the two industries meet with a measurement that shows the impact to subscriber experience.

Is program availability an appropriate measurement?

A group of industry experts was asked to comment on this approach. There is widespread agreement that delivering high-quality video services is challenging.

“Video services are critical to Comcast and becoming even more so as additional high definition channels become available,” said Charlotte Field, senior vice president, NETO infrastructure and operations, Comcast. Field said that poor video quality tends to frustrate customers, and that frustration is increased if the customer stays home for a half day for a service call that does not resolve the problem.

Stuart Elby, vice president of advanced technology networks, Verizon, pointed out that customers often call to complain about a problem that occurred sometime during the previous few days and that because the problem is not currently occurring, identifying the cause of the problem is “very, very difficult.”

Hung Nguyen, the HMS subcommittee chairman at SCTE, highlighted a fundamental challenge associated with delivering high-quality video services. He said that the industry has done a good job of delivering relatively high-quality analog video signals. However, he added that delivering a digital video signal is relatively new and is much more complex because it requires so many components, any of which could either fail or could introduce some form of degradation.

Another fundamental challenge in delivering high-quality video services is that the networking equipment deployed to support these services lacks the embedded management capabilities and product resiliency that customers are used to having in the traditional telecommunications environment.

“There is nothing built into routers that will give us meaningful management data about the quality of video services,” said Jerry Murphy, senior design specialist, TV assurance, Telus Communications. “Troubleshooting video quality based just on use of the command line interface (CLI) of a router is pretty much impossible.”

Achieving high availability is a two-part process

First, service providers must enable 24/7 monitoring of each program at multiple locations. Without continuous precision monitoring, they will not be able to measure program availability to five-nines granularity, much less guarantee it. Second, NEMs must test and certify their devices for five-nines availability before they are deployed in these video networks. The devices need to be tested under real-world conditions, over extended periods of time, using real video. Many of the network devices currently used across large deployments were never even tested with real video but are expected to keep pace with years of growth in video loads and evolving SD/HD/MPTS traffic mixes. Consider that even a simple link up and link down (link flap) can drop the program availability of all carried programs to four-nines. The prevailing ad-hoc approach to testing and measurement does not scale to service the needs of the video industry.

Recognizing these needs, the SCTE HMS subcommittee is currently developing specifications for video system monitoring and equipment testing with contributions from both service providers and equipment manufacturers based on experience from current deployments.

What exactly is availability?

Availability/unavailability status is defined in various network standards in different ways depending on the type of network. Traditionally, a service enters the unavailable state when its performance is highly degraded. Using this definition, if the service is slightly degraded, it is identified as available but with degraded performance. In some standards, the degradation must be completely removed before reentering the available state. Further, the criteria to enter a degraded or unavailable state may require the persistence of a degraded condition for a specified number of seconds. Likewise, leaving the degraded or unavailable state may require the persistence of no degradation for a specified number of seconds, and the durations to enter and leave the impaired states may not be equal. For example, the availability definition for general IP networks described in ITU-T Y.1540 bases availability on a threshold of IP loss ratio (IPLR) performance.
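The persistence rules described above amount to a small state machine with separate entry and exit timers. The following is a simplified sketch, not the procedure of any particular standard: the thresholds `enter_after` and `exit_after` are hypothetical, and the state changes at the end of a qualifying run rather than being back-dated to its start, as some standards require.

```python
def availability_states(degraded_flags, enter_after=10, exit_after=10):
    """Return one available/unavailable flag (True/False) per input second.

    degraded_flags: iterable of booleans, True when that second was degraded.
    enter_after: consecutive degraded seconds needed to become unavailable.
    exit_after: consecutive clean seconds needed to become available again.
    Both thresholds are hypothetical and need not be equal.
    """
    states = []
    available = True
    run = 0  # consecutive seconds that contradict the current state
    for degraded in degraded_flags:
        contradicts = degraded if available else not degraded
        run = run + 1 if contradicts else 0
        if available and run >= enter_after:
            available, run = False, 0
        elif not available and run >= exit_after:
            available, run = True, 0
        states.append(available)
    return states


# Two clean seconds, three degraded, two clean, one degraded.
flags = [False, False, True, True, True, False, False, True]
print(availability_states(flags, enter_after=3, exit_after=2))
```

With these illustrative thresholds, the isolated degraded second at the end does not change the state, because it does not persist long enough.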

Video service networks are different

As described in various references (TR-126 and the HMS draft “Recommended Practice for Monitoring”), even a single lost packet or lost second can cause a user-perceptible video and/or audio impairment. An errored second is any second that includes one or more lost program packets. An errored second may also include seconds in which other stream characteristics, such as out-of-order packets, duplicate packets or packet jitter, exceed a preset threshold. A subscriber may consider an errored second an unacceptable, highly degraded condition. With this criterion, a program is considered to enter an unavailable state for any errored second. This definition is recommended for many common types of video service networks.

Some quality assurance policies subject subscribers to potentially more severe and frequent impairments because they fail to set a specific, time-elapsed definition of network unavailability. Shorter programming failures of varying lengths, such as tiling, blocking or black screens, are a frequent problem because most monitoring policies are set up to detect only longer failures and outages. These policies are not useful for long-term quality improvement initiatives, because they do not consider the program to be unavailable during such short failures and therefore report a misleading duration of “no program errors.”

An errored second might also be defined such that a minimum count of packet loss events (loss period length) must occur before being tallied as an errored second.
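Both definitions, any loss versus a minimum count of loss events, can be expressed with one threshold parameter. This is a minimal sketch: the function name, the timestamp input format and the parameter `min_losses_per_second` are assumptions for illustration, and only packet loss is considered (not jitter, duplicates or out-of-order packets).

```python
import math
from collections import Counter


def errored_seconds(loss_timestamps, min_losses_per_second=1):
    """Return the set of whole seconds that count as errored.

    loss_timestamps: iterable of packet-loss event times, in seconds.
    min_losses_per_second: hypothetical policy threshold; the default of 1
    makes any second with at least one lost packet an errored second.
    """
    losses_per_second = Counter(math.floor(t) for t in loss_timestamps)
    return {
        second
        for second, count in losses_per_second.items()
        if count >= min_losses_per_second
    }


# Hypothetical loss events: two in second 0, one in second 5, three in second 7.
events = [0.1, 0.9, 5.5, 7.2, 7.3, 7.9]
print(sorted(errored_seconds(events)))
print(sorted(errored_seconds(events, min_losses_per_second=2)))
```

Raising the threshold reduces the errored-second count for the same stream, which is why the chosen definition should be stated alongside any availability figure.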

In any case, whatever specific definitions and acceptable threshold policies a particular service provider adopts, they should be simple in order to facilitate monitoring and evaluation by operational means.

Calculations and acceptable targets for availability

Availability indicates the number of per-program unimpaired seconds delivered by a device or system under test as a percent of evaluated seconds. (See Figure 3.) For example, for two impaired seconds of HBO (unavailable time) in a one-day period (measured time interval), see Figure 4.
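The definition above can be sketched as a single function. The function name is illustrative, and the formula is the same one used for the ESPN calculation later in this article.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def program_availability(unavailable_seconds, measured_seconds):
    """Return unimpaired seconds as a percent of evaluated seconds."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return (measured_seconds - unavailable_seconds) / measured_seconds * 100


# Two impaired seconds in a one-day measurement interval.
print(f"{program_availability(2, SECONDS_PER_DAY):.4f}%")
```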

The draft SCTE HMS Recommended Practice for Monitoring document and TR-126 suggest an acceptable per-program performance criterion of one errored second per four hours for HD and one errored second per hour for SD, with an equivalent minimum availability of 99.993 percent (four nines) and 99.972 percent (three nines) per day, respectively.
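The conversion from an errored-second rate to a daily budget and a minimum availability is simple arithmetic. A minimal sketch follows; the function names are illustrative, and the printed percentages are truncated to three decimal places.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def daily_errored_second_budget(hours_per_errored_second):
    """Errored seconds allowed in 24 hours at one errored second per N hours."""
    return 24 / hours_per_errored_second


def minimum_daily_availability(hours_per_errored_second):
    """Percent availability when the daily budget is fully used."""
    budget = daily_errored_second_budget(hours_per_errored_second)
    return (SECONDS_PER_DAY - budget) / SECONDS_PER_DAY * 100


def meets_target(errored_seconds, hours_per_errored_second):
    """True when a day's errored-second count is within the budget."""
    return errored_seconds <= daily_errored_second_budget(hours_per_errored_second)


for label, hours in (("HD", 4), ("SD", 1)):
    budget = daily_errored_second_budget(hours)
    truncated = math.floor(minimum_daily_availability(hours) * 1000) / 1000
    print(f"{label}: {budget:.0f} errored seconds per day, {truncated:.3f}% minimum")
```

This gives a daily budget of six errored seconds for HD and 24 for SD.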

Different providers may have different performance targets based on considerations such as available CAPEX and OPEX resources, age of plant and environment. Refer to your corporate policy for specific system requirements.

Some service providers are currently targeting five nines or 99.999 percent availability for their systems. Note that each deployed component must have higher availability than the target availability for the component ensemble. For example, if 10 devices are connected in series and each has 99.999 percent availability, the ensemble would have 0.99999**10 or 99.990 percent availability. Consider too whether the measured availability results of an equipment component will meet deployed system goals.
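The series calculation multiplies the individual availabilities. The sketch below assumes, as the example above does, that device impairments are independent; the function names are illustrative.

```python
import math


def series_availability(percents):
    """Availability (percent) of devices connected in series."""
    return math.prod(p / 100 for p in percents) * 100


def required_component_availability(target_percent, device_count):
    """Per-device availability needed for identical devices to hit a target."""
    return (target_percent / 100) ** (1 / device_count) * 100


# Ten devices at five nines each.
print(f"{series_availability([99.999] * 10):.4f}%")

# What each of 10 identical devices needs for a five-nines ensemble.
print(f"{required_component_availability(99.999, 10):.5f}%")

# One 99.000 percent device among nine five-nines devices.
print(f"{series_availability([99.0] + [99.999] * 9):.4f}%")
```

Under these assumptions, each of 10 identical devices needs roughly six nines for the ensemble to reach five nines, and one 99.000 percent device caps the ensemble below 99 percent.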

Availability by network region

While a subscriber may only really care about the availability of programs delivered to the STB, the provider needs to know the program availability at the headend, core and edge distribution locations to effectively direct resources for repair and system improvement. Providers have already deployed monitoring and reporting systems to collect needed real-time information for proactive system fault detection and isolation and data storage for trending. For these systems, availability report generation over time intervals of days or weeks by network region is a straightforward calculation using stored impairment data. Availability by region provides the information critical for effective repair dispatch and prioritizing system upgrades.
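As a rough sketch of such a report calculation, assume the monitoring system stores one record per errored second as a (location, program, second) tuple. That record layout, the function name and the sample data are all hypothetical.

```python
from collections import defaultdict


def availability_report(impairments, monitored, measured_seconds):
    """Return percent availability for each monitored (location, program).

    impairments: iterable of (location, program, errored_second) records.
    monitored: iterable of (location, program) pairs that were measured,
        so that error-free programs still appear in the report.
    """
    errored = defaultdict(set)  # a set ignores duplicate records
    for location, program, second in impairments:
        errored[(location, program)].add(second)
    return {
        key: (measured_seconds - len(errored[key])) / measured_seconds * 100
        for key in monitored
    }


# Hypothetical stored impairment data for one day.
records = [
    ("headend", "ESPN", 3600),
    ("edge-hub-1", "ESPN", 3600),
    ("edge-hub-1", "ESPN", 7200),
]
monitored = [("headend", "ESPN"), ("core", "ESPN"), ("edge-hub-1", "ESPN")]
for key, percent in availability_report(records, monitored, 86_400).items():
    print(key, f"{percent:.4f}%")
```

Comparing the figures location by location shows where impairments first appear, which is what directs repair dispatch.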

Systems without deployed end-to-end monitoring can begin by collecting automated per-channel availability statistics at the edge with a continuous real-time program monitor to represent subscriber experience. This will give the needed visibility to determine what availability the system is currently delivering — the first step needed for implementing a continuous improvement strategy.

Availability reports

Availability reporting is intended to reflect the user acceptability of delivered programs and indicates the availability of “good” program time.

An example report would typically include information shown below, including program name, measurement location if relevant and percent availability. The tested configuration, along with specifications about how the availability is calculated, should also be included. (See Figure 5.)

How to benefit from high-availability measurements

How can service providers achieve high availability (99.999 percent) on video services? The first step is to measure video programs 24/7 at key areas across the live system so per-program statistics can be collected. Next, compartmentalize availability into three key areas:

Program availability out of the headend or any video origin point;
Program availability across the wide area distribution system to each hub or drop site; and
Program availability on the last mile network, post QAM or DSLAM devices.

These key areas allow the service provider to understand how each part of the system is contributing to video service quality. For example, if ESPN is measured for one day (24 hours or 86,400 seconds) and several events occur:

The headend encoder has three seconds of audio dropouts across the day;
The core network drops five packets in five separate seconds across the day; and
The QAM dropped a video PID for two seconds across the day.

The program availability for ESPN as delivered to a customer that day is: PA = ((86,400 - (3+5+2))/86,400) * 100 = 99.988%. Ten seconds of impairment causes the program availability to drop to three nines. The SCTE draft “Recommended Practice for Monitoring” suggests that a program should have no more than six (HD) to 24 (SD) seconds of errors in a 24-hour period. With 10 errored seconds, this example would meet the SD criterion and be OK for an SD program, but it would exceed the six-second limit for HD.

Service providers need to measure for this so they know what the availability of their service is. Many service providers have no idea how good or how bad their systems are. Frequently, OPEX is spent trying to improve system quality with no feedback mechanism as to how good the results are. Without compartmentalizing the measurement, there is no way to know which systems need improvement. Take the ESPN example:

The headend's program availability is 99.996 percent, and the issue is fixed by looking at audio in the encoder. (The specific fault isolation is key to improving systems, a benefit of simultaneous, live measurement.)
The core network's program availability is 99.994 percent.
The QAM's program availability is 99.997 percent.

Clearly, the program availability figure is cumulative; i.e., headend errors add to the errors of the network and both add to the QAM and down to the last mile into the home network and STB availability. (Errors that happen during the same second in multiple systems due to the same cause do not get counted twice.)
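Representing each segment's errored seconds as a set makes both points concrete: the end-to-end figure is the union of all segments, and a second that is errored in more than one segment is counted once. The second-of-day values below are hypothetical; only the counts (three, five and two) come from the ESPN example.

```python
SECONDS_PER_DAY = 86_400


def availability(errored_seconds, measured_seconds=SECONDS_PER_DAY):
    """Percent of measured seconds that were not errored."""
    return (measured_seconds - len(errored_seconds)) / measured_seconds * 100


# Hypothetical second-of-day indices for each segment.
headend = {3600, 3601, 3602}               # three seconds of audio dropouts
core = {7200, 14400, 21600, 28800, 36000}  # five separate seconds with loss
qam = {50000, 50001}                       # two seconds without the video PID

end_to_end = headend | core | qam  # set union: cumulative errored seconds

for name, errored in (("headend", headend), ("core", core),
                      ("QAM", qam), ("end to end", end_to_end)):
    print(f"{name}: {len(errored)} errored seconds, {availability(errored):.4f}%")

# If a core loss fell in a second the headend had already impaired,
# the union counts that second only once.
core_with_overlap = core | {3600}
overlap_total = len(headend | core_with_overlap | qam)
print(f"with one shared second: {overlap_total} errored seconds")
```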

Consider another simple example: A link is lost in a single router at a headend that serves 250,000 homes and carries 300 live video programs, and the router takes 25 seconds to restore the link. The program availability for all 300 programs that day would be 99.971 percent, assuming there are no other errors for the rest of the day.

Delivering high availability begins before program delivery

High-availability capability of all of the components of a live video system is critical to delivering high-availability end-to-end service. Considering how many devices are in a system, how many software updates there are across the year and the number of new services being rolled out, how can a service provider be expected to deliver 99.999 percent availability or any other high availability figures, much less make improvements unless they have accurate measurements?

Service providers are not alone in needing these measurements. Equipment manufacturers, including encoder vendors, VOD vendors and router vendors, need to verify availability to account for the complexity of video. For example, some router manufacturers do not even test with live video at the volumes seen at the service provider. The first time some of these systems ever see thousands of live videos in 10Gb loads is at the service provider after being deployed. Manufacturers of such equipment components must be sure that the test beds in their QA test labs, engineering labs, and manufacturing and test labs include data and voice test loads, as well as live video in realistic volume. They must also ensure that availability is measured the same way service providers measure it when monitoring deployed systems. They can also produce program availability reports as part of the handoff of equipment and software from the OEM to the service provider. If the equipment manufacturer can only achieve two nines (99.000 percent) or three nines (99.900 percent) under long-term, normal operation with configurations that mimic service provider networks, then the service provider will at least know that this is the best they can do with the system being deployed. They will also know what is causing the loss in availability, so they can determine in advance how to handle issues before customers call in with quality complaints and OPEX soars. Or, given the test results, the provider may simply choose to deploy a device better suited for video service delivery.

If availability tests are completed in a lab environment, the results may be better than those of a field-deployed system, which is subject to less predictable sources, physical interconnect stresses, environmental stresses such as temperature and humidity, power line transients, and human errors. For this reason, the acceptance criteria for these tests may need to be set somewhat higher than might otherwise be considered.

Of course, other tests should also be executed to complement these baseline tests before final equipment selection and deployment. Such tests would typically include, but are not limited to, the intended load levels; number, type, and speed of active ports; level and type of nonvideo converged traffic expected; forwarding protocols; and management protocols, as well as common voice and data tests that are expected in the operational environment.

Conclusion

Service providers are deploying continuous, real-time per-program quality assurance solutions for their IP distribution systems, which creates the need for video device vendors to upgrade their testing suites. Delivering high-availability systems that handle the unique requirements of video payloads depends on each component of the system being up to the task. In many cases, however, today's components have not been verified as being able to deliver the high availability needed, which dooms providers' goals before they get started. Service providers, who are under increasing competitive pressure to improve quality, need new equipment to be tested for program availability with real video loads and with common operational impairments. Designing and deploying a high-availability delivery system begins with the selection of components that have a tested and proven high availability.

Jim Metzler is an independent telecom consultant for Metzler and Associates.