Proactive workflow monitoring

Television facilities are facing tough challenges, not only as a result of rapid and continuous change in the business environment, but also because both the quantity and the complexity of the equipment in facilities today are greater than ever. As requirements for facilities' redundancy, performance and the number of mission-critical applications increase, so does the complexity of their overall architecture. This significantly adds to the strain of monitoring workflows.

Facility workflow administration has long been reactive. While this approach may seem sufficient, the industry has started to realize that it is detrimental to the business. Hence the need to anticipate and correct potential problems in the workflow before they reach a crisis point.

Visibility: the key to workflow monitoring

A broadcast workflow represents video content as it flows through the equipment, people and environment required to support each step of the business cycle. Typical operations include acquisition, screening, play-to-air, reconciliation, archiving and various intermediate tasks.

Workflow monitoring is driven by an initial business question: How can the workflow be managed to minimize, if not eliminate, downtime?

This line of inquiry raises additional business questions, such as:

  • Are performance issues with mission-critical business applications caused by problems with the workflow?
  • Is the workflow performance meeting its users' expectations?
  • How does one measure the workflow performance against the service level agreements (SLAs) between the broadcaster as a service provider and end users such as advertising agencies?
  • How does one know if the enterprise investment is being maximized?

These questions identify the need to routinely collect and record measurements from various components that comprise the television workflow.

These business questions thus lead to technical questions, such as:

  • What kind of data is to be tracked, and from what kind of components?
  • What data values define negative performance?
  • How can this information be used to predict when to increase specific resources in the workflow or how to better balance the use of the resources in the workflow?
  • What are the exact response times and latencies that the workflow is experiencing?

Understanding measurement

Today's television workflows are heterogeneous environments composed of a multitude of interconnected components from various vendors. These components typically expose their operational parameters as statistical information.

The challenge in gathering this information is the sheer number and diversity of the statistics associated with each component. It takes time and expertise to examine this deluge of data in order to troubleshoot problems and fine-tune workflow performance.

This is an inexact science for many television workflow administrators. Without a clear understanding of what to measure, many administrators are forced to blindly swap out components in an attempt to identify the problem source. This haphazard approach could be futile or, worse, exacerbate the issue, causing a major disruption before the source of the problem is even detected.

A related problem is that, in practice, not all workflow components are monitored, simply because of data saturation: so much data, most of which is not the least bit useful. This incomplete workflow monitoring is like playing Russian roulette. As long as none of the unmonitored components is the culprit, critical services survive. But when one of them causes a failure, the operations engineers have no way of managing the failure or understanding how it occurred. This means not only that critical services will go down, but also that the time to correct the problem can be greatly extended as operations tries to navigate through a complex veil of unmonitored components.

Taking measurement

Though the television industry does not have a formally defined principle for measurement, one can appreciate that when attributes of a component are being measured, the act of measurement itself further taxes the component's resources. It becomes imperative to ensure that this tax is minimal. For instance, if measurement data is continually logged to the same disk drive whose performance is being measured, the resulting performance values will be skewed no matter how lightweight the actual measurement instrumentation is.

This is one reason vendors typically expose “raw” statistics in their components: it minimizes the computational load on the components. These component-centric statistics then need to be combined in arithmetic expressions and evaluated into meaningful values that can be recorded. For instance, the number of used disk sectors on a video archive is meaningless by itself. One needs to divide it by the total number of disk sectors available and multiply by 100 before it is possible to determine whether the percent disk usage on the video archive is too high.
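As a minimal sketch of this kind of derivation, the following Python snippet (with illustrative, hypothetical values standing in for the raw statistics a vendor might expose) turns raw sector counts from a video archive into the percent-disk-usage figure that is actually worth recording:

```python
def percent_disk_usage(used_sectors: int, total_sectors: int) -> float:
    """Derive a meaningful metric (percent disk usage) from two raw,
    component-centric statistics exposed by a video archive."""
    if total_sectors <= 0:
        raise ValueError("total_sectors must be positive")
    return (used_sectors / total_sectors) * 100.0

# Raw values as they might be read from the archive's statistics
# interface; the numbers are illustrative only.
raw_used = 1_874_500_000
raw_total = 2_000_000_000
print(f"Percent disk usage: {percent_disk_usage(raw_used, raw_total):.1f}%")
```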

The other consideration when taking measurement is its frequency. Taking measurements at small regular intervals, such as every second, can potentially consume processing cycles on the component and the recording system such that the operation of both entities can be affected. Similarly, taking measurements at large intervals such as every hour causes all interesting variations to average into an unrepresentative and virtually useless measurement. A general rule would be to ensure that the act of measurement does not consume more than 10 percent of resources on both the component being measured and the recording system, while still providing useful information.
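One way to apply that rule of thumb on the recording side is to time a measurement pass and compare it with the polling interval. The sketch below is a simplified illustration: collect_sample is a placeholder for whatever statistics-gathering the real system performs, and the interval and limit are assumptions:

```python
import time

SAMPLE_INTERVAL_S = 60.0   # how often measurements are taken (assumed)
MAX_OVERHEAD = 0.10        # rule of thumb: no more than 10 percent

def collect_sample() -> dict:
    """Placeholder for the real measurement work: querying components,
    evaluating derived metrics and writing them to the recording system."""
    time.sleep(0.5)  # simulate the cost of one measurement pass
    return {"percent_disk_usage": 42.0}

def sampling_overhead() -> float:
    """Return the fraction of the polling interval consumed by one pass."""
    start = time.monotonic()
    collect_sample()
    elapsed = time.monotonic() - start
    return elapsed / SAMPLE_INTERVAL_S

overhead = sampling_overhead()
if overhead > MAX_OVERHEAD:
    print(f"Warning: measurement consumes {overhead:.0%} of the interval; "
          "lengthen the interval or collect fewer statistics.")
else:
    print(f"Measurement overhead is {overhead:.1%} of the interval.")
```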

When taking measurements, a set of readings is first recorded for a specific component while it functions under normal, controlled conditions. This reference point is typically called the baseline. When a component is first installed, its baseline readings are obtained before it actually goes online. Thereafter, the measurements give an idea of how the component performs under stress. When these measurements are compared with the baseline, one can estimate the change in the component's usage over time. This enables the user to predict when the workflow will be confronted with potential bottlenecks. IT refers to this collective proactive process as capacity planning.
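A baseline can be as simple as the mean and spread of readings taken under those controlled conditions. The sketch below, using made-up percent-disk-usage readings, captures such a baseline and expresses a later reading as a drift from it:

```python
from statistics import mean, stdev

def capture_baseline(readings: list[float]) -> dict:
    """Summarize readings taken under normal, controlled conditions
    before the component goes online."""
    return {"mean": mean(readings), "stdev": stdev(readings)}

def drift_from_baseline(current: float, baseline: dict) -> float:
    """Express a later reading as a change relative to the baseline mean,
    which is what capacity planning trends over time."""
    return current - baseline["mean"]

# Illustrative percent-disk-usage readings gathered before go-live.
baseline = capture_baseline([38.2, 39.1, 37.8, 38.5, 38.9])
print(f"Baseline mean: {baseline['mean']:.1f}%, "
      f"drift today: {drift_from_baseline(46.3, baseline):+.1f} points")
```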

Threshold instrumentation

While measurements are being regularly recorded, there are situations when the operations team needs to know that a threshold has been crossed or a particular situation has occurred, without constantly checking the values being recorded.

The threshold lets the administrator specify the point at which a measured value becomes interesting. The events it triggers provide a mechanism for alerting administrators to actual problems, or warning them of potential ones, provided the threshold values were judiciously selected.

In most cases, a threshold event requires no immediate action. For instance, a news editor could stream a large clip to the editing workstation and then trim or delete it a few minutes later after reviewing it; it would be inappropriate to act on the resulting temporary increase in disk usage. Such a threshold event should be classified as a mere warning and not acted upon directly by the staff.

The process of determining threshold values has often been informal and experimental rather than rigorous. This can lead to overly conservative settings, which increase the number of inappropriate alerts, or to aggressive or even arbitrary settings that may contribute to component failures.

Vendors typically specify static threshold values for their components. This is the easiest approach, using guidelines found in capacity planning textbooks.

But in a specific workflow setting, where the component's configuration and usage may differ from what was anticipated, those values may not make the thresholds meaningful. To ensure that thresholds reflect tighter constraints, a baseline measurement typically has to be observed first for the component being measured.

The thresholds can be adjusted by factoring in a tolerance band around the baseline readings, allowing for alerts based on measurements “outside the norm” but not necessarily above a strict, invariable threshold. Although this approach is labor-intensive (because each component has to be individually monitored and tweaked based on its specific configuration in the workflow), it aids significantly in long-term, proactive monitoring.
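One simple way to express such a tolerance band, assuming a baseline mean and standard deviation like those captured earlier, is to flag readings that fall more than a chosen number of standard deviations from the baseline rather than readings above a fixed absolute value:

```python
def outside_tolerance_band(reading: float, baseline_mean: float,
                           baseline_stdev: float,
                           band_width: float = 2.0) -> bool:
    """Return True when a reading lies outside a tolerance band of
    +/- band_width standard deviations around the baseline mean."""
    return abs(reading - baseline_mean) > band_width * baseline_stdev

# Illustrative baseline figures for one component's percent disk usage.
for reading in (38.4, 39.2, 52.7):
    if outside_tolerance_band(reading, baseline_mean=38.5, baseline_stdev=0.5):
        print(f"{reading:.1f}% is outside the norm for this component")
```

The width of the band is itself a per-component tuning decision, which is part of why the approach remains labor-intensive.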

Analytics can also be used in conjunction with thresholds to make them more rigorous. For instance, in the earlier example of the temporary disk usage condition on the news editing workstation, the administrator is concerned about the disk filling up and staying full. Therefore, setting the threshold alert to trigger only after two consecutive readings are observed above the threshold helps reduce the number of false alarms.
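A minimal sketch of that rule follows, using a small in-memory counter; the threshold value and the number of consecutive readings required are assumptions chosen for illustration:

```python
class ConsecutiveThreshold:
    """Raise an alert only after N consecutive readings cross the threshold,
    so short-lived spikes (such as a temporarily streamed clip) are ignored."""

    def __init__(self, threshold: float, required_consecutive: int = 2):
        self.threshold = threshold
        self.required = required_consecutive
        self.count = 0

    def update(self, reading: float) -> bool:
        """Feed one reading; return True when the alert condition is met."""
        if reading > self.threshold:
            self.count += 1
        else:
            self.count = 0
        return self.count >= self.required

alert = ConsecutiveThreshold(threshold=90.0, required_consecutive=2)
for usage in (95.0, 72.0, 93.0, 96.0):  # the first spike clears before repeating
    if alert.update(usage):
        print(f"Disk usage alert: {usage:.0f}% on two consecutive readings")
```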

Presenting measurements

Measurements are used at every stage in a television workflow's evolution. These stages include workflow requirements, architecture, design, implementation, routine maintenance and upgrades.

Consumers of measurement information include troubleshooters, the help desk, operations engineering, SLA administrators, financial planners and system administrators.

Troubleshooters need both real-time and historic measurements to diagnose the root cause of user complaints. If problem symptoms are active, they use the real-time view. If the symptoms have passed, then they use historical data.

The help desk staff needs to view capacity or usage and error information as it collects basic information about a user complaint.

The operations organization needs metrics to demonstrate compliance with SLAs. Each SLA end user is a special interest group with specific needs, and an SLA is an agreement between that group and the operations organization about the expected level of service.

The engineering staff needs operational metrics to validate workflow changes and upgrades. The measurement data is often an input to a more comprehensive capacity planning process, which validates that a given facility design provides the necessary workflow performance.

Financial planners use historic performance data to demonstrate to management the revenue needed to upgrade the facility. A properly presented line graph depicting historical usage plus a credible forecasting technique can ethically and accurately depict the urgency for an upgrade.

Enough historical measurements need to be retained to cover the following periods of the workflow: busiest hour of the day, busiest day of the week, busiest day of the month, busiest day of the quarter, busiest day of the year and busiest day of a special event. This enables checking whether the workflow utilization is comparable with what was seen at a similar time in the past.
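One modest way to make that retention policy explicit, with durations that are purely illustrative assumptions, is to keep named windows long enough to cover each comparison period:

```python
from datetime import timedelta

# Named retention windows covering the "busiest" periods listed above;
# the exact durations are illustrative assumptions.
RETENTION_WINDOWS = {
    "busiest hour of the day":        timedelta(days=2),
    "busiest day of the week":        timedelta(weeks=2),
    "busiest day of the month":       timedelta(days=62),
    "busiest day of the quarter":     timedelta(days=190),
    "busiest day of the year":        timedelta(days=400),
    "busiest day of a special event": timedelta(days=800),
}

def minimum_retention() -> timedelta:
    """History that must be kept so every comparison period is covered."""
    return max(RETENTION_WINDOWS.values())

print(f"Retain at least {minimum_retention().days} days of measurements")
```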

Summary

The television industry's transition to a multichannel, multi-platform digital environment requires a new way of managing workflows. Using baselines as a reference point from which to track trends and set thresholds will continue to gather steam as television networks realize that the shifting competitive environment requires closely monitoring relationships between components.

By employing an effective monitoring system to address the television workflow's technical issues, the workflow-related business concerns are addressed as well. Downtime is minimized, and administrators can be confident that the workflow continues to meet expected performance without degrading other applications.

Mohit Tendolkar is a software design engineer and Northon Rodrigues is an engineering manager for Grass Valley. Stephanie Bishop is a technical writer for Systems Application Software.