Even a small change to one piece of equipment could
ignite a chain reaction down the system line.
To a broadcaster, the word “transition” can mean almost anything — activating a new cell phone, moving from one building to another or undertaking a major system upgrade.
Perhaps you don't worry about transitions. Maybe despite never making detailed plans for changes to your operation, you have been lucky enough to avoid disaster. Even so, chances are that sooner or later, you'll encounter a system change with the potential to create substantial problems.
This article explores some of the situations broadcasters could encounter, particularly in dealing with today's server-based, IT-centric systems. A seemingly simple software upgrade or transition can turn into a major undertaking fraught with potential disasters. More importantly, you'll get some practical advice on how to pave the way to a smooth upgrade from two people who've been there.
“No problem,” says the equipment manufacturer. “Everything will go just fine. The software has been fully tested and is working perfectly in other locations.”
Even for supposedly simple transitions, whenever you hear, “No problem,” “It's just a simple upgrade,” “It won't take long at all,” or “Don't worry,” beware. It's time to take a good look at what you're about to undertake and plan the change carefully. Don't think that such problems only happen to other people.
Let's take a look at some of the risks in a system upgrade and how to mitigate as many of them as possible by following a careful transition process. Consider the example system in Figure 1 on page 80. The system uses two network switches: one for the broadcast high-res system and one for the newsroom systems. It is good practice to keep the broadcast system separate from the office systems.
For now, let's assume that this system has been operational for eight months. During that time, there have been several known bugs and issues, some of which are fairly critical and others of a lower priority. Finally, a new software release arrives that claims to fix many of the bugs, including some of the critical issues the operations people have been putting up with and working around. It's time to plan a clean, trouble-free transition.
Read the instructions!
First read the release notes carefully. Compare the issues the new software fixes with the system's critical issues to ascertain which of these the new release addresses. Next, determine which of the other, less critical issues in your system the new release is supposed to fix. Then, ask these key questions:
Does the new software add or change any of the system's features or functionality?
If there is a change, will it affect the workflow?
If your answer to either of these is “yes,” you will need to communicate with the experts whose areas these changes impact. For instance:
If there's a potential impact on the system's high-res editors, you will want to discuss the change with the lead editor for this area.
If a change affects the metadata server and it affects archiving or the process by which you have been purging or searching for files, you'll want to talk with the operations lead for that function.
If the new software affects the database, you'll need a thorough analysis. A change that affects the database can also affect search procedures. That, in turn, can have an effect on any device that has access to the media and on how each device interfaces with the search function and uses data.
To keep our example simple, let's assume that there are no feature changes and no changes to the overall system workflow. This new release simply fixes bugs.
First things first
That makes everything pretty easy, doesn't it? Maybe. But take a closer look, and you may find additional hurdles.
Because the edit interface is affected in our example, the main file-system controllers in the SAN servers must be upgraded before the high-res editors can be. In addition, upgrades to the ingest and playback servers and the database server cannot be undertaken until the SAN servers have been upgraded. Furthermore, another change — this one to the database server software — must be completed before the I/O ports and the high- and low-res editors can be brought back into service.
Still another issue to address: This upgrade affects both software and hardware. For example, you'll need to increase the RAM in the SAN servers to support the software upgrade.
What at first appeared to be a simple upgrade, and one that may well be quite logical and fairly simple to actually implement, has turned out to be more complicated. It will be time-intensive and must be carried out in a precise order, or it won't fix the problems it is meant to address. In fact, doing things in the wrong order could introduce more issues.
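These ordering constraints are exactly the kind of thing worth writing down explicitly before upgrade night. As a minimal sketch, the dependencies described above (with hypothetical device names standing in for your actual equipment) can be captured as a graph and sorted into a valid upgrade sequence:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map for the example upgrade: each key may be
# upgraded only after all of its listed prerequisites are complete.
prerequisites = {
    "san_file_system_controllers": [],
    "database_server":             ["san_file_system_controllers"],
    "ingest_servers":              ["san_file_system_controllers"],
    "playback_servers":            ["san_file_system_controllers"],
    "io_ports":                    ["database_server"],
    "high_res_editors":            ["san_file_system_controllers", "database_server"],
    "low_res_editors":             ["database_server"],
}

# static_order() yields every prerequisite before the devices that need it,
# and raises CycleError if the plan accidentally contains a circular dependency.
order = list(TopologicalSorter(prerequisites).static_order())
print(order)  # the SAN controllers will always come first
```

The point of the exercise is less the sorting itself than the discipline of naming every dependency: anything not in the map is, by definition, something you haven't thought about.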
Ready to go? Not quite!
Now that we've peeled back the layers, it's clear that this simple upgrade touches critical components across the core of your system. If any part of the upgrade runs into a problem, it's possible that the database or even the entire online content could be corrupted. No wonder the chief engineer and operations managers are losing sleep!
Minimizing risk is no simple task, but it is far easier than recovering from a disaster such as a total wipeout of the database or online content. The first step is to make a prioritized list of all the processes affected by the upgrade. That makes it easier to ensure that processes that must be changed before others are scheduled at the right point in the upgrade.
Second, break the list down further by detailing the step-by-step tasks for each process. Make sure that each task within each process falls where it needs to in the overall upgrade schedule.
Let's assume that every product category within the broadcast network requires an upgrade. Therefore, every device in the system needs to be placed on the task list. It's important to keep in mind that even though the upgrade process is the same for like devices, each unit needs to be in the task list to ensure proper time allocation.
Even in a relatively simple upgrade, there is a lot to do, and performing the upgrade will mean taking the entire system offline for some period of time. Typically, this requires working through the night when the system workload is light. And, of course, the system absolutely has to be back up and running reliably by the predetermined deadline.
Now that there is a prioritized task list of every step, it is time to take another look for potential risks throughout the upgrade process. Can the list be further broken down into essential tasks? If so, continue to revise the task list until every step is clearly defined and in the right order.
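One concrete payoff of listing every device as its own task is that you can sanity-check the total time against the overnight maintenance window before committing to a date. A rough sketch, with purely hypothetical task names and time estimates:

```python
# Hypothetical per-task time estimates (minutes). Every device gets its own
# entry, even when the procedure is identical for like devices, so that the
# time allocation reflects reality.
tasks = [
    ("backup metadata database",    45),
    ("upgrade SAN controller A",    30),
    ("upgrade SAN controller B",    30),
    ("add RAM to SAN servers",      40),
    ("upgrade database server",     35),
    ("upgrade ingest server 1",     20),
    ("upgrade ingest server 2",     20),
    ("upgrade playback server",     20),
    ("upgrade high-res editor 1",   25),
    ("upgrade high-res editor 2",   25),
    ("test and verify core system", 60),
]

window_minutes = 8 * 60  # e.g. a 10 p.m. to 6 a.m. overnight window
total = sum(minutes for _, minutes in tasks)
margin = window_minutes - total

print(f"planned: {total} min, window: {window_minutes} min, margin: {margin} min")
```

If the margin comes out thin or negative, that is the moment to split the work across two maintenance windows, not 3 a.m. on upgrade night.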
The point of no return
Now, it's time to go back through the list, identify the key critical points and add tasks for test and verification along the way. It's also important to add contingencies to handle potential issues that may be revealed through testing and verification. Also, evaluate the entire process and identify the point of no return — the point at which you must decide whether to halt or continue with the upgrade.
The plan also needs to include time to back up essential data such as the metadata database. If there's even a possibility that other data, such as online storage, might be corrupted during the process, build time into the process for backup of that data, too. Backup needs to be scheduled and completed immediately prior to taking the system offline. It is far better — and less disruptive — to back up data than to deal with the loss of key data after a system fails.
As part of contingency planning, allow time for a worst-case scenario that would require porting the backup data back onto the system. Above all, build in test cycles at all appropriate points in the process so that you do not proceed to the next interdependent step or task until you are sure that the new component is doing its job properly.
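The point of no return is worth computing explicitly rather than judging on the fly. As a simple sketch, assuming a hypothetical 6 a.m. on-air deadline and a rollback estimate of two and a half hours, the decision time falls out of straightforward arithmetic:

```python
from datetime import datetime, timedelta

# Hypothetical schedule: the system must be back online by 06:00, and
# reverting to the original configuration (restoring the backup, rolling
# back software) is estimated to take 2.5 hours.
deadline = datetime(2024, 1, 15, 6, 0)        # hard on-air deadline
rollback_time = timedelta(hours=2, minutes=30)
safety_margin = timedelta(minutes=30)          # buffer for surprises

point_of_no_return = deadline - rollback_time - safety_margin
print(point_of_no_return.strftime("%H:%M"))    # the go/no-go decision time

def must_revert(now: datetime) -> bool:
    """True once it is too late to keep troubleshooting: revert immediately."""
    return now >= point_of_no_return
```

Agreeing on this number before the upgrade starts removes the temptation to keep troubleshooting "just ten more minutes" while the rollback window quietly closes.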
You won't know for sure whether the entire system works properly until all the interdependent upgrades are complete. However, testing and verifying along the way offers more assurance that the upgrade will come together and require only cleanup tweaking to operate correctly.
One last point
Because there are multiple system components, such as editors, ingest ports and decoders, save time by upgrading and testing several of these before bringing up the core of the system. If you don't find any issues after testing a few of each of the peripheral devices, proceed with upgrading the remaining components and move forward with core system testing.
Earlier, we established a point of no return. This is a very important milestone in the upgrade process, particularly if trouble happens along the way. For example, what happens if, after adding the required RAM upgrade for the SAN server, the server will not boot? What if the added RAM does not pair well with the RAM already in the server? The first thought is to scramble for additional RAM, but the clock is ticking away. Is there time?
If, while trying to troubleshoot this issue, the predetermined point of no return arrives, it's time to revert to the original configuration and schedule a new time for the upgrade — with the needed RAM ready and at hand. Do not make the mistake of going forward. To do so could spell disaster for your system — missed deadlines or, even worse, a crippled, inoperable system.
While the balance of the peripheral equipment is being upgraded, it's time to start a series of end-to-end tests and the shakedown of the core system components. Once every component upgrade has been tested individually, it is a good practice to perform a complete system worst-case load test to ensure that the upgrade process hasn't introduced restrictions in your system's capacity.
If the system performs as it should, the next step is to hand the system to operations to test again. If no major issues are revealed at this point, the system can be handed back over and put back online.
During the test cycles, confirm that the issues listed as fixed in the new release really have been fixed. If not, record discrepancies and report them so they can be addressed in a future release. Often, as an issue is resolved in a new release, it reveals other issues — hopefully issues that are less critical than the ones the release corrected.
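Tracking this verification is easiest with a simple checklist comparing what the release notes claim against what testing actually confirms. A minimal sketch, using hypothetical issue IDs:

```python
# Hypothetical issue IDs: what the release notes claim was fixed versus
# what the post-upgrade test cycles actually confirmed.
claimed_fixed = {"BUG-101", "BUG-117", "BUG-142", "BUG-150"}
verified_fixed = {"BUG-101", "BUG-142", "BUG-150"}
new_issues = {"BUG-201"}  # revealed during testing

# Anything claimed but not verified is a discrepancy to report back.
discrepancies = claimed_fixed - verified_fixed
report = sorted(discrepancies | new_issues)
print("report to vendor:", report)
```

Keeping the claimed and verified lists separate makes the vendor report mechanical rather than a matter of memory at the end of a long night.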
It bears repeating that careful planning for each step in the upgrade transition is the only way to protect your systems and make the transition as trouble-free as possible. So be careful, plan wisely, and mitigate risk by seeking help from experts who have been through the process before.
Michael Wright is president of IT Broadcast Solutions Group, and Brian Redmond is vice president of Broadcast Consulting Services.