Making the Cloud Invisible, Part 2

What does it take to make the cloud “invisible”? Put another way, in the context of an end user running a media-focused application, what parameters create an environment in which the user cannot discern whether the app is running locally or in a remote cloud? For example, when using a video editor app and doing the classic jog/shuttle function across the timeline, can a user tell by the “feel of the app” whether the runtime code is local or remote? If the app feels local in all aspects, then the cloud is invisible to the end user. For SaaS apps in particular it’s good to aim for this goal; users will demand it.

For sure, it’s not easy to create an invisible cloud environment. There are many aspects of Quality of Service (QoS) that determine the user experience. Part one of this series examined the QoS of cloud access across the Internet transport chain. This column examines compute, storage, and reliability; networking is out of scope for this discussion. Fig. 1 outlines the four domains of focus and loosely models the Infrastructure-as-a-Service (IaaS) cloud architecture. Let’s look at each of the three areas covered here:

CLOUD COMPUTE QOS
In cloud-speak, a CPU “Instance” provides a “predictable” amount of dedicated compute capacity. It may be charged per hour consumed or by some other means. An Instance comprises one or more virtual CPU cores, some DRAM, some local disk storage, and I/O. Instances range from small to massive, with the largest offering roughly 100x the power of a small one. Machine Images (MIs) are preconfigured with an operating system (Linux, Windows, or another) and run on CPU Instances.

Instance QoS determines the expected running performance. The two most valued metrics are benchmark results and jitter (execution time uncertainty). Here is an example of compute QoS for a file transcode:

Benchmark A: 10 GB file transcode from MPEG2 to H.264. The average execution time running on Instance Type X is 12 minutes.

Repeated tests running Benchmark A over different days and hours yield a minimum run time of 6 minutes and a maximum run time of 18 minutes, with a 95 percent confidence level. The 12-minute spread between the two is the jitter (uncertainty) of the test.
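To make the benchmark arithmetic concrete, here is a minimal sketch that summarizes a set of repeated run times. The function name `summarize_runs` and the sample values are hypothetical; the samples were chosen only so that they reproduce the 6, 12, and 18 minute figures used above. Jitter is taken here as the simple max-minus-min spread, which is an assumption; a real study would also compute percentiles to back up a confidence level.

```python
from statistics import mean


def summarize_runs(run_minutes):
    """Summarize repeated benchmark run times (in minutes)."""
    fastest = min(run_minutes)
    slowest = max(run_minutes)
    return {
        "mean": mean(run_minutes),
        "min": fastest,
        "max": slowest,
        # Jitter taken as the full spread between slowest and fastest run.
        "jitter": slowest - fastest,
    }


# Hypothetical run times for Benchmark A, collected on different days and hours.
runs = [6, 9, 12, 15, 18]
summary = summarize_runs(runs)
print(summary)
```

Running the same benchmark at several times of day and keeping the raw samples, not just the average, is what exposes the spread.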

Ideally, the jitter would be nearly zero, and it would be if the Instance were totally dedicated to your execution needs. In reality, the Instance shares the CPU hardware with other users in a virtualized environment. In addition, the load changes from day to day and hour to hour. So the fastest execution time may fall during a weekend night and the slowest time likely on Monday morning, for example.

You are probably thinking, “This is a problem.” True, it’s not ideal, but it is a tradeoff that comes with a shared hardware environment. Planning for worst-case times (18, not 6 minutes), adding more horsepower (faster Instances), or paying for dedicated resources are practical ways to set acceptable limits on the execution uncertainty. Users are advised to run their own benchmarks since cloud vendors rarely provide them. Without paying for dedicated resources, strict determinism is not easy to achieve in a cloud environment.
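As an illustration of planning for worst-case times, the sketch below estimates how many Instances are needed to finish a batch of transcodes inside a fixed time window. The function `instances_needed`, the batch of 20 files, and the 60-minute window are all hypothetical; the sketch also assumes each Instance runs one job at a time, back to back.

```python
import math


def instances_needed(job_count, minutes_per_job, window_minutes):
    """Instances required to finish job_count jobs within window_minutes."""
    # Whole jobs that one Instance can complete sequentially inside the window.
    jobs_per_instance = window_minutes // minutes_per_job
    if jobs_per_instance < 1:
        raise ValueError("a single job does not fit in the window")
    return math.ceil(job_count / jobs_per_instance)


sized_on_average = instances_needed(20, 12, 60)     # 12-minute average run
sized_on_worst_case = instances_needed(20, 18, 60)  # 18-minute worst-case run
print(sized_on_average, sized_on_worst_case)
```

Under these assumptions, sizing on the average calls for 4 Instances while sizing on the worst case calls for 7, which is the cost of guarding against jitter.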

CLOUD STORAGE QOS
Cloud storage is a bit of an elephant. Why? There are so many different types and uses of cloud storage and each has associated QoS metrics. First, there is the storage coupled to each Instance. Let’s skip this type since it is Instance bound. Other types are persistent (lives after an Instance is closed) and include:

• Block Storage. Provides block level storage volumes for use with Instances. The Instance can interact with the volume just as it would with a local drive, formatting it with a file system and installing applications.

• Object Storage. Reads and writes “unlimited” data objects to and from a repository. Each stored object is retrieved via a unique key value. These stores can be accessed for general file storage. Apps such as Dropbox and Google Drive use cloud-based object storage. OpenStack’s Swift is an example. This type makes a good extension for second-tier media facility storage.

• Archive Storage. Typically, for infrequently accessed data with retrieval times longer than for Object Storage. Higher access latency is traded off for less expensive storage. Amazon’s Glacier product is an example. This type is appropriate for offsite archiving with excellent durability.

The QoS metrics of interest are (1) R/W transfer rate, (2) latency for a single transaction and (3) I/O operations per second (IOPS). Each storage type will have different metrics. Obtaining QoS metrics from a cloud provider may be a challenge. Some vendors spec IOPS and transfer rates for select levels of storage while being silent on other (cheaper) kinds.

As with CPU metrics, benchmarking is a solid method to get a handle on storage QoS. Try then buy. Cloud providers will hard-spec some storage products but not all. So, users beware.
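A “try then buy” storage test can be quite small. The sketch below is a rough benchmark for a block volume mounted as a local directory; it reports a sequential write rate, small-block read IOPS, and average read latency. The function name `benchmark_volume` and its parameters are hypothetical, and the numbers it produces are only indicative: reads of a freshly written file are often served from the operating system cache, so a serious test would use a file larger than RAM or a dedicated benchmarking tool.

```python
import os
import tempfile
import time


def benchmark_volume(directory, total_bytes=8 * 1024 * 1024, block_size=4096):
    """Rough write rate, read IOPS, and read latency for a mounted volume."""
    block = os.urandom(block_size)
    blocks = total_bytes // block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Sequential write, synced so the timing includes the device itself.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_seconds = time.perf_counter() - start

        # Unbuffered small-block reads: each read call is one I/O operation.
        ops = 0
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            while f.read(block_size):
                ops += 1
        read_seconds = time.perf_counter() - start
    finally:
        os.remove(path)

    return {
        "write_mb_per_s": blocks * block_size / write_seconds / 1e6,
        "read_iops": ops / read_seconds,
        "avg_read_latency_ms": read_seconds / ops * 1000,
    }


# Example: test the system temp directory; point this at the mounted volume instead.
result = benchmark_volume(tempfile.gettempdir())
print(result)
```

As with the compute benchmark, repeating the test at different times of day shows how much the results move around.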

CLOUD RELIABILITY QOS
Compute and networking system reliability is often measured by uptime percentage. For example, a compute Instance’s uptime may be quoted as 99.95 percent. This means the Instance may be down for about 4.4 hours per year (0.05 percent of 8,760 hours). This is on par with, or better than, the reliability of an in-house data center. Storage reliability is measured by availability percentage and durability. Availability is similar to uptime. Durability is the integrity of the stored data. At least one cloud vendor offers object storage with 99.999999999 percent durability. It’s very difficult to achieve this level of durability using in-house means.
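The percentages above are easier to judge once converted into hours and objects. The following sketch does that conversion. The function names are hypothetical, a year is taken as 8,760 hours, and the durability figure is assumed to be an annual, per-object probability of survival; check how a given vendor defines it. The 10 million object count is purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(uptime_percent):
    """Hours per year a service may be down at the quoted uptime."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


def expected_objects_lost_per_year(object_count, durability_percent):
    """Expected yearly object loss, assuming durability is annual and per object."""
    return object_count * (100.0 - durability_percent) / 100.0


downtime = downtime_hours_per_year(99.95)
lost = expected_objects_lost_per_year(10_000_000, 99.999999999)
print(round(downtime, 2), lost)
```

For 99.95 percent uptime this gives roughly 4.4 hours of downtime per year, close to the figure quoted above. For eleven nines of durability, a store of 10 million objects would expect to lose about 0.0001 objects per year, or about one object every 10,000 years under the stated assumption.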

Beyond native reliability, good design practices may be applied to build systems with even higher reliability and durability. Load balancing, mirrored components, dual linking, error correction, and geographic diversity are all part of the reliability arsenal in the cloud.
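A quick way to see why mirroring helps is the standard formula for redundant components in parallel: the system is down only when every copy is down at once. The sketch below applies it. The function name `mirrored_availability` is hypothetical, and the result assumes failures are independent and failover is instant and flawless, which real deployments only approximate (geographic diversity is one way to get closer to independence).

```python
def mirrored_availability(single_availability, copies):
    """Availability of N mirrored components, assuming independent failures."""
    # The mirrored set fails only if all copies fail at the same time.
    return 1.0 - (1.0 - single_availability) ** copies


single = 0.9995  # one Instance at 99.95 percent uptime
pair = mirrored_availability(single, 2)
print(pair)
```

Under these idealized assumptions, two mirrored 99.95 percent Instances reach about 99.999975 percent, cutting expected downtime from hours per year to seconds per year.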

Understanding QoS is vital for successful use of the cloud. Maintaining target QoS metrics will result in services and apps that meet your business needs.

Al Kovalick is the founder of Media Systems consulting in Silicon Valley. He is the author of “Video Systems in an IT Environment (2nd ed).” He is a frequent speaker at industry events and a SMPTE Fellow. You can reach Al via TV Technology.