Storage primer

The first question that might come to mind on seeing the title of this article is: Why would I care to learn about storage? Isn't it a mundane component of the infrastructure that simply stores data? The reality is that of all the components of the infrastructure, storage is truly unique. Not only does it store some of the most valuable assets of the organization, its business-critical data, but unlike the other resources (network and compute), storage demand grows continuously along with the data the organization accumulates. If that were not convincing enough, it is worth mentioning that a lack of understanding of the storage options can lead to buying expensive storage hardware that does not optimally match needs. The world of storage is one of the most dynamic and hot technology areas today. This article provides a high-level overview of storage and the trends relevant to content distribution.

Storage types

The key thing to remember is that there are two dominant storage technologies: storage area networks (SAN) and network attached storage (NAS). (See Figure 1.) SAN is a block-level storage technology (fixed-size blocks of data, or collections of disk sectors, are read from or written to storage) in which storage devices are made accessible to servers in such a way that the devices appear locally attached to the operating system. A SAN typically has its own infrastructure connecting the storage devices, which are generally not accessible through the local area network by regular devices. By contrast, NAS is a file-level storage technology (entire files are read from or written to storage) connected to a local area network and providing data access to heterogeneous clients. This ubiquitous connectivity has made NAS popular as a convenient way of sharing files among multiple computers. Potential additional benefits of NAS include faster data access, easier administration and simpler configuration. In the end, price and performance are the main differentiators between NAS and SAN. The selection of one technology over the other comes down to deciding how much complexity is acceptable and what is needed to meet the performance needs of the application within the budget.

Storage system components

The main components of a storage system are the data containers, which typically are hard-disk drives (HDD) or solid-state drives (SSD). The drives distinguish themselves through capacity and read/write speed. Technology has enabled disk sizes to grow from megabytes to gigabytes and, today, to terabytes of data. Access speed is measured in I/O operations per second (IOPS); for hard-disk drives, the faster the drive spins, the higher the IOPS. Solid-state drives provide the highest performance (especially on reads, with less differentiation on writes), and that is naturally reflected in their price.
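As a rough illustration of why spindle speed matters, an HDD's peak random IOPS can be estimated from its average seek time plus rotational latency. The figures below are typical published values used only for illustration, not measurements from any specific drive:

```python
# Rough HDD random-IOPS estimate from mechanical delays.
# Illustrative figures only; real drives vary.

def hdd_iops(rpm: float, avg_seek_ms: float) -> float:
    """Approximate random IOPS: one I/O per (seek + rotational latency)."""
    rotational_latency_ms = 60_000 / rpm / 2   # half a revolution on average
    return 1000 / (avg_seek_ms + rotational_latency_ms)

print(f"7.2k RPM SATA : ~{hdd_iops(7200, 8.5):.0f} IOPS")
print(f"15k RPM FC/SAS: ~{hdd_iops(15000, 3.5):.0f} IOPS")
```

The roughly 2x gap between the two drive classes is why fast-spinning FC/SAS drives have traditionally been reserved for I/O-intensive workloads.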

The performance of the overall storage system also depends on the interfaces and protocols used to connect to the disks. In fact, disks are most often named after the connecting interface: FC drives, SATA drives or SAS drives. Fibre Channel (FC) is the cornerstone of SAN. Serial Advanced Technology Attachment (SATA) is a standardized interface replacing ATA and delivering higher speeds than its counterpart, Parallel ATA (PATA). Serial Attached SCSI (SAS) is a newer serial protocol that is compatible with SATA devices but much faster. From a selection perspective, it is thus important to understand what level of performance your application requires and select a cost-effective technology that supports it. For example, even though more expensive, Fibre Channel at 4Gb/s is going to be only marginally faster than a 3Gb/s SAS drive.
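To put the interface speeds in perspective, a back-of-envelope conversion from raw line rate to payload throughput can be sketched as follows. Both 4Gb/s FC and first-generation SAS use 8b/10b encoding, so each payload byte costs 10 bits on the wire; the comparison is illustrative only:

```python
# Back-of-envelope: raw line rate (Gb/s) -> payload throughput (MB/s),
# assuming 8b/10b encoding (10 wire bits per payload byte).

def effective_mb_per_s(line_rate_gbps: float) -> float:
    bits_per_byte_on_wire = 10  # 8b/10b encoding overhead
    return line_rate_gbps * 1e9 / bits_per_byte_on_wire / 1e6

print(f"FC  4Gb/s: {effective_mb_per_s(4.0):.0f} MB/s")
print(f"SAS 3Gb/s: {effective_mb_per_s(3.0):.0f} MB/s")
```

In practice, a single mechanical drive rarely saturates either link, which is why the price premium for the faster interface may not translate into application-visible gains.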

The disk arrays prevalent in today's enterprises are front-ended by purpose-built servers responsible for facilitating and optimizing disk access. These storage appliances distribute data over multiple disks and perform many other operational optimizations and management functions as well.

Storage system operation

When it comes to writing data to the disks, the simple option would be to place the entire data set on a single disk (assuming it fits). This approach has several drawbacks. First, I/O is not optimal because all reads and writes are serialized through one device; writing chunks of the data set to multiple disks in parallel is faster. Second, all data would be lost if that particular disk failed; there is no redundancy unless the data is duplicated to another disk. Finally, disk space cannot be used very efficiently when dealing with such large, monolithic blocks of data. Today, storage appliances are responsible for writing data across multiple disks while embedding various error-recovery mechanisms. They orchestrate the pool of disks into a redundant array of independent (or inexpensive) disks (RAID). Using this distributed method, cheaper, less reliable disks can be used without the fear of losing data.
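The striping idea described above can be sketched in a few lines, with Python lists standing in for disks (a toy model, not how a real appliance works):

```python
# Toy sketch of block striping: a data set is cut into fixed-size
# chunks and written round-robin across a pool of "disks" (lists).

def stripe(data: bytes, num_disks: int, chunk_size: int = 4) -> list[list[bytes]]:
    disks = [[] for _ in range(num_disks)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for i, chunk in enumerate(chunks):
        disks[i % num_disks].append(chunk)  # round-robin placement
    return disks

def reassemble(disks: list[list[bytes]]) -> bytes:
    # Read chunks back in the same round-robin order.
    out = []
    for i in range(max(len(d) for d in disks)):
        for disk in disks:
            if i < len(disk):
                out.append(disk[i])
    return b"".join(out)

disks = stripe(b"ABCDEFGHIJKL", num_disks=3)
assert reassemble(disks) == b"ABCDEFGHIJKL"
```

Because consecutive chunks land on different disks, reads and writes can proceed in parallel; what the sketch deliberately omits is the redundancy, which RAID adds on top.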

RAID's primary goals are to optimize input/output performance and provide reliability. Based on the techniques used in the process, configurations are identified as RAID 0 through RAID 6, where RAID 0 means data is block-striped without any parity or mirroring. At the other extreme is RAID 6, with block-level striping and double distributed parity. (See Figure 2.) It is important to note that there will always be a trade-off between usable disk space and the amount of redundancy gained by increasing RAID levels. The important takeaway is that good choices of disk types and RAID levels can significantly reduce costs. Returning to the earlier example, instead of expensive FC disks at RAID 5, one can use cheaper SATA drives at a higher RAID level.
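The parity mechanism behind single-parity RAID levels such as RAID 5 can be illustrated with XOR: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A minimal sketch:

```python
# Sketch of XOR parity, the idea behind single-parity RAID levels.
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks on three disks
parity = xor_blocks(data)            # parity block stored on a fourth disk

# Simulate losing the middle block and rebuilding it from the rest.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

This is why a RAID 5 set survives one failed disk: XOR-ing the remaining blocks with the parity recovers the lost one. RAID 6 adds a second, independently computed parity so that two simultaneous failures can be tolerated.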

Several protocols are used to access storage disk arrays over an enterprise network via the storage appliance (or filer, in the case of NAS). The main ones are iSCSI, in which the SCSI protocol is transported over IP; FCoE, discussed below; and NFS, a file-level protocol that runs on top of UDP or TCP over IP.

Trends in storage

One of the key recent technology innovations in storage consolidates infrastructure and reduces operational costs: the transport of Fibre Channel frames over Ethernet (FCoE). This is a block-level protocol that delivers the same reliability and security as Fibre Channel while using an underlying 10Gb/s Ethernet infrastructure. FCoE enables both local network traffic and storage traffic to traverse the same wire, which results in I/O consolidation.

An important area of optimization for storage is the elimination of duplicate information. With data continuously growing, the last thing we need is the same data in multiple locations under different names. Data deduplication is a specialized data compression technique that eliminates coarse-grained redundant data, typically to improve storage use. In the deduplication process, duplicate data is deleted, leaving a single stored copy along with pointers to it. The size of the savings depends on the workloads of the enterprise and the type of data.
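A minimal sketch of how deduplication can work, using content hashes as chunk identifiers (real products use far more sophisticated chunking and indexing):

```python
# Content-addressed deduplication sketch: each chunk is stored once,
# keyed by its hash; duplicates become pointers to that key.
import hashlib

store: dict[str, bytes] = {}   # unique chunks, keyed by digest
pointers: list[str] = []       # the "file": an ordered list of chunk keys

def write_chunk(chunk: bytes) -> None:
    key = hashlib.sha256(chunk).hexdigest()
    if key not in store:       # only the first copy is physically stored
        store[key] = chunk
    pointers.append(key)

# A station ident ("logo") repeated in the stream is stored only once.
for chunk in [b"intro", b"logo", b"scene1", b"logo", b"logo"]:
    write_chunk(chunk)

print(f"logical chunks: {len(pointers)}, stored chunks: {len(store)}")
# Every read follows the pointer back to the single stored copy.
assert b"".join(store[k] for k in pointers) == b"intrologoscene1logologo"
```

The ratio of logical to stored chunks is exactly the "savings" the text refers to, and it depends entirely on how repetitive the workload's data is.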


Further optimization of storage can be achieved through proper design. Instead of using a “one size fits all” setup, which has to support the highest performance needs, the design should tier storage, matching the right disk type and access protocol to each group of applications or services (typically three tiers). Recently, dynamic mechanisms and technologies for tiering have emerged. For example, fully automated storage tiering (FAST) moves data from one tier (slow/cheap) to another (fast/expensive) based on how often the data is accessed.
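A toy sketch of automated tiering in the spirit of FAST is shown below; the tier names, file names and promotion thresholds are invented for illustration and do not reflect any vendor's actual policy:

```python
# Toy automated-tiering policy: hot data is promoted to a faster tier,
# cold data is demoted. Thresholds and tier names are assumptions.

PROMOTE_AT = 10   # accesses per period that qualify for the fastest tier
WARM_AT = 3       # accesses per period that qualify for the middle tier

def retier(access_counts: dict[str, int]) -> dict[str, str]:
    """Map each object to one of three tiers based on access frequency."""
    placement = {}
    for obj, hits in access_counts.items():
        if hits >= PROMOTE_AT:
            placement[obj] = "ssd"    # fast/expensive
        elif hits >= WARM_AT:
            placement[obj] = "fc"     # middle tier
        else:
            placement[obj] = "sata"   # slow/cheap
    return placement

print(retier({"promo.mp4": 42, "news.mp4": 5, "archive.mp4": 1}))
# {'promo.mp4': 'ssd', 'news.mp4': 'fc', 'archive.mp4': 'sata'}
```

A real FAST implementation tracks access statistics continuously and migrates data in the background, but the placement decision it makes each cycle has this basic shape.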

Cloud storage for video content

Nowadays, cloud computing and cloud-based services are on the minds of technologists and business people alike. This overview would not be complete without discussing the cloud-related trends in storage. Object storage organizes data in flexible-sized containers (unlike block storage) along with metadata that helps not only in locating the data but also in applying policies to it. Compared with complex, difficult-to-manage, antiquated file systems, object storage systems leverage a single flat address space that enables the automatic routing of data to the right storage systems, specifies the content lifecycle, and keeps both active and archive data in a single tier with the appropriate protection levels. This allows object storage to align the value of data with the cost of storing it, without the significant management overhead typically created by manually moving data to the proper tier. Object storage is also designed to run at peak efficiency on commodity server hardware. More importantly, it provides the scalability necessary to support the on-demand capacity delivered by cloud storage.
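A minimal sketch of an object store's flat address space, with user metadata attached to each object; the content-derived key scheme and the tiny API here are assumptions made for illustration, not any particular product's interface:

```python
# Sketch of an object store: one flat key space (no directories), with
# each object carrying metadata that policies could act on.
import hashlib
import time

class ObjectStore:
    def __init__(self):
        self._objects = {}   # single flat namespace: key -> (data, metadata)

    def put(self, data: bytes, **metadata) -> str:
        key = hashlib.sha256(data).hexdigest()   # content-derived object ID
        metadata.setdefault("created", time.time())
        self._objects[key] = (data, metadata)
        return key

    def get(self, key: str):
        return self._objects[key]

store = ObjectStore()
key = store.put(b"<video frames>", content_type="video/mp4", retain_days=30)
data, meta = store.get(key)
assert data == b"<video frames>" and meta["retain_days"] == 30
```

Because lifecycle hints such as the hypothetical `retain_days` travel with the object itself, policy engines can act on the data wherever it is routed, which is what makes the single-tier, policy-driven model described above workable at scale.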

Cloud storage is changing the way companies think about storage in an era of runaway growth of unstructured data (video is a typical example) by enabling capacity on demand, among other benefits. Besides being simple and scalable, object storage makes data easier to search, and it enables administrators to apply policies that enforce data lifecycle and prioritization. Going forward, object storage will enable true cloud storage, the most efficient and cost-effective option for storing content. While the technology is relatively new, several cloud storage offerings are already available. The migration, however, should be carefully planned and implemented.

Ciprian Popoviciu is the director of the infrastructure/cloud group and Mohamed Khalid is chief architect at Technodyne.