Erasure Coding: The Generalized RAID

Erasure coding is the de facto standard for data resiliency in the era of HCI and cloud storage. Various acceleration techniques (GPUs, SSE instructions and FPGAs) are opening new frontiers in accelerated storage solutions, including erasure coding.

Data is the digital economy’s new currency. Important dimensions of the digital economy include growth, availability, speed and resiliency. The quantity of data and storage-device capacity are both on exponential trajectories. Availability is all pervasive now: on-premise, in the cloud and across the hybrid cloud. Speed is determined by artificial-intelligence and machine-learning applications, which require extremely fast IO speeds. Resiliency is the cornerstone of today’s digital economy.

Use of traditional storage-area networks (SANs) has been driven by OEMs that implemented RAID solutions using dedicated hardware and firmware for host-side and target-side use cases. Striping, mirroring and parity defined the RAID implementations providing various degrees of protection.

Given today’s more demanding requirements, however, RAID is proving to be inadequate. Cost of resiliency, recovery time and protection issues during the recovery process of traditional RAID are paving the way for alternatives.

Erasure coding (EC) is one such alternative, and it is distinctly different from hardware-based RAID solutions. EC is implemented in software as an algorithm, so it does not depend on any specific hardware. Essentially, EC breaks the data into fragments and augments, or encodes, the fragments with redundant information. It then distributes the encoded fragments across disks, storage nodes and locations. The redundant fragments are used to reconstruct data in case of failures. Reconstruction relies on polynomial interpolation, which is CPU-intensive and can cause latency issues.
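The encode-and-reconstruct cycle described above can be sketched in a few lines. This is a minimal illustration, not a production codec: it assumes a small prime field (GF(257)) for readability, whereas real implementations typically work in GF(2^8), and the function names `encode` and `decode` are hypothetical. The data symbols are treated as values of a polynomial at points 0..k-1, parity fragments are further evaluations of the same polynomial, and any k surviving fragments recover the data by Lagrange interpolation.

```python
P = 257  # assumption: small prime field for clarity; real codecs usually use GF(2^8)

def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, m):
    """Return k data fragments followed by m parity fragments as (index, value) pairs."""
    k = len(data)
    points = list(enumerate(data))
    parity = [(x, _interpolate_at(points, x)) for x in range(k, k + m)]
    return points + parity

def decode(fragments, k):
    """Rebuild the k original symbols from any k surviving (index, value) fragments."""
    if len(fragments) < k:
        raise ValueError("need at least k fragments to reconstruct")
    survivors = fragments[:k]
    return [_interpolate_at(survivors, x) for x in range(k)]

if __name__ == "__main__":
    data = [72, 101, 108, 108]            # four data symbols (k = 4)
    fragments = encode(data, m=4)         # eight fragments in total
    # Lose any four fragments; here only one original data fragment survives.
    survivors = [fragments[1], fragments[4], fragments[6], fragments[7]]
    print(decode(survivors, k=4) == data)
```

With k = 4 and m = 4, any four of the eight fragments are enough, which is the "half of the elements" case. The nested interpolation loops also hint at why reconstruction is compute-heavy.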

Unlike RAID, EC does not require a specialized hardware controller, and it provides better resiliency. Most important, EC provides protection during the recovery process. Depending on the degree of resiliency configured, complete recovery is possible from just half of the stored elements, in any combination.

Reed-Solomon (RS) codes, one example of maximum distance separable (MDS) codes, are the most prevalent codes in use today, deployed by the likes of Facebook, Google, Microsoft and Yahoo, to name a few. RS has two key performance measures: storage efficiency and fault tolerance. Storage efficiency indicates how much additional storage is required to assure resiliency, whereas fault tolerance indicates how many element failures can be survived. The two measures pull against each other: more fault tolerance reduces storage efficiency and vice versa.
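The tradeoff between the two measures is easy to see with a little arithmetic. The sketch below assumes the common convention of k data fragments plus m parity fragments, where an MDS code survives the loss of any m fragments. The specific parameter sets and the treatment of replication as a (1, 2) scheme are illustrative choices, not figures from any particular deployment.

```python
def storage_efficiency(k, m):
    """Fraction of raw capacity that holds user data in a (k data + m parity) scheme."""
    return k / (k + m)

def storage_overhead(k, m):
    """Raw bytes stored per byte of user data."""
    return (k + m) / k

def fault_tolerance(m):
    """An MDS code with m parity fragments survives the loss of any m fragments."""
    return m

if __name__ == "__main__":
    # Illustrative parameter sets (assumptions for the example).
    schemes = {
        "3x replication": (1, 2),   # one data copy plus two extra copies
        "RS(6, 3)": (6, 3),
        "RS(10, 4)": (10, 4),
    }
    for name, (k, m) in schemes.items():
        print(f"{name}: efficiency={storage_efficiency(k, m):.2f}, "
              f"overhead={storage_overhead(k, m):.2f}x, "
              f"tolerates {fault_tolerance(m)} failures")
```

For a fixed k, each added parity fragment raises fault tolerance by one and lowers efficiency, which is the tension described above.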

Hyperscale clusters in datacenters pose challenges for the data resiliency solution in terms of node failures and degraded reads. Low repair bandwidth and low repair degree separate the best EC from the rest.

Modern Resiliency Codes

Highly distributed storage clusters are commonplace for capacities ranging from terabytes to petabytes. These storage nodes, which can be at different geographical locations, can pose a challenge for EC, especially in the recovery phase, by way of node failures and degraded reads. The RS code has a high repair bandwidth and a constant repair degree, which triggered the evolution of modern resiliency codes. Coding theory responded to this challenge with two proposals: regenerating codes and locally recoverable codes.

Regenerating codes are often MDS codes, but importantly they are vector codes. The vectorization process splits codes into sub-parts, thus forming a vector rather than a single scalar value. Vectors are striped across storage locations to improve the repair bandwidth.

Locally recoverable codes (LRC) give up a little storage efficiency to improve the recovery process by reducing the repair degree. LRC uses MDS codes hierarchically, performing the encoding at multiple levels. These are scalar codes, sometimes called codes on codes.
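The effect on repair degree can be illustrated with a deliberately simplified sketch. It assumes one XOR parity per local group and omits the global (MDS) parities that a real LRC would also carry. The helper names are hypothetical.

```python
from functools import reduce

def xor_all(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def add_local_parities(data_blocks, group_size):
    """Split data blocks into local groups and attach one XOR parity to each group."""
    groups = []
    for start in range(0, len(data_blocks), group_size):
        members = list(data_blocks[start:start + group_size])
        groups.append({"data": members, "parity": xor_all(members)})
    return groups

def repair_in_group(group, lost_index):
    """Rebuild one lost data block by reading only the rest of its own group.

    Returns the rebuilt block and the number of blocks read (the repair degree).
    """
    helpers = [b for i, b in enumerate(group["data"]) if i != lost_index]
    helpers.append(group["parity"])
    return xor_all(helpers), len(helpers)

if __name__ == "__main__":
    blocks = [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee", b"ffff"]  # k = 6
    groups = add_local_parities(blocks, group_size=3)
    rebuilt, reads = repair_in_group(groups[1], lost_index=0)
    print(rebuilt == blocks[3], reads)
```

In this example a single lost block is rebuilt from three reads, whereas a plain RS code with six data fragments would read six. The price is the extra local parity per group, which is the small loss in storage efficiency mentioned above.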

Some extreme resiliency implementations work with 128 parity elements, compared with the usual practice of up to eight parity elements. Research in coding theory, of which distributed storage is one application, is vast and evolving.

Modern erasure code algorithms include local regeneration codes, codes with availability, codes with sequential recovery, coupled layer MSR codes, selectable recovery codes and others that are highly customized.

Erasure coding technology is poised for CPU offload, with acceleration candidates ranging from SSE instructions to GPUs, FPGAs and highly customized hardware.

Acceleration Aspects

Research on optimizing various aspects of EC is ongoing in academia, including at the University of Texas, Austin; the University of California, Berkeley; the University of Maryland; the Indian Institute of Science, Bangalore; the University of Tennessee; the Chinese University of Hong Kong; the Massachusetts Institute of Technology; and Nanyang Technological University, Singapore. Research is also underway at the likes of EMC, Facebook, IBM, Microsoft and NetApp.

Erasure codes are compute-intensive. In the era of data deluge and extreme availability requirements, it has become necessary to offload the computation from the main CPU. At the same time, datacenter hardware, whether virtual or bare metal, is increasingly likely to include other resources suitable for computation, such as GPUs and FPGAs.

Intel’s Intelligent Storage Acceleration Library (ISA-L) contains low-level functions highly optimized for the Intel architecture. Intel also offers QuickAssist Technology (QAT) to provide hardware acceleration. Intel claims to have achieved about a 40% improvement in the encoding process with ISA-L, which is based on Intel SSE, vector and encryption instructions.

One of the requirements of GPU-based acceleration is the vectorization of the EC algorithms. Some of the modern resiliency codes are vector codes. These vector approaches make it possible to leverage GPU cores and high-speed on-chip memory, such as texture memory, to achieve parallelism.

Another trend in EC offload is acceleration in the fabric. Next-generation host channel adapters (HCAs) are offering calculation engines, which may be GPUs. These implementations make full use of features like RDMA and Verbs. Encode and transfer operations are handled in the HCA. With RDMA, there is more acceleration for storage clusters.

Data resiliency, compression and deduplication are evolving at breakneck speed. Extremely low latencies of NVMe technologies, tighter integration of storage with application characteristics and newer virtualization options are opening new use cases for erasure coding.

Forward-Looking Observations

We foresee the following interesting trends in this space, some defined by application workload characteristics, some by a need for more processing power and others by the emergence of new storage device technologies.

Application Workload Dependent Resiliency. Way back when, the ATA standard introduced commands that prioritized the timely delivery of data over complete error correction. These commands suited video applications, for which meeting the frame rate mattered more than losing a few pixels in a frame.

Now, we are seeing similar application-specific infrastructure solutions. Modern datacenters are catering to various types of workloads. These workloads can demand different levels of resiliency. We believe that application workload-dependent resiliency will become part of workload provisioning. This means potentially two distinct workloads could be executing under different resiliency requirements; such a dynamic configuration would be addressed by datacenters.

Storage Technology Dependent Resiliency. For a certain period, both magnetic and various flash storage media will be the backbone of datacenter storage. In many HCI solutions, EC will be performed on cold data. The process of encoding in EC is generally read-modify-write, which causes write amplification on flash storage. Hence, many HCI vendors alert users to the tradeoff between storage efficiency and write amplification on all-flash storage. We foresee that the write amplification issue will influence EC algorithms and their implementations for all-flash storage. The target storage device technology would therefore become an influencing factor in the choice of EC for workload provisioning.
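The read-modify-write cost can be made concrete with a small sketch. It assumes a single XOR parity and a delta-based update, in which the parity is patched using the difference between the old and new data. RS parities follow the same pattern with a field multiplication per parity, and other strategies, such as full-stripe rewrites, have different costs. The function names are hypothetical.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def update_fragment(stripe, parity, index, new_data):
    """Overwrite one data fragment and patch the XOR parity (read-modify-write).

    Device operations: read old fragment, read old parity,
    write new fragment, write new parity.
    """
    old_data = stripe[index]                  # read: old data fragment
    delta = xor_bytes(old_data, new_data)     # what changed
    new_parity = xor_bytes(parity, delta)     # read old parity, compute new parity
    stripe[index] = new_data                  # write: new data fragment
    return new_parity                         # write: new parity

def small_write_cost(m):
    """Device reads and writes for a one-fragment delta update with m parities."""
    return {"reads": 1 + m, "writes": 1 + m}

if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
    parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])
    parity = update_fragment(stripe, parity, index=1, new_data=b"\xaa\xbb")
    print(small_write_cost(m=1), small_write_cost(m=4))
```

Under these assumptions, one logical fragment write turns into 1 + m device writes, so adding parity for fault tolerance directly increases wear on flash.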

Integration of EC with File System. The industry is responding to the problem of EC write amplification much as it did in the case of flash storage: by integrating EC requirements into file system design. For example, EC based on the Mojette Transform is tightly integrated into the RozoFS file system. This file system differentiates between sequential and random storage workloads and treats them differently from an EC point of view. In some implementations, write amplification is handled in the backend. BlueStore, the new backend for Ceph, promises to handle EC read-modify-write cases on flash devices in a better way. Similar approaches are being taken in HPC-specific storage solutions such as DDN storage.

Data Migration for Resiliency Optimization. Ceph supports EC profiles, which select the specific EC and other resiliency parameters. In the lifecycle of data, there are situations that require a change in the EC profile and hence data migration. Even in the case of HCI, OEMs are introducing newer EC profiles in new product lines. Therefore, it will be common for customers to find that their data sets are in different EC formats. The simplest use case would be migration from legacy RAID systems to EC systems. The current solution is to create another pool with a new profile and migrate the data from EC to raw to the new EC. Transcoding would be a possible solution to this problem: it would convert data from one profile to another directly. There will be a constant need for data migration to ensure resiliency optimization for primary as well as secondary data.

As part of its converged systems R&D, Aricent is developing GPU- and GPGPU-based storage acceleration libraries for erasure coding, deduplication and encryption.

Aricent is actively pursuing storage acceleration using converged systems. GPU-, GPGPU- and CUDA-based acceleration are addressed under the accelerated IO storage IO (AIOS) framework. The Aricent Erasure Coding Engine supports accelerated technologies that cater to Ceph and SWIFT solutions. Extending the expertise to address ISA-L as well as FPGA-based acceleration is on the Aricent roadmap. Application-specific and device-specific EC are important areas for Aricent.

Aricent is a global design and engineering company innovating for customers in the digital era. We help our clients lead into the future by solving their most complex and mission critical issues through customized solutions. For decades, we have helped companies do new things and scale with intention. We bring differentiated value and capability in focused industries to help transform products, brands and companies. Based in San Francisco, frog, the global leader in innovation and design, is a part of Aricent.

