The explosion of Big Data from the IoT is posing challenges for conventional storage, be it on hard drives or SSDs. A new storage architecture, Non-Volatile Memory Express, implemented in hardware and SSDs, is now able to radically increase performance.
BY SHREYAS SHAH, XILINX
In today’s connected world, the volume and variety of data being generated are enormous, putting a tremendous burden on storage. The rise of Cloud computing, big data, social media and IoT makes the storage problem even worse. The costs of acquiring data (capital expenses) and managing it (operating expenses) are skyrocketing for Cloud service providers. Available computational power is not enough to make sense of relevant data, which can get lost within the rapidly growing sea of information. Big data analytics can be a big nightmare for Cloud service providers attempting to quickly monetize data. Even when information is relevant and has economic value, that value decays over time, so it is imperative to extract value from relevant data almost in real time.
Data center operators and enterprises are struggling to meet their immense storage needs while attempting to monetize the stored data with big data analytics technologies. The old storage architecture faces challenges in performance, power consumption, management and data monetization as systems scale to zettabytes and beyond. As Figure 1 shows, old storage architectures implemented storage services in software running on a processing subsystem.
Generic storage system architecture
The older generic system architecture uses the most powerful processors (two x86 CPUs), a switch, and I/O cards (with support for a variety of protocols such as FC, FCoE, InfiniBand and Ethernet at various speeds) to connect to the fabric of data centers. The expanders connect to Serial Attached SCSI (SAS)/Serial ATA (SATA) hard disk drives (HDDs) and solid state drives (SSDs). All storage services run in software on processors.
Even when SSDs replace HDDs, the older architectures are still limited to ~50K IOPS (input/output operations per second), whereas Non-Volatile Memory Express (NVMe) SSDs have been measured at more than 1M IOPS. NVMe is a new host interface and protocol designed for memory-based storage attached over PCIe. SSDs with 1M+ IOPS at less than half the latency of traditional SSDs are creating huge waves in the market. In a case study performed by the Storage Networking Industry Association (SNIA), NVMe SSDs delivered half the latency and three times the IOPS of traditional SAS- and SATA-based SSDs.
The NVMe over Fabrics subcommittee created a standard to scale out architectures offering higher performance SSDs connected over the switching fabric with large capacity storage.
These high-capacity (zettabytes and beyond) storage devices coupled with real time analytics allow end customers to extract efficiency out of their storage systems and optimize management of these systems. FPGAs enable the implementation of higher-level storage services in NVMe over Fabrics systems including compression/decompression, security, de-duplication, hashing, and erasure coding, thus delivering significant system-level performance benefits.
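To make two of these services concrete, the following is a minimal software sketch of hash-based de-duplication combined with compression. It is not the Xilinx implementation: the function names, the 4096-byte block size, and the choice of SHA-256 and zlib are assumptions made purely for illustration of what such a service computes before it is moved into hardware.

```python
import hashlib
import zlib


def store_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, keep one compressed copy per unique block.

    Returns (store, recipe): store maps a SHA-256 digest to the compressed
    block, recipe is the ordered list of digests needed to rebuild the data.
    Block size and algorithms are illustrative assumptions.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()   # hashing service
        if digest not in store:                      # de-duplication service
            store[digest] = zlib.compress(block)     # compression service
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by decompressing blocks in recipe order."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)


# Three identical blocks plus one different block: only two are stored.
payload = (b"A" * 4096) * 3 + (b"B" * 4096)
store, recipe = store_blocks(payload)
print(len(recipe), "blocks written,", len(store), "unique blocks stored")
```

Every block passes through a hash and, if new, a compressor, which is the kind of per-byte work that loads a general-purpose processor and is a natural candidate for offload.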
Programmable logic is, indeed, a key component in reducing data center power consumption and accelerating computation. FPGAs can be used as hardware accelerators and can be reconfigured, as in the shell and role model, thus significantly increasing their value in the data center. The Xilinx SDAccel™ Development Environment for data center workload acceleration can be used to reconfigure FPGAs to be purpose-built while supporting different applications on the same hardware.
The new NVMe over Fabrics architecture shown in Figure 2 is scalable and optimized for a 3x-5x increase in performance and half the latency, with services implemented in FPGA-based hardware acceleration. The implementation of these services in Xilinx’s Multiprocessor System on Chip (MPSoC) has resulted in a 30x latency improvement compared to a standard x86 (for compression of files).
Vertically integrated, scalable NVMe over Fabrics with storage in FPGA
The new storage architectures are evolving around scale-out storage, aka fabric-attached storage. Storage is distributed across multiple servers with NVMe-based all-flash storage devices, all connected via fabric. This scalable architecture supports multiple data centers as a single storage domain to scale the storage needs across the globe. The advantage for users is that they can independently scale the network attached storage (NAS) heads, and additional storage can be attached without a forklift upgrade of the NAS heads.
The initiative that started around NVMe over Fabrics has developed into hardware-based accelerators in storage systems. These hardware-based accelerators implement functions such as matrix multiplication for machine learning, caching, de-duplication, comp/de-comp, storage security, hashing, erasure codecs, key value stores and more.
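As an illustration of one function in this list, erasure coding in its simplest form is a single XOR parity shard that allows any one lost data shard to be rebuilt. The sketch below assumes equal-length shards and hypothetical helper names; production erasure codecs typically use stronger codes such as Reed-Solomon, and nothing here reflects the specific codec used in the hardware accelerators.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards):
    """Compute one parity shard as the XOR of all data shards."""
    return reduce(xor_bytes, shards)


def recover(shards, parity, lost_index):
    """Rebuild the shard at lost_index from the parity and the surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    return reduce(xor_bytes, survivors, parity)


# Three equal-length data shards (illustrative values only).
shards = [b"NVMe", b"over", b"Fabr"]
parity = make_parity(shards)
rebuilt = recover(shards, parity, lost_index=1)
print(rebuilt)
```

The inner loop is a wide, regular bit operation over every byte stored, which maps well onto parallel logic.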
As shown in Figure 3, the NVMe over Fabrics architecture supports Network File System (NFS), Common Internet File System (CIFS), or block storage over Internet Wide Area RDMA Protocol (iWARP) or RDMA over Converged Ethernet (RoCEv2) to transfer data from application servers to storage servers. Storage is distributed across multiple servers in a scale-out architecture. The storage devices are connected to storage servers via fabric. The fabric technology is implementation-dependent and could be PCIe, Ethernet, Converged Ethernet, Fibre Channel or InfiniBand. The storage services can run on storage devices (aka target devices), where they are accelerated in hardware. The services are configurable based on the end user, and the capacity of the service is use-case dependent.
Storage architecture with NVMe over Fabrics
Xilinx provides its SDAccel Development Environment, which accepts C, C++ or OpenCL source code as input. The toolset converts this input to Register Transfer Level (RTL) in Verilog or VHDL, and Xilinx’s Vivado Design Suite converts the RTL to a bitstream that gets downloaded into the FPGA. The bitstream configures logic functions in the FPGA.
Figure 4 shows the SDAccel and Vivado toolset with its partial reconfiguration flow along with the shell and role model for configuring/reconfiguring storage services in Xilinx FPGAs.
SDAccel with Vivado and partial reconfiguration with storage services
The purpose of the partial reconfiguration flow is to implement and reconfigure a portion of the FPGA on the fly while the rest of the FPGA is still running other functions. This partial reconfiguration flow supports the industry-wide shell and role model for configurability, where the shell includes connectivity such as PCIe, NVMe controllers, DDR memory controller, NVMe over Fabrics module etc. The shell is always on, whereas the role of the FPGA is design-dependent. In this case, the role implements various storage services such as hashing, comp/de-comp, erasure codecs, storage security, de-dupe etc. The role has standard AXI interfaces so that various IP can come from different sources. This type of hardware acceleration has shown performance benefits in excess of 30x-50x compared to storage services implemented in software on processors.
Figure 5 depicts one of these storage services, compression/de-compression, with the shell and role model. Initially, the PCIe, memory controller and NVMe controllers are configured in the FPGA from Flash-based PROM in less than the PCIe time limit of 120 ms. The host processor then enumerates the PCIe links. Once this task is completed, the processor can download the function or set of functions via PCIe. This partial reconfiguration of the FPGA allows the user to purpose-build the accelerator. In this example, we built the compression algorithm in C language and compared that with the same algorithm running on an x86 processor.
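The shell and role split can be pictured with a small software model. This is only an analogy under stated assumptions: the `Shell` class, its method names, and the two example roles are hypothetical, and real partial reconfiguration loads a partial bitstream into a reserved FPGA region rather than swapping a Python function.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Shell:
    """Always-on part of the design: the connectivity blocks named in the article."""
    static_blocks: tuple = (
        "PCIe",
        "NVMe controller",
        "DDR memory controller",
        "NVMe over Fabrics module",
    )
    role_name: Optional[str] = None
    role_fn: Optional[Callable[[bytes], bytes]] = None

    def load_role(self, name: str, fn: Callable[[bytes], bytes]) -> None:
        """Swap the role; the static blocks are deliberately left untouched."""
        self.role_name = name
        self.role_fn = fn

    def process(self, data: bytes) -> bytes:
        """Pass data through whichever role is currently loaded."""
        if self.role_fn is None:
            raise RuntimeError("no role loaded")
        return self.role_fn(data)


# Hypothetical roles standing in for storage services.
ROLES = {
    "compress": zlib.compress,
    "hash": lambda d: hashlib.sha256(d).digest(),
}

shell = Shell()
blocks_before = shell.static_blocks

shell.load_role("compress", ROLES["compress"])
compressed = shell.process(b"x" * 1024)

shell.load_role("hash", ROLES["hash"])   # reconfigure without touching the shell
digest = shell.process(b"x" * 1024)
```

The common call signature of every role plays the part that the standard AXI interfaces play in hardware: any service that honors the interface can be dropped into the role slot.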
Industry-wide shell and role model in NVMe over Fabrics
The compression performed in software on a 100-GB file took 2.5 hours, whereas the FPGA hardware compressed the same file in 4 minutes, roughly a 37x speedup. The experiment suggests that hardware accelerators can far outperform software for this kind of workload. The algorithms are implemented in C/C++/OpenCL as shown with the SDAccel tool flow with partial reconfiguration (Figure 6).
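The two timings above can be turned into a speedup and an effective throughput with a few lines of arithmetic. The only assumption added here is the decimal convention of 1 GB = 1000 MB; the file size and both durations are the figures quoted in the text.

```python
FILE_SIZE_GB = 100
MB_PER_GB = 1000                      # assumption: decimal gigabytes

software_seconds = 2.5 * 3600         # 2.5 hours on the x86 processor
fpga_seconds = 4 * 60                 # 4 minutes on the FPGA

speedup = software_seconds / fpga_seconds
software_mb_per_s = FILE_SIZE_GB * MB_PER_GB / software_seconds
fpga_mb_per_s = FILE_SIZE_GB * MB_PER_GB / fpga_seconds

print(f"speedup:  {speedup:.1f}x")
print(f"software: {software_mb_per_s:.1f} MB/s")
print(f"FPGA:     {fpga_mb_per_s:.1f} MB/s")
```

The result, 37.5x, falls inside the 30x-50x range quoted for hardware-accelerated storage services.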
Performance comparison of X86 with FPGA
Xilinx’s SDAccel flow provides C/C++/OpenCL language programmers the ability to code their algorithms in their preferred language and download the bit stream to FPGAs without much knowledge of the hardware. Hardware-based accelerators are becoming more popular as software hits the wall and as programming languages shift from Verilog/VHDL to higher level languages like C/C++/OpenCL.
Xilinx’s NVMe over Fabrics implementation along with partial reconfiguration and SDAccel Development Environment provides various services including comp/de-comp, security, hashing, erasure coding, LDPC error correction, caching etc. as shown in Figure 7.
Xilinx’s implementation of NVMe over Fabrics solution
As shown earlier, comp/de-comp was the first service implemented inside the solution. The RNIC inside a Xilinx MPSoC could include low-latency Ethernet MACs with PFC, with IP and TCP terminated in hardware. The RDMA portion, including iWARP or RoCEv2, can also be terminated inside the FPGA fabric. The host bridge to the various NVMe drives is implemented in the solution in Figure 7. The advantage of this architecture is that it supports AXI interfaces on all IP, making it easy to add or delete storage services, or to shrink or increase their capacity, on the fly with the SDAccel toolset. The services can be configured and reconfigured based on end customer requirements.
With the SDAccel flow, coupled with partial reconfiguration and connectivity interfaces like PCIe, memory controllers, Ethernet MACs and NVMe controllers, you can use the shell and role reconfigurability model with FPGAs to implement hardware accelerators in the NVMe over Fabrics architecture. This fabric-attached storage, with scalable performance and a highly efficient platform for analytics, provides much lower total cost of ownership for Cloud service providers. Future articles will provide updates on porting other services, such as security, matrix multiplication, Spark machine-learning (ML) library acceleration for analytics, and de-duplication, to hardware accelerators in these same architectures.
San Jose, CA