Using Azure and Pure Storage to Scale EDA Workloads
Recently, Microsoft needed a partner to help a leading chip producer expand its HPC clusters by bursting into the cloud. The goal was to solve the scalability and time-to-implement problems caused by limited on-prem space and rack/stack delays. Microsoft turned to Six Nines to help the chip producer burst to the Azure cloud with on-demand virtual machines and to test several storage offerings.
EDA workloads involve many hundreds of thousands of very small files, so disk IO and access times are the bottlenecks of their regression tests. Backward compatibility with file permissions also ruled out some offerings. For these reasons, an all-flash storage backend was decided on. When the offerings were compared, Pure Storage came out on top and was prepared to work alongside the leading chip producer and Microsoft in testing its offering.
High-Performance Computing (HPC) workloads are synonymous with a high number of compute resources that can scale with the number of jobs in the queue. A scheduler manages the batch jobs, which are noncontinuous and run in parallel in an HPC environment. There is a strong requirement to scale performance and capacity for storing the high density of data during the design and modeling process. Many core engineering HPC workflows use a distributed file system on Network Attached Storage (NAS), accessed over NFS, SMB, and S3, to read and write structured and unstructured data at the same time from different clients. FlashBlade provides the performance and capacity scalability required for simulation, modeling, analytics, machine learning, rendering, medical imaging, and similar workloads on a standard data management platform.
However, not all modern HPC applications and workloads are created equal. Some workloads have massive amounts of metadata, others require high throughput, and some need both. In more than one workload scenario, the filesystem size and the directory structure determine the access pattern and the workload generated by the HPC applications. Traditionally, HPC workloads have run in on-premises datacenters. In recent times, the majority of HPC workflows, such as chip design and manufacturing for semiconductor companies, have been shifting to hybrid cloud.
This blog highlights how semiconductor companies and others use Electronic Design Automation (EDA) tools from vendors like Synopsys, Cadence, and Mentor Graphics for the design and simulation of System on Chip (SoC) devices. In recent times, the logical and physical chip simulation process has tended to extend beyond the local datacenter boundaries into the cloud, where cloud bursting provides compute resources on-demand at scale. EDA workloads are mostly file-based and are metadata-intensive or throughput-intensive in different parts of the workflow.
The FlashBlade stores millions of files in deep directory structures with various file sizes that are primarily accessed over NFSv3 by the EDA tools used in the chip design workflow. FlashBlade is a distributed file system that scales independently of the compute resources with respect to capacity and performance for various parts of the frontend and backend chip design workflows.
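The effect of file count and directory depth on access time can be explored with a small metadata probe. The following is a minimal sketch, not a benchmark used in this validation: the function names `make_tree` and `stat_walk` and the tree dimensions are hypothetical, and it runs against a local temporary directory by default. Pointing `root` at a directory on an NFS mount would be one way to observe how per-file metadata calls, rather than data volume, come to dominate the elapsed time.

```python
import os
import tempfile
import time


def make_tree(root, depth, dirs_per_level, files_per_dir, file_bytes=512):
    """Create a nested tree of small files under root; return the file count."""
    payload = b"x" * file_bytes
    created = 0
    level = [root]
    for _ in range(depth):
        next_level = []
        for parent in level:
            for d in range(dirs_per_level):
                path = os.path.join(parent, f"d{d}")
                os.mkdir(path)
                next_level.append(path)
                for f in range(files_per_dir):
                    with open(os.path.join(path, f"f{f}.dat"), "wb") as fh:
                        fh.write(payload)
                    created += 1
        level = next_level
    return created


def stat_walk(root):
    """Walk the tree and stat every file; return (files, bytes, seconds)."""
    start = time.perf_counter()
    count = 0
    total_bytes = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            total_bytes += os.stat(os.path.join(dirpath, name)).st_size
            count += 1
    return count, total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    # Illustrative sizes only; real EDA trees hold far more files.
    with tempfile.TemporaryDirectory() as tmp:
        created = make_tree(tmp, depth=3, dirs_per_level=2, files_per_dir=4)
        count, total_bytes, seconds = stat_walk(tmp)
        print(f"created={created} stat'ed={count} bytes={total_bytes}")
        print(f"metadata walk took {seconds:.4f}s")
```

Each file in the walk costs at least one metadata call regardless of its size, which is why a tree of many tiny files stresses a filesystem differently from a few large ones.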
The HPC workloads are gaining momentum in a hybrid cloud environment compared to the business-owned data centers. However, the transition to the cloud has its own share of challenges for the HPC workloads.
- Provisioning and scaling infrastructure to meet business requirements is slow and lacks elasticity.
- Internal data silos and complex workflows impair the ability to extend HPC workloads to the cloud on-demand, which slows time to results.
- A higher number of computing resources like CPUs, GPUs, FPGAs, etc., is not easy to provision on-demand for HPC workflows.
- Not many HPC applications are validated for, or properly licensed to function in, cloud environments.
Chip design and manufacturing environments are going through a major makeover. Modern chip design and manufacturing for sub-10nm SoCs brings increased design complexity, a requirement for high transistor density, and significant growth in the number of parallel jobs submitted to the queue during the logical and physical chip verification process, all of which require scaling CPU and GPU resources on-demand. At the outset, the public cloud seems to be the logical solution for these infrastructure scalability requirements. However, it does not provide a satisfactory answer to many technical and legal challenges.
- Data Security and control. Almost all the data consumed and generated during the design and manufacturing process is the intellectual property of the business owners. These business owners are wary of moving data into the public cloud, where they have no control over or knowledge of the data location.
- Data Management and Mobility. Data has gravity. As mentioned above, EDA workloads tend to have a high file count with various file sizes. Moving data to the cloud and protecting it there is limited and challenging at times.
- Cloud lock-in. Moving chip design data into a particular public cloud provider locks in the data for the long term. The egress and ingress data costs rise as the volume of data created and moved between on-premises and cloud grows. There is reduced flexibility to move data to other public cloud providers.
- Complexity and Cost. Any alternative infrastructure hybrid cloud solution to provide data sovereignty to the semiconductor companies involves high cost and complexity to manage and scale infrastructure. The legal ramifications outweigh the desire to move to the cloud in many business scenarios.
Azure cloud has many inherent capabilities, such as offering a choice of CPU and GPU resources at scale for cloud bursting during the chip design phase. Azure also has various network speed offerings to handle EDA workloads at scale. FlashBlade is the storage of choice in the on-premises datacenters of many semiconductor companies, and it can extend to any cloud provider.
FlashBlade in an Equinix colocation extends the on-premises datacenter and provides cloud-adjacent capabilities connected to Azure without any cloud lock-in, as shown in the following diagram.
The FlashBlade platform is available in many Equinix locations globally and is in close proximity to Azure regions. The FlashBlade in the Equinix locations provides all the native data management and mobility capabilities in a cloud-adjacent position next to Azure. FlashBlade is connected to the Azure Express Route using Equinix Fabric. Pure Storage and Equinix provide a complete managed service in the colocations that eliminates the overhead of owning and managing the infrastructure, keeps operations simple, and gives the business owner the data sovereignty that is needed.
As part of the test plan, Azure and Pure Storage jointly listed some of the commonly used EDA tools to validate this cloud-adjacent solution on datasets that can be used in Azure for testing. The SpecStorage2020 benchmarking tool, which has a default EDA workload profile, was used with VMs that pointed all data operations to the FlashBlade over the Express Route. The Azure VMs were provisioned and scaled on-demand using Azure CycleCloud. EDA tools like Cadence SpectreX and a Microsoft-designed EDA workload simulation tool called Hasbro were used to stress the network and FlashBlade capabilities. An additional software build workload using the open source Linux kernel was used to test high metadata operations against the FlashBlade at scale.
All the tests indicated that most of the EDA tools that are qualified to run in the cloud can use FlashBlade in Equinix over Express Route without any problems. The Express Route with Ultra performance gateway is recommended for running EDA tools with FlashBlade over NFSv3. The following table and graphs compare the Azure and FlashBlade cloud-adjacent validation with Azure NetApp Files (ANF). The “orange” curve that represents the FlashBlade performance scales linearly with respect to latency for both NFSv3 and NFSv4.1 filesystem access until the Express Route network speed becomes the bottleneck. For more information, refer to this Microsoft blog.
Figure: Azure and FlashBlade in Equinix compared to ANF in Azure datacenter
The Software build test with open source Linux kernel exhibited better performance with Azure and FlashBlade over Express Route. As the the
Figure: Average Linux Kernel Build Time (eight threads per build) vs. Builds
For more information refer to the Pure Storage blog.
The Azure cloud adjacent solution with FlashBlade in Equinix is the first of its kind for EDA workloads to burst into the cloud on-demand. There were apprehensions about this solution at the beginning.
- Conformance – The premise of this validation was to identify if the EDA tools running in the Azure cloud worked properly with FlashBlade in Equinix location over Express Route.
- Network speed – Is a minimum 10Gb/sec Express Route with a standard gateway configuration enough for EDA tools in Azure VMs to run at scale?
- Performance – Will a 2ms round trip time between the Azure VMs and FlashBlade in Equinix location allow scaling the performance of the EDA workloads?
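The network speed and round-trip questions above can be framed with Little's Law (throughput = concurrency / latency). The following is a back-of-the-envelope sketch, not a result from this validation: the function names are hypothetical, the 64 KiB operation size is an assumed value for illustration, and the model ignores protocol overhead and storage service time.

```python
import math


def ops_per_second(outstanding_requests: int, rtt_seconds: float) -> float:
    """Little's Law: achievable ops/sec for a given number of requests in flight."""
    return outstanding_requests / rtt_seconds


def link_limit_ops(link_gbit_per_s: float, bytes_per_op: int) -> float:
    """Upper bound on ops/sec imposed by link bandwidth alone."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return bytes_per_second / bytes_per_op


def requests_to_fill_link(link_gbit_per_s: float, bytes_per_op: int,
                          rtt_seconds: float) -> int:
    """Requests that must be in flight before the link, not latency, is the limit."""
    return math.ceil(link_limit_ops(link_gbit_per_s, bytes_per_op) * rtt_seconds)


if __name__ == "__main__":
    rtt = 0.002          # 2ms round trip, as in the question above
    link = 10            # 10Gb/sec Express Route, as in the question above
    op_size = 64 * 1024  # assumed 64 KiB per operation (illustrative only)

    print("one synchronous client thread:", ops_per_second(1, rtt), "ops/sec")
    print("link ceiling:", round(link_limit_ops(link, op_size)), "ops/sec")
    print("requests in flight to fill link:",
          requests_to_fill_link(link, op_size, rtt))
```

Under these assumptions, a single synchronous request stream is latency-bound well below the link ceiling, so performance at a fixed round-trip time depends on how many clients and jobs issue requests in parallel.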
Apart from IP protection, FlashBlade provides complete encryption for data at rest. FlashBlade also provides protection against Ransomware using SafeMode snapshots.
The test results so far show that Azure VMs connected to the FlashBlade through an ultra gateway and an Express Route of at least 10Gb/sec network speed deliver the best performance for EDA tools at scale. Beyond conformance, network speed, and workload performance, data mobility from on-premises to cloud and vice versa is a critical component of the entire data lifecycle in this validation.
The following figure illustrates how array level bi-directional file system replication can accelerate the data transfer between FlashBlade endpoints to stage data for cloud bursting using Azure VMs. For more information refer to the Pure Storage blog.
Figure: Data Staging in Cloud adjacent FlashBlade in Equinix with Azure Cloud.
The other use case was to move and archive the data to Azure blob storage for long-term retention after the chip design project is completed. The following diagram shows how data from an NFS share on FlashBlade is moved to Azure blob storage using an external tool like “fpsync” for long-term storage. For more information refer to this Pure Storage blog.
Figure: Long term retention to Azure blob storage using “fpsync”
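Tools in the fpsync family speed up bulk copies by splitting a large file tree into batches that can be transferred by parallel jobs. The following is a minimal sketch of that partitioning idea only, not fpsync's actual implementation: the function names `walk_with_sizes` and `partition_files` and the limits in the example are hypothetical, and the transfer step itself is left out.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def walk_with_sizes(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for every file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            yield full, os.path.getsize(full)


def partition_files(files: Iterable[Tuple[str, int]],
                    max_files: int, max_bytes: int) -> Iterator[List[str]]:
    """Group files into batches capped by file count and total bytes.

    A file larger than max_bytes is placed in a batch of its own.
    """
    batch: List[str] = []
    batch_bytes = 0
    for path, size in files:
        # Close the current batch if adding this file would exceed a cap.
        if batch and (len(batch) >= max_files or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        yield batch


if __name__ == "__main__":
    # Illustrative input; in practice this would come from walk_with_sizes().
    sample = [("a.v", 10), ("b.v", 10), ("c.v", 10), ("big.gds", 100)]
    for i, batch in enumerate(partition_files(sample, max_files=2, max_bytes=50)):
        print(f"batch {i}: {batch}")
```

Each batch could then be handed to a separate copy job, which is what lets a tree with a very high file count be archived in parallel instead of through one serial walk.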
The EDA tools tested in this Azure cloud-adjacent FlashBlade in Equinix solution are one of many HPC workloads that benefit from this architecture. The EDA validation reaffirmed that chip design and manufacturing workloads can achieve performance and cost-efficiency similar to workloads on-premises. The cost savings are up to 2.6x compared to competing storage offerings in the Azure cloud. With no Azure data egress cost, there is a significant saving in cost versus bandwidth when data is moved to the FlashBlade that connects to Azure Cloud. With the managed services from Equinix and Pure Storage, business owners will have less infrastructure management overhead and more elasticity, along with more data ownership and control.
For more information and a deeper dive into scaling EDA workloads, check out the Pure Storage White Paper.