FarmGPU Blog

Bare metal performance with FarmGPU and RunPod Blackwell Instant Clusters

The world of AI development is about to get a massive speed boost. Today, FarmGPU and RunPod are thrilled to jointly announce the immediate availability of RunPod Instant Clusters featuring NVIDIA's Blackwell architecture. As of today, you can spin up 6-node B200 HGX clusters, with full cluster expansion planned for the fourth quarter of 2025. This collaboration is designed to eliminate the long waits and complex procurement cycles traditionally associated with deploying high-performance GPU infrastructure, giving you access to the computing power you need, right when you need it.

Launch a Multi-Node GPU Cluster in Minutes, Not Months

Gone are the days of waiting for hardware. RunPod Instant Clusters are designed for speed and simplicity, letting you provision and deploy multi-node GPU infrastructure in just a few minutes through a self-service console.

Key Features of RunPod Instant Clusters:

Screenshot 2025-09-06 103833.png

Better Together: The Technology Behind the Performance

To deliver the incredible speed of bare-metal performance, we've collaborated with industry leaders to build a truly next-generation infrastructure stack.

Backend Fabric: 400 GB/s GPU-to-GPU Communication

In partnership with Celestica and Hedgehog Cloud Open Network Fabrics, we've deployed a new 800G backend fabric for East-West GPU communication between B200 HGX nodes. This state-of-the-art network offers significant advantages over traditional designs. We chose this OCP-compliant, open-spec design for several key reasons:

We look forward to sharing more details about this exciting collaboration at the OCP Global Summit this fall.

Solidigm PCIe 5.0 NVMe: 116 GB/s of Blazing-Fast Local Storage

Each B200 HGX node is equipped with eight Solidigm PCIe 5.0 NVMe drives, each with 15.36 TB of capacity. This setup provides an incredible 116 GB/s of local storage bandwidth, ensuring your workloads are never bottlenecked by disk I/O. This PCIe Gen 5 storage array set records for model loading benchmarks during the latest clustermax.ai round.
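To put those storage numbers in perspective, here is a small back-of-the-envelope sketch in Python. The drive count, per-drive capacity, and 116 GB/s aggregate figure come from the paragraph above; the 140 GB checkpoint (a 70B-parameter model stored in 16-bit precision) is an illustrative assumption, and the load time is an idealized lower bound that ignores filesystem, PCIe, and host-to-GPU overheads.

```python
# Back-of-the-envelope storage math for one B200 HGX node.
# Figures taken from the post: 8 drives, 15.36 TB each, 116 GB/s aggregate.
DRIVES_PER_NODE = 8
DRIVE_CAPACITY_TB = 15.36
NODE_BANDWIDTH_GB_S = 116.0


def per_drive_bandwidth_gb_s(node_gb_s: float, drives: int) -> float:
    """Average bandwidth each drive contributes to the node total."""
    return node_gb_s / drives


def raw_capacity_tb(drives: int, capacity_tb: float) -> float:
    """Raw (unformatted, no redundancy) capacity per node."""
    return drives * capacity_tb


def ideal_load_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on read time if the full bandwidth were sustained."""
    return size_gb / bandwidth_gb_s


# Hypothetical workload: 70e9 parameters * 2 bytes = 140 GB of weights.
CHECKPOINT_GB = 70e9 * 2 / 1e9

print(f"Per-drive bandwidth: "
      f"{per_drive_bandwidth_gb_s(NODE_BANDWIDTH_GB_S, DRIVES_PER_NODE):.1f} GB/s")
print(f"Raw capacity per node: "
      f"{raw_capacity_tb(DRIVES_PER_NODE, DRIVE_CAPACITY_TB):.2f} TB")
print(f"Ideal load time for a {CHECKPOINT_GB:.0f} GB checkpoint: "
      f"{ideal_load_seconds(CHECKPOINT_GB, NODE_BANDWIDTH_GB_S):.2f} s")
```

Real model loading times will be longer than this ideal figure, but the arithmetic shows why local NVMe at this bandwidth keeps checkpoint loads in the range of seconds rather than minutes.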


Bare metal performance with NCCL

Why NCCL Performance Defines AI Infrastructure Excellence

In the race to build efficient AI infrastructure, NCCL (NVIDIA Collective Communications Library) performance has emerged as the critical differentiator between merely having GPUs and actually delivering competitive training and inference capabilities. As SemiAnalysis notes, a network that is half as fast on AllReduce operations translates to a 10% MFU (Model FLOPs Utilization) drop for 70B-parameter model training and a 15-20% penalty for mixture-of-experts architectures.

Our recent benchmarks on 32 NVIDIA B200 GPUs demonstrate why this matters: we are achieving 390 GB/s bus bandwidth on large-scale AllReduce operations. FarmGPU optimizations deliver up to 2.3x performance improvements at the critical 16 MB message size, which sits in the 16-512 MB range that SemiAnalysis identifies as most important for real-world workloads. We also see sub-50 microsecond latencies for small messages and sustained 200+ GB/s algorithm bandwidth even at 32 GB transfers.

While the industry debates 400GbE versus InfiniBand, our results show that proper NCCL optimization and configuration can extract dramatic performance gains regardless of the underlying fabric. Meticulous attention to NCCL performance, from topology-aware scheduling to in-place operation selection, separates production-ready AI infrastructure from a mere collection of expensive GPUs. Without validated NCCL performance, even the most powerful GPUs become bottlenecked by communication overhead, turning what should be breakthrough AI training runs into costly waiting games.
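The bus bandwidth and algorithm bandwidth figures quoted above are two views of the same measurement. The sketch below uses the AllReduce correction factor documented by nccl-tests, busbw = algbw × 2(n − 1)/n for n ranks, to convert between them. The function names are our own for illustration and are not part of NCCL, and we assume one rank per GPU (32 ranks).

```python
# Relating NCCL "algorithm bandwidth" and "bus bandwidth" for AllReduce,
# following the convention documented in nccl-tests:
#   algbw = message_size / time
#   busbw = algbw * 2 * (n - 1) / n      (n = number of ranks)
# Function names here are illustrative, not NCCL APIs.


def alg_bandwidth_gb_s(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: bytes reduced per second, in GB/s."""
    return size_bytes / seconds / 1e9


def allreduce_bus_bandwidth(algbw_gb_s: float, n_ranks: int) -> float:
    """Bus bandwidth implied by a given algorithm bandwidth."""
    return algbw_gb_s * 2 * (n_ranks - 1) / n_ranks


def allreduce_alg_bandwidth(busbw_gb_s: float, n_ranks: int) -> float:
    """Inverse conversion: algorithm bandwidth from bus bandwidth."""
    return busbw_gb_s * n_ranks / (2 * (n_ranks - 1))


N_RANKS = 32          # assumption: one rank per B200 GPU
BUS_BW_GB_S = 390.0   # figure reported in the post

algbw = allreduce_alg_bandwidth(BUS_BW_GB_S, N_RANKS)
print(f"Implied algorithm bandwidth: {algbw:.1f} GB/s")
```

At 32 ranks the factor is 62/32, so 390 GB/s of bus bandwidth corresponds to roughly 201 GB/s of algorithm bandwidth, which lines up with the 200+ GB/s figure. Bus bandwidth is the number to compare against hardware link speed, since it is normalized for the number of ranks.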

Get Started Today

The combination of RunPod's seamless cloud experience, FarmGPU's powerful infrastructure, and a cutting-edge network and storage stack from our partners creates an unparalleled platform for AI innovation. Ready to experience the power of Blackwell? Launch your first instant cluster today: go to RunPod, select Instant Clusters, choose the US-CA-2 region (that is FarmGPU), and deploy a B200 cluster in one click.