The world of AI development is about to get a massive speed boost. Today, FarmGPU and RunPod are thrilled to jointly announce the immediate availability of RunPod Instant Clusters featuring NVIDIA's cutting-edge Blackwell architecture. As of today, you can spin up six-node B200 HGX clusters, with full cluster expansion planned for the fourth quarter of 2025. This collaboration eliminates the long waits and complex procurement cycles traditionally associated with deploying high-performance GPU infrastructure, giving you access to the computing power you need, right when you need it.
Launch a Multi-Node GPU Cluster in Minutes, Not Months
Gone are the days of waiting for hardware. RunPod Instant Clusters are designed for speed and simplicity, letting you provision and deploy powerful multi-node GPU infrastructure in just a few minutes through an intuitive self-service console.
Key Features of RunPod Instant Clusters:
- Run Any Workload: Bring your own Docker containers or choose from a library of optimized templates for inference, training, and research. With support for all major AI frameworks, you have the flexibility to build your way (a minimal container entry-point sketch follows this list).
- Pay-by-the-Second Billing: Forget costly commitments and upfront fees. With precise per-second billing, you pay only for the compute time you actually use, which flexibly accommodates even intermittent or experimental workloads.
- Complete Control and Freedom: Start and stop your cluster at any time without penalty. There are no minimum runtime requirements or termination fees, giving you the freedom to manage your resources as your projects demand.
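To make "bring your own container" concrete, here is a minimal sketch of the kind of multi-node PyTorch entry point you might package in your image. It assumes a torchrun-style launcher that injects the standard RANK, LOCAL_RANK, and WORLD_SIZE environment variables; nothing here is RunPod-specific, and your real workload would replace the sanity check.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun (or any compatible launcher) injects these standard
    # environment variables on every node in the cluster.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL is the backend of choice for GPU-to-GPU collectives.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Trivial sanity check: an all-reduce across every GPU in the cluster.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    if rank == 0:
        print(f"all_reduce result: {x.item()} (expected {world_size})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```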
Better Together: The Technology Behind the Performance
To deliver true bare-metal performance, we've collaborated with industry leaders to build a next-generation infrastructure stack.
Backend Fabric: 400 GB/s GPU-to-GPU Communication
In partnership with Celestica and Hedgehog Cloud Open Network Fabrics, we've deployed a new 800G backend fabric for East-West GPU communication between B200 HGX nodes. This state-of-the-art network offers significant advantages over traditional designs. We chose this OCP-compliant, open-spec design for several key reasons:
- 50% lower cost than the NVIDIA Spectrum-X reference design.
- Hyperscale management and zero-touch provisioning, powered by SONiC NOS and the Hedgehog Cloud open network fabric.
- Higher bandwidth per storage node (800 Gbps vs. 400 Gbps). (We will expand on this in a future post focused on front end / networked storage)
- Custom optics tuned for CX7 and BF3, with full support for RoCEv2 and NVMe-oF (an illustrative NCCL configuration sketch follows this list).
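As an illustration of that last point, here is a hedged sketch of the standard NCCL environment variables one would typically set for a RoCEv2 fabric on CX7-class NICs. The device and interface names are placeholders, and the values are common starting points rather than our production tuning.

```python
import os

# Illustrative NCCL settings for a RoCEv2 fabric on ConnectX-7 NICs.
# The HCA and interface names below are placeholders; check your nodes
# with `ibv_devinfo` and adjust. These are standard NCCL env vars, not
# a FarmGPU- or RunPod-specific API.
roce_env = {
    "NCCL_IB_HCA": "mlx5",          # use the NVIDIA/Mellanox RDMA devices
    "NCCL_IB_GID_INDEX": "3",       # RoCEv2 GIDs typically live at index 3
    "NCCL_SOCKET_IFNAME": "eth0",   # bootstrap/control-plane interface
    "NCCL_DEBUG": "INFO",           # log ring/tree and transport selection
}
os.environ.update(roce_env)
# Set these before torch.distributed.init_process_group(backend="nccl")
# so NCCL picks them up when it initializes its transports.
```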
We look forward to sharing more details about this exciting collaboration at the OCP Global Summit this fall.
Solidigm PCIe 5.0 NVMe: 116 GB/s of Blazing-Fast Local Storage
Each B200 HGX node is equipped with eight Solidigm PCIe 5.0 NVMe drives, each with 15.36 TB of capacity. This setup provides an incredible 116 GB/s of local storage bandwidth, ensuring your workloads are never bottlenecked by disk I/O. This PCIe Gen 5 storage array set records in model-loading benchmarks during the latest clustermax.ai round.
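As a back-of-the-envelope illustration (not our benchmark methodology), here is a sketch that aggregates sequential reads across several drives and reports combined throughput. The mount points and file names are hypothetical, and a real measurement would use a tool like fio with O_DIRECT to bypass the page cache.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mount points, one per local NVMe drive.
PATHS = [f"/mnt/nvme{i}/testfile" for i in range(8)]
CHUNK = 64 * 1024 * 1024  # 64 MiB reads

def read_file(path: str) -> int:
    # Stream the file to the end, returning the byte count.
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                return total
            total += len(data)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(PATHS)) as pool:
    read_bytes = sum(pool.map(read_file, PATHS))
elapsed = time.perf_counter() - start
print(f"{read_bytes / elapsed / 1e9:.1f} GB/s aggregate read bandwidth")
```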
Bare-Metal Performance with NCCL
Why NCCL Performance Defines AI Infrastructure Excellence
In the race to build efficient AI infrastructure, NCCL (NVIDIA Collective Communications Library) performance has emerged as the critical differentiator between merely having GPUs and actually delivering competitive training and inference capabilities. As SemiAnalysis aptly notes, a network that's half as slow on AllReduce operations translates to a 10% drop in MFU (Model FLOPs Utilization) for 70B-parameter model training, and a crushing 15-20% penalty for mixture-of-experts architectures.

Our recent benchmarks on 32 NVIDIA B200 GPUs demonstrate why this matters: we're achieving 390 GB/s bus bandwidth on large-scale AllReduce operations, and FarmGPU optimizations deliver up to 2.3x performance improvements at the critical 16 MB message size, precisely the range (16-512 MB) that SemiAnalysis identifies as most important for real-world workloads. While the industry debates 400GbE versus InfiniBand, our results show that proper NCCL optimization and configuration can extract dramatic performance gains regardless of the underlying fabric.

With sub-50-microsecond latencies for small messages and sustained 200+ GB/s algorithm bandwidth even at 32 GB transfers, we're proving that meticulous attention to NCCL performance, from topology-aware scheduling to in-place operation selection, separates production-ready AI infrastructure from a mere collection of expensive GPUs. Without validated NCCL performance, even the most powerful GPUs become bottlenecked by communication overhead, turning what should be breakthrough training runs into costly waiting games.
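For readers who want to reproduce numbers like these, here is a minimal sketch of how bus bandwidth is conventionally measured, following the nccl-tests formula for AllReduce: busbw = algbw * 2 * (n - 1) / n. It uses plain torch.distributed under a torchrun-style launch, and it is a simplified illustration rather than our actual benchmark harness; message size and iteration counts are arbitrary choices.

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with torchrun across all nodes; NCCL handles the collectives.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
n = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

nbytes = 1 << 30  # 1 GiB message (illustrative size)
x = torch.empty(nbytes // 4, dtype=torch.float32, device="cuda")

# Warm up so NCCL finishes channel/ring setup before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()

# Algorithm bandwidth, then the nccl-tests bus-bandwidth correction
# for AllReduce: busbw = algbw * 2 * (n - 1) / n.
algbw = nbytes * iters / (time.perf_counter() - start) / 1e9
busbw = algbw * 2 * (n - 1) / n
if rank == 0:
    print(f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s")

dist.destroy_process_group()
```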
Get Started Today
The combination of RunPod's seamless cloud experience, FarmGPU's powerful infrastructure, and a cutting-edge network and storage stack from our partners creates an unparalleled platform for AI innovation. Ready to experience the power of Blackwell? Head to RunPod, select Instant Clusters, choose the US-CA-2 data center (that's FarmGPU), and deploy a B200 cluster in one click.