Beating every major neocloud and hyperscaler with sub-1-second PyTorch imports.
Source: SemiAnalysis — ClusterMax 2.0: The Industry Standard
FarmGPU ranked #1 overall in SemiAnalysis's ClusterMax 2.0 benchmarks, outperforming every major neocloud and hyperscaler in the PyTorch import time test — the single most practical indicator of storage I/O performance for AI workloads.
The import torch benchmark is deceptively simple. It measures how long it takes to cold-import PyTorch in a fresh Python process. This exercises sequential read throughput, random read IOPS, metadata operations, and filesystem caching — all at once. Slow storage means slow imports, slow checkpoint loads, and slow data pipelines. Every second you wait is a second your GPUs sit idle.
FarmGPU's result: 0.98s on H100 and 0.72s on B200. For context, most hyperscalers clock in at 5–30 seconds on the same test.
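The test is easy to reproduce. A minimal sketch using only the standard library (note: the real benchmark presumably also drops the OS page cache between runs so the import is truly cold; this sketch doesn't):

```python
import subprocess
import sys
import time

def cold_import_time(module: str) -> float:
    """Time how long a fresh Python process takes to import a module, in seconds.

    Spawning a new interpreter guarantees the module isn't already in
    sys.modules, so every .py/.so file is read from the filesystem.
    """
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start

# The actual benchmark is cold_import_time("torch"); "json" stands in here
# so the sketch runs even without PyTorch installed.
print(f"{cold_import_time('json'):.3f}s")
```

To isolate the import cost from interpreter startup, you could subtract a baseline run of `python -c "pass"` measured the same way.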
ClusterMax 2.0 Storage Requirements
ClusterMax 2.0 evaluates GPU cloud providers across multiple categories. The Storage category tests performance, scalability, and reliability:
- Out-of-the-box parallel filesystem (e.g., Weka, DDN, VAST)
- Out-of-the-box managed S3-compatible object storage
- Storage integration with Kubernetes for PVCs/storage class
- Proper mounting configuration out-of-the-box
- Mount reliability (no random flaking on and off)
- Read performance testing
- Write performance testing
- Throughput and latency measurements
- Scalability testing for performance and capacity
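As one concrete illustration of the Kubernetes item above, a PVC bound to a local-NVMe storage class might look like the following. This is a generic sketch: the storage class name is hypothetical, and FarmGPU's actual class names aren't given in this post.

```yaml
# Hypothetical example — "local-nvme-raid10" is an assumed storage class name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-nvme-raid10
  resources:
    requests:
      storage: 1Ti
```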
FarmGPU earned top marks across the board. The key differentiator? Local NVMe storage that's properly configured from the start. (We'll cover network storage in a future post.)
How We Did It
No exotic distributed filesystem. No proprietary storage appliance. Just good hardware, configured correctly:
- FarmGPU hardware — 8x NVMe SSDs per node in RAID10 (Solidigm PS-1010 PCIe 5.0 drives on B200 nodes, Dell P5520 drives on H100 nodes)
- RunPod PyTorch container — Docker image with PyTorch and CUDA pre-installed
- XFS filesystem — battle-tested, high-performance, passed through to the container as /workspace
RAID10 gives us the best of both worlds: mirrored pairs for redundancy, striped across all pairs for maximum throughput. Eight NVMe drives in RAID10 deliver massive sequential and random I/O, far exceeding what any network-attached solution can offer at the same latency.
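The arithmetic behind that claim can be sketched as follows. The per-drive figures below are illustrative assumptions, not measured values from this post:

```python
# Back-of-the-envelope RAID10 math for an 8-drive array.
n_drives = 8
drive_capacity_tb = 7.0        # e.g. a 7 TB NVMe drive (assumed)
drive_seq_read_gbps = 14.0     # assumed per-drive sequential read
drive_seq_write_gbps = 10.0    # assumed per-drive sequential write

pairs = n_drives // 2          # RAID10 = a stripe across mirrored pairs

# Mirroring halves usable capacity.
usable_tb = pairs * drive_capacity_tb

# Reads can be served by either side of each mirror, so all drives contribute.
peak_read_gbps = n_drives * drive_seq_read_gbps

# Every write must land on both mirrors, so only one drive per pair counts.
peak_write_gbps = pairs * drive_seq_write_gbps

print(usable_tb, peak_read_gbps, peak_write_gbps)  # → 28.0 112.0 40.0
```

These are theoretical ceilings; real throughput also depends on the RAID implementation, queue depths, and the PCIe topology between the drives and the CPU.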
Every FarmGPU NVIDIA HGX server ships with this stack ready to go. No setup, no tuning, no waiting for a storage admin. You get a RunPod container with /workspace backed by the full NVMe array, delivering all-out local NVMe RAID performance.
Benchmark Results: NVIDIA H100
System: NVIDIA H100 80GB HBM3 · Intel Xeon Platinum 8462Y+ (128 cores) · 2,015 GB RAM · 8x Dell NVMe P5520 3.84TB in RAID10 · PyTorch 2.8.0+cu128 · Ubuntu 24.04

Highlights: import torch in 0.98s · Random Read 49.55 GB/s · Seq Write 12.19 GB/s · RND4K Q1T1 Read 14.1K IOPS · RND4K Q1T1 Write 73.9K IOPS

Full benchmark output:
───────────────────── FarmGPU SILO Disk Bench ─────────────────────
Test             Dir     Bandwidth      IOPS      Latency
───────────────  ──────  ────────────   ───────   ─────────
SEQ1M Q8T1       Read    9.26 GB/s      9.3K      863.7 us
SEQ1M Q8T1       Write   12.19 GB/s     12.2K     655.7 us
SEQ1M Q1T1       Read    8.17 GB/s      8.2K      122.1 us
SEQ1M Q1T1       Write   3.89 GB/s      3.9K      256.8 us
SEQ128K Q32T1    Read    11.59 GB/s     92.7K     344.9 us
SEQ128K Q32T1    Write   8.43 GB/s      67.4K     474.1 us
RND128K Q128T8   Read    49.55 GB/s     396.4K    2.58 ms
RND128K Q128T8   Write   13.81 GB/s     110.5K    9.25 ms
RND4K Q32T16     Read    10.57 GB/s     2.71M     188.8 us
RND4K Q32T16     Write   782.38 MB/s    200.3K    2.55 ms
RND4K Q1T1       Read    55.07 MB/s     14.1K     70.6 us
RND4K Q1T1       Write   288.78 MB/s    73.9K     13.2 us
Seq Read Max: 11.59 GB/s
Seq Write Max: 12.19 GB/s
Rnd Read Max: 49.55 GB/s
Rnd Write Max: 13.81 GB/s
Benchmark Results: NVIDIA B200
System: NVIDIA B200 183GB HBM3e · AMD EPYC 9555 64-Core (224 threads) · 2,267 GB RAM · 8x Solidigm PS-1010 7TB in RAID10 · PyTorch 2.8.0+cu128 · Ubuntu 24.04
Highlights: import torch in 0.72s · Random Read 100.99 GB/s · Seq Read 34.75 GB/s · RND4K Q1T1 Read 18.5K IOPS · RND4K Q1T1 Write 95.6K IOPS
Full benchmark output:
───────────────────── FarmGPU SILO Disk Bench ─────────────────────
Test             Dir     Bandwidth      IOPS      Latency
───────────────  ──────  ────────────   ───────   ─────────
SEQ1M Q8T1       Read    34.75 GB/s     34.7K     230.0 us
SEQ1M Q8T1       Write   30.11 GB/s     30.1K     265.5 us
SEQ1M Q1T1       Read    4.05 GB/s      4.1K      246.6 us
SEQ1M Q1T1       Write   10.45 GB/s     10.5K     95.4 us
SEQ128K Q32T1    Read    25.13 GB/s     201.0K    159.0 us
SEQ128K Q32T1    Write   16.25 GB/s     130.0K    245.9 us
RND128K Q128T8   Read    100.99 GB/s    807.9K    1.27 ms
RND128K Q128T8   Write   15.60 GB/s     124.8K    8.19 ms
RND4K Q32T16     Read    6.58 GB/s      1.68M     303.8 us
RND4K Q32T16     Write   787.89 MB/s    201.7K    2.54 ms
RND4K Q1T1       Read    72.07 MB/s     18.5K     54.0 us
RND4K Q1T1       Write   373.62 MB/s    95.6K     10.2 us
Seq Read Max: 34.75 GB/s
Seq Write Max: 30.11 GB/s
Rnd Read Max: 100.99 GB/s
Rnd Write Max: 15.60 GB/s
Source Code
Full source available on GitHub: https://github.com/FarmGPU/clustermax-storage