When we set out to build our first GPU cluster at Farm GPU, we thought we had it figured out. Six servers, some cables, an afternoon of work—how hard could it be?
Seventeen days later, after debugging everything from optics initialization bugs to BIOS settings that silently killed performance by 50%, we emerged with something remarkable: a B200 cluster running at 392 GB/s with industry-leading performance, built on an open Ethernet fabric that cost roughly half as much as the proprietary InfiniBand alternative.
This is the story of what we learned, presented at the recent OCP Summit alongside our partners Celestica and Hedgehog. And it's a case study in why building open networks matters—even when it's hard.
The Economics That Drove Our Decision
Let's be blunt: we chose Ethernet because of cost. The OCP-based solution was 50% cheaper than the reference InfiniBand architecture. When your customers won't pay a premium for training infrastructure, that cost difference isn't just nice to have—it's the difference between a viable business and sitting on idle hardware.
The choice was simple: we could either spend our budget on networking overhead, or we could spend it on more GPUs. We chose GPUs.
But that choice came with a price—just not the one we expected.
The Reality: 90% Network Debugging
Here's what nobody tells you: the network is a lot more than just a switch. It's the entire stack: cabling, optics interoperability, server BIOS settings, OS kernel versions, NVIDIA drivers, and NIC configurations. Get any one of them wrong, and you can lose 50% of your performance.
We learned that lesson the hard way. Multiple times.
The Optics Ordeal
The biggest wake-up call? Optics. We finally understand why hyperscalers obsess over optics sourcing and validation. When we tried to order the validated, compatible optics for our Celestica DS5000 switches, we hit a six-month lead time. Six months of B200s sitting idle wasn't an option, so we ordered what was available.
That decision kicked off a multi-week debugging saga. We discovered a critical bug where certain Broadcom/Celestica optics wouldn't initialize if you broke out the ports before inserting them; recovering required a hard reboot of the switch. Finding it took engineers from Hedgehog, Celestica, Broadcom, and our optics supplier on joint calls over multiple days.
Then there were the simpler mistakes that caused just as much havoc: a speck of dust on a connector, reversed lanes on MPO cables (because the optics on the bottom of the switch are upside down—of course), and technicians plugging cables into the wrong ports because they didn't know the labeling conventions.
One speck of dust can bring down an entire cluster. We learned that the hard way.
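Software won't catch dust, but it will catch cables in the wrong ports. One guardrail we now lean on (and call out in the table below) is letting software double-check the physical work. Here's a hypothetical sketch of the idea: compare the LLDP neighbors each host actually sees against the cabling map you intended. It assumes lldpd is running on the hosts; the interface names, switch names, and JSON layout are placeholders to adapt to your environment.

```python
import json
import subprocess

# Hypothetical cabling plan: host NIC -> (switch, switch port) we intended to wire.
EXPECTED = {
    "enp26s0np0": ("leaf-1", "Ethernet0"),
    "enp59s0np0": ("leaf-2", "Ethernet8"),
}

def lldp_neighbors() -> dict:
    """Ask lldpd what is actually plugged in, as seen from this host."""
    out = subprocess.run(
        ["lldpcli", "-f", "json", "show", "neighbors"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def check_cabling() -> None:
    data = lldp_neighbors()
    # NOTE: the JSON shape below is an assumption; adjust to your lldpd version.
    for entry in data.get("lldp", {}).get("interface", []):
        for nic, info in entry.items():
            switch = next(iter(info["chassis"]))   # neighbor (switch) hostname
            port = info["port"]["id"]["value"]     # neighbor port identifier
            expected = EXPECTED.get(nic)
            if expected is None:
                continue
            if (switch, port) != expected:
                print(f"MISCABLED {nic}: expected {expected}, got ({switch}, {port})")
            else:
                print(f"OK {nic} -> {switch} {port}")

if __name__ == "__main__":
    check_cabling()
```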
The Configuration Web
Getting a ConnectX-7 NIC to link to a SONiC switch isn't plug-and-play. You have to:
- Force the port to the required link speed
- Turn off auto-negotiation
- Enable breakout in the correct order
Miss any of these steps, and nothing links. It's like an act of God just to get connectivity.
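To make the sequence concrete, here's a minimal sketch of the bring-up scripted end to end. The switch-side commands run on the SONiC CLI and the host-side ones on the GPU node; they're shown together only to capture the ordering. The interface names, breakout mode, and MST device path are placeholders for our environment, so treat this as a sketch rather than a recipe.

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Run one step and fail loudly, so a misordered step is obvious immediately."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# --- Switch side (SONiC CLI). Port name and breakout mode are placeholders. ---
# Break out the port only AFTER the optics are seated (see the optics bug above),
# then force the speed and kill auto-negotiation so the link can't flap back down.
run(["config", "interface", "breakout", "Ethernet0", "2x400G"])
run(["config", "interface", "speed", "Ethernet0", "400000"])      # 400G, in Mb/s
run(["config", "interface", "autoneg", "Ethernet0", "disabled"])
run(["config", "save", "-y"])

# --- Host side (ConnectX-7). MST device path and netdev name are placeholders. ---
# Flip the CX7 from InfiniBand to Ethernet mode (takes effect after a firmware reset).
run(["mlxconfig", "-y", "-d", "/dev/mst/mt4129_pciconf0", "set", "LINK_TYPE_P1=2"])
# Match the switch: forced 400G, autoneg off.
run(["ethtool", "-s", "enp26s0np0", "speed", "400000", "autoneg", "off"])
```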
| Component | Lesson Learned | Better Next Time |
|---|---|---|
| Cabling | Easy to make mistakes, different types of MPO, dust, etc. | Use host and switch software to confirm cabling |
| Optics | Very little interoperability, need to validate EVERY optic with switch | Validate BOM to ensure compatible optics. Management software to provide detailed optics status. Software to identify anomalies. |
| BIOS | Disable IOMMU and PCIe ACS for max performance on NCCL | Management software to validate host BIOS settings |
| OS kernel | Blackwell NVIDIA driver workaround for Ubuntu 24.04 / Kernel 6.8 | Management software to validate versions and check known issues |
| Drivers | Mellanox OFED drivers, RDMA setup, Blackwell support | Management software to automate configuration of host networking |
| Kernel module | nvidia-peermem, DOCA | (See above) |
| NIC | MST tools, disable autoneg, 400G force link, turn CX7 from IB → Eth | (See above) |
| NCCL | NCCL tests must show continuous ranking: 0 → 1 → 2 → ... → N | Have host-management software run NCCL test scripts |
| Switch | RoCE QPN Hashing Mode | Enable RoCE QPN in Hedgehog Fabric from the beginning |
But the real gotcha? BIOS settings. We discovered that if you don't disable IOMMU and PCIe ACS in the BIOS, you take a 50% performance penalty on NCCL. Two BIOS toggles. Fifty percent of your performance. The tradeoff? You can't support KVM VMs anymore; it's either Kubernetes/Slurm training clusters or VMs, not both.
These "magic configs" aren't documented anywhere. You discover them through painful iteration, late-night debugging sessions, and a lot of help from people who've been through it before.
The Supply Chain Reality
Supply chain resilience isn't just a buzzword—it's survival. When a single vendor has a six-month lead time on a critical component, you need alternatives. The open, multi-vendor ecosystem gives you options. If one vendor has supply chain issues (and they will), you can source compatible switches or optics from another. You're not held hostage by a single manufacturer's production line.
This was our lived experience. Without the flexibility to source alternative optics, our B200s would have been sitting idle for half a year.
Why Open Networks Matter
Despite the pain, we'd make the same choice again. Here's why:
Breaking Vendor Lock-In
Traditional networking means buying into a single vendor's ecosystem—their hardware, their proprietary software, their pricing, their roadmap. SONiC changes that equation. It's an open-source Network Operating System that runs on bare metal hardware from multiple vendors, like the Celestica switch we used.
You're no longer locked to a single company. You have choice. And choice creates competition.
Massive Cost Savings
When vendors have to compete on price, customers win. The open model turns switch hardware into a commodity—you can get quotes from multiple hardware vendors who all support the SONiC standard. Plus, SONiC is open source and community-driven through the Linux Foundation. No per-port licensing fees. No per-feature upsells.
Our real-world result: 50% cost savings compared to proprietary InfiniBand. For a large AI cluster, that translates to millions of dollars you can redirect to GPUs instead of networking tax.
Control and Customization
Hyperscalers and cloud providers don't need all the enterprise bloat of traditional switches. With SONiC, you can run a slimmed-down version that increases stability and performance. If you find a bug, you or a partner can fix it directly—you're not at the mercy of a vendor's roadmap. If you need specific features (like advanced telemetry for RDMA), you can build them yourself.
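As a small, host-side example of what building your own telemetry can look like, here's a hypothetical sketch that polls the RDMA hardware counters ConnectX NICs expose in sysfs and prints only the deltas, so sequence errors and congestion notifications stand out. The counter names are ones mlx5 devices typically expose, but treat the list, paths, and polling interval as assumptions to verify against your driver.

```python
import time
from pathlib import Path

# Counters that typically indicate RoCE trouble on mlx5 NICs (names are assumptions
# to verify): sequence errors mean drops/reordering, CNPs mean the fabric is
# signalling congestion back to the senders.
WATCHED = ["out_of_sequence", "packet_seq_err", "np_cnp_sent", "rp_cnp_handled"]

def read_counters() -> dict:
    counters = {}
    for port_dir in Path("/sys/class/infiniband").glob("*/ports/*/hw_counters"):
        dev = port_dir.parts[4]  # e.g. mlx5_0
        for name in WATCHED:
            counter_file = port_dir / name
            if counter_file.exists():
                counters[(dev, name)] = int(counter_file.read_text())
    return counters

def watch(interval_s: int = 10) -> None:
    """Print only the deltas, so a quiet fabric produces no output."""
    prev = read_counters()
    while True:
        time.sleep(interval_s)
        cur = read_counters()
        for key, value in cur.items():
            delta = value - prev.get(key, 0)
            if delta > 0:
                print(f"{key[0]} {key[1]} +{delta} in last {interval_s}s")
        prev = cur

if __name__ == "__main__":
    watch()
```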
This control was essential during our build. When we hit that optics initialization bug, having Hedgehog as an open-source software partner meant we could debug and fix it directly rather than filing a support ticket and waiting weeks.
The Performance That Made It Worth It
After all the debugging, tuning, and configuration archaeology, we achieved something remarkable: 392 GB/s in all-reduce NCCL benchmarks—nearly perfect scaling and better than many H100 clusters in production today.
Each B200 HGX has eight ConnectX-7 400 Gig NICs, which works out to 3.2 Tb/s, or 400 GB/s, of total Ethernet bandwidth per GPU node. We measured 392 GB/s, 98% of line rate, which is about as close to perfect as you can get.
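The measurement itself is straightforward to reproduce. Here's a minimal sketch of how a run can be scripted with the standard nccl-tests all_reduce_perf binary, pulling the average bus bandwidth out of its summary line; the mpirun invocation, hostfile, rank count, and threshold are placeholders for whatever launcher and node count your cluster uses.

```python
import re
import subprocess

# Placeholder launcher: 2 nodes x 8 GPUs, one rank per GPU, using the hypothetical
# hostfile ./hosts. Swap in srun or your scheduler of choice as appropriate.
CMD = [
    "mpirun", "--hostfile", "hosts", "-np", "16",
    "./build/all_reduce_perf", "-b", "1G", "-e", "8G", "-f", "2", "-g", "1",
]

EXPECTED_BUSBW_GBPS = 390.0  # what "healthy" looked like for our 8x400G nodes

result = subprocess.run(CMD, capture_output=True, text=True, check=True)
print(result.stdout)

# nccl-tests ends with a summary line like "# Avg bus bandwidth : 392.01"
match = re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", result.stdout)
busbw = float(match.group(1)) if match else 0.0

if busbw < EXPECTED_BUSBW_GBPS:
    raise SystemExit(f"Bus bandwidth {busbw} GB/s below expected {EXPECTED_BUSBW_GBPS} GB/s")
print(f"OK: {busbw} GB/s average bus bandwidth")
```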
The performance was superb. And compared to previous-generation InfiniBand XDR clusters, we're not talking about a small price difference: it's a 3x price difference. Can you charge customers 30 cents more per hour for a GPU instance to justify InfiniBand for training? Usually not. The economics speak for themselves.
What We're Contributing Back
The 17-day crash course was painful, but it was a one-time cost. Now that we have the recipe, it's repeatable. More importantly, we're contributing everything we learned back to the OCP community as an AI Training Reference Architecture.
This includes:
- Complete bill of materials
- Validated hardware compatibility lists (switches, optics, cables)
- BIOS and kernel configuration guides
- NIC tuning parameters for optimal RDMA performance
- Ansible playbooks for automated configuration
- NCCL testing procedures and benchmarks
Our goal: the next team that builds an open AI cluster shouldn't spend 17 days debugging. From the time everything's plugged in and cabled, a 64-GPU cluster should be up and running in an afternoon.
The Upshot
Yes, building an open network is hard. The complexity is real—90% of our cluster build time was spent on networking and debugging network issues, not racking hardware.
But the economics are undeniable. We achieved industry-leading performance at a fraction of the cost of proprietary solutions. And by contributing our learnings back to the OCP community, we're making it easier for everyone who comes after us.
The open networking ecosystem is maturing rapidly. Projects like SONiC, combined with open hardware designs from OCP, are proving that you don't need proprietary lock-in to achieve elite performance. You just need partners who are willing to do the hard work of validation, testing, and documentation.
We're already planning our next cluster—a B300 build with 800 Gig ConnectX-8 NICs. We'll be pathfinding again, probably learning new lessons the hard way. But we'll be sharing everything we learn.
Because that's what open infrastructure is about: making the next build easier than the last one.