Part I: The Evolving Landscape of AI Storage
Section 1: The Data Gravity Well: Why Traditional Storage Fails the Modern AI Factory
The relentless pace of innovation in Artificial Intelligence (AI) and Machine Learning (ML) has created an insatiable demand for computational power. This technological wave is underpinned by two parallel explosions: the exponential growth in the size and complexity of datasets and the remarkable performance gains of specialized accelerators like GPUs. The AI-powered storage market reflects this reality, with projections showing a surge from approximately $31.88 billion in 2024 to over $103 billion by 2029. This expansion is fueled by the need to manage and process the massive volumes of unstructured data necessary for training sophisticated models. This rapid progress has exposed a critical vulnerability in modern AI infrastructure: the "data bottleneck."
This performance inhibitor manifests when powerful and expensive GPUs sit in costly idle cycles, waiting for data to be delivered from storage. In a traditional storage architecture, the bottleneck is systemic because the data path is inefficient and CPU-centric: data is first read from a high-speed NVMe SSD into the system's main memory (RAM), and the CPU must then process it before initiating a second copy from system RAM to the GPU's dedicated memory. This intermediate staging area, often called a "bounce buffer," introduces significant latency and consumes valuable CPU cycles. The result is a system in which the primary bottleneck is no longer the speed of the endpoints (GPUs and SSDs) but the efficiency of the data pathway connecting them.
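To make the two-hop path concrete, here is a minimal sketch in CUDA C of the conventional pattern: a read() into a host staging buffer followed by a cudaMemcpy into GPU memory. The file path and transfer size are hypothetical placeholders and error handling is trimmed; this illustrates the bounce-buffer flow described above, not code from any particular data loader.

```c
/* Illustrative sketch of the traditional, CPU-centric data path:
 * SSD -> host RAM ("bounce buffer") -> GPU memory.
 * Path and buffer size are hypothetical placeholders. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t size = 64UL << 20;               /* 64 MiB chunk (hypothetical)  */
    int fd = open("/data/sample.bin", O_RDONLY);  /* hypothetical dataset file    */
    if (fd < 0) { perror("open"); return 1; }

    void *host_buf = malloc(size);                /* the intermediate bounce buffer */
    void *gpu_buf  = NULL;
    cudaMalloc(&gpu_buf, size);

    /* Hop 1: storage -> host RAM, consuming CPU cycles and memory bandwidth. */
    ssize_t n = read(fd, host_buf, size);
    if (n < 0) { perror("read"); return 1; }

    /* Hop 2: host RAM -> GPU memory, a second full copy of the same data. */
    cudaMemcpy(gpu_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);

    cudaFree(gpu_buf);
    free(host_buf);
    close(fd);
    return 0;
}
```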
Underutilized compute hardware translates directly to delayed time-to-insight for AI initiatives, representing a significant operational drain and a diminished return on investment (ROI). AI and ML workloads also vary widely in their storage requirements, with I/O patterns ranging from the small, random reads typical of metadata-intensive tasks to the large, sequential reads characteristic of loading massive datasets. Storage systems must be able to service these diverse demands rapidly.
The central challenge in AI infrastructure has evolved. It is no longer sufficient to simply procure faster components; the focus must shift to intelligent, holistic system design. The industry increasingly recognizes that competition in the AI infrastructure arena is decided not by raw performance alone, but also by Total Cost of Ownership (TCO) and operational efficiency. The high capital expenditure required to build out AI infrastructure remains a major barrier to entry for many organizations. Mitigating the data bottleneck is therefore not merely a technical optimization; it is a fundamental business imperative that requires a new way of thinking about the relationship between hardware, software, and system architecture.
Section 2: The Industry's Response: Scale-Out Architectures and Direct Data Paths
In response to this critical data bottleneck, enterprises have largely converged on a set of powerful technologies designed to create more direct and efficient data paths to the GPU. A prominent example of this modern approach is NVIDIA® GPUDirect® Storage (GDS). GDS enables a direct data path for Direct Memory Access (DMA) transfers between storage devices, such as NVMe SSDs, and the memory of one or more GPUs. This architecture fundamentally redraws the I/O path, bypassing the CPU and the bounce buffer in main memory and eliminating the extra copy.
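For contrast with the bounce-buffer sketch above, here is a minimal sketch of the direct path using the cuFile API, the user-facing interface to GDS. It assumes a GDS-capable driver, filesystem, and GPU; the file path and transfer size are hypothetical placeholders, exact signatures depend on the CUDA toolkit version, and error handling is trimmed.

```c
/* Illustrative sketch of the GPUDirect Storage path via the cuFile API:
 * data moves by DMA from the NVMe device into GPU memory, with no bounce
 * buffer in host RAM.  Assumes a GDS-capable driver, filesystem, and GPU. */
#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const size_t size = 64UL << 20;                         /* 64 MiB (hypothetical) */
    int fd = open("/data/sample.bin", O_RDONLY | O_DIRECT); /* hypothetical file     */
    if (fd < 0) { perror("open"); return 1; }

    cuFileDriverOpen();                        /* initialize the GDS driver */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);     /* register the file with cuFile */

    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, size);
    cuFileBufRegister(gpu_buf, size, 0);       /* pin the GPU buffer for DMA */

    /* Single hop: storage -> GPU memory, no host bounce buffer in the path. */
    ssize_t n = cuFileRead(handle, gpu_buf, size, /*file_offset=*/0, /*buf_offset=*/0);
    if (n < 0) fprintf(stderr, "cuFileRead failed\n");

    cuFileBufDeregister(gpu_buf);
    cuFileHandleDeregister(handle);
    cuFileDriverClose();
    cudaFree(gpu_buf);
    close(fd);
    return 0;
}
```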
The prevailing architectural pattern for deploying GDS involves scale-out, multi-node systems. In this paradigm, a cluster of dedicated storage servers, often running a parallel filesystem (like Lustre or Spectrum Scale) or a modern object storage platform, serves data over a high-speed network fabric (such as InfiniBand or 200 Gb Ethernet) to one or more GPU client nodes. This distributed approach is capable of delivering massive aggregate bandwidth, and it is the strategy employed by many vendors in the industry and commonly seen in benchmarks such as MLPerf®. The multi-node submission referenced in our benchmark comparison, which utilized more than 15 hosts to achieve its result, is a perfect real-world example of this scale-out philosophy.
This convergence around GDS and scale-out architectures reflects a broader market perception that top-tier performance is synonymous with large, complex, and costly distributed systems. While undeniably effective at addressing the I/O problem, such an approach introduces a new set of challenges. The capital expenditure required for numerous servers, specialized high-speed networking, rack space, power, and cooling can be prohibitive. Additionally, deploying, managing, and maintaining a distributed parallel filesystem or object store is inherently more complex than operating a single system. Thus, the industry's standard solution effectively trades one problem, I/O path inefficiency, for another: infrastructural and operational complexity.
Part II: A New Paradigm: The Power of Single-Node Optimization
Section 3: Our Philosophy: Maximizing Potential Before Scaling
At FarmGPU, we champion a philosophy that challenges the industry consensus. We believe that AI acceleration transcends simply acquiring more hardware; it demands intelligent, holistic system design and a commitment to unlocking the latent potential of existing resources. Our approach is encapsulated by the popular adage: "Less is More." This principle guides our engineering efforts and defines our value proposition to the market.
We contend that before an organization undertakes the significant capital expenditure and operational complexity of a massive scale-out deployment, the first, and most critical, step should be to maximize the performance of a more constrained infrastructure footprint. This positions our optimization strategy as a financially prudent pathway for businesses at any stage of their AI journey. For startups and research institutions, it offers an accessible entry point to world-class performance without prohibitive upfront investment. For large enterprises, it provides a radically more efficient and scalable building block, fundamentally improving the economics of their AI factories.
This perspective directly confronts the belief that impactful performance gains can only be achieved through infrastructure expansion. We demonstrate what we believe to be a superior approach, one in which deep, system-level optimization yields substantial performance returns. Our core differentiator is not merely the hardware we operate, but the comprehensive engineering expertise we apply to it. The components used in our record-breaking system (the server, CPU, and SSDs) are commercially available. In theory, any competitor could assemble an identical bill of materials.
Our groundbreaking achievement was not born from component selection, but from the relentless optimization we apply across every component in the FarmGPU stack, from data center infrastructure, networking, observability and monitoring, to storage.
The true advantage we offer is our intellectual property: the engineering methodology that allows us to extract unprecedented performance from a single node. This reframes the conversation from a simple hardware transaction to a strategic partnership built on expertise and a shared goal of maximizing efficiency and ROI.
Section 4: Deconstructing Our MLPerf® Triumph: A Technical Analysis
Our philosophy of single-node optimization is not merely a theoretical position; it is a demonstrable reality, objectively validated by our groundbreaking results in the MLPerf® Storage v2.0 benchmark.
The Gold Standard: Why MLPerf® Matters
MLPerf®, developed and governed by the open engineering consortium MLCommons®, represents the AI/ML compute industry's most respected and rigorous suite of benchmarks for machine learning systems. Its fundamental purpose is to provide fair, relevant, and reproducible performance measurements that enable objective, "apples-to-apples" comparisons. In a market often clouded by proprietary benchmarks and anecdotal performance claims, MLPerf® serves as the critical standard for transparency and verifiable results. Achieving a top-tier result in MLPerf® is not an assertion of strength; it is an objective proof point, validated by a neutral third party against a globally recognized standard.
Within the MLPerf® Storage benchmark, performance is evaluated on two primary metrics. The first is Throughput, a measure of how quickly the storage system can deliver data, reported in both gigabytes per second (GB/s) for storage practitioners and samples per second for ML researchers. The second, equally critical, metric is Accelerator Utilization (AU%), which quantifies how effectively the storage system keeps the (simulated) GPUs busy and productive. The MLPerf® Storage rules require submissions to maintain a minimum AU, ensuring that the reported throughput is not achieved at the cost of starving the accelerators. A high throughput score coupled with high AU% signifies that the storage system is operating efficiently and not bottlenecking the overall training process.
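As a back-of-the-envelope illustration of how these two metrics are computed and related, here is a small sketch. Every number in it is a hypothetical placeholder, and the official benchmark harness derives AU% from its own timing model, so treat this as conceptual rather than a reimplementation of the rules.

```c
/* Conceptual sketch only: how throughput (GB/s, samples/s) and accelerator
 * utilization (AU%) relate.  All numbers are hypothetical placeholders; the
 * official MLPerf Storage harness computes AU from its own timing model. */
#include <stdio.h>

int main(void) {
    double samples            = 400000.0;  /* samples delivered during the run (hypothetical)        */
    double bytes_per_sample   = 140e6;     /* ~140 MB per 3D U-Net sample (approximate)               */
    double wall_seconds       = 600.0;     /* total benchmark wall-clock time (hypothetical)          */
    double accel_busy_seconds = 575.0;     /* time the simulated GPUs spent computing (hypothetical)  */

    double gb_per_s   = samples * bytes_per_sample / wall_seconds / 1e9;
    double samples_ps = samples / wall_seconds;
    double au_percent = 100.0 * accel_busy_seconds / wall_seconds;  /* simplified AU */

    printf("Throughput: %.1f GB/s (%.0f samples/s), AU: %.1f%%\n",
           gb_per_s, samples_ps, au_percent);
    return 0;
}
```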
The 3D U-Net Gauntlet: A Real-World Stress Test
For our submission, we focused on the 3D U-Net workload, which is known to be especially demanding on storage subsystems. It simulates the training of a deep neural network for medical image segmentation, a task critical to modern diagnostics such as identifying tumors in 3D CT or MRI scans. From a storage perspective, it involves processing large, volumetric data files, which translates to an I/O pattern dominated by large, sequential reads. This type of workload is expected to saturate the bandwidth between the storage device and system memory, making it an excellent stress test for raw throughput capabilities. The direct relevance of 3D U-Net to mission-critical, real-world applications in healthcare and scientific simulation means that success in this benchmark is not just an academic exercise. It proves that our solution is not merely generically fast, but is precisely optimized to accelerate some of the most demanding and valuable AI applications in production today.
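As a rough illustration of that access pattern, the sketch below walks a set of large volumetric sample files and reads each one front-to-back in big chunks. The directory layout, file names, sample count, and 16 MiB chunk size are hypothetical placeholders, not details of the actual benchmark data loader.

```c
/* Rough sketch of the access pattern that dominates 3D U-Net training input:
 * large volumetric samples read sequentially in big chunks.  Paths, counts,
 * and chunk size are hypothetical placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t chunk = 16UL << 20;       /* 16 MiB sequential reads (hypothetical) */
    char *buf = malloc(chunk);

    for (int i = 0; i < 1000; i++) {       /* hypothetical number of samples */
        char path[64];
        snprintf(path, sizeof(path), "/data/unet3d/sample_%05d.npz", i);  /* hypothetical layout */
        int fd = open(path, O_RDONLY);
        if (fd < 0) continue;

        /* Stream the whole volume front-to-back: large, sequential reads that
         * reward raw bandwidth rather than IOPS. */
        while (read(fd, buf, chunk) > 0)
            ;                              /* hand each chunk to preprocessing */
        close(fd);
    }
    free(buf);
    return 0;
}
```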
The Record-Breaking Result: Single-Node Efficiency vs. Scale-Out Complexity
Our single-system submission established a new benchmark for performance, achieving the fastest single-node storage throughput in the MLPerf® Storage v2.0 benchmark for the 3D U-Net training workload.
The performance of our single-node solution is not just impressive in isolation; it rivals and, in some cases, surpasses the capabilities of complex multi-node systems submitted by established industry players. The following table provides a comparison between our optimized single-node system and a representative multi-node scale-out solution that achieved similar performance in the same benchmark. This contrast makes the argument for resource efficiency and lower TCO both visceral and undeniable.
| Metric / Configuration | FarmGPU (Single Node) | Competitor (Multi-Node Scale-Out) | Efficiency Advantage |
| --- | --- | --- | --- |
| Workload | 3D U-Net | 3D U-Net | N/A |
| Total Throughput | 116 GB/s | 100 GB/s | +16% Higher Throughput |
| Simulated H100 GPUs | 40 | 36 | +11% More GPUs Supported |
| System Configuration | 1 Host | 15+ Hosts | 15x Reduction in Server Count |
| Storage Server Rack Units | 2U (1x Server) | 15-30U (15x Servers) | Up to 93% Reduction in Storage Server RU |
| Implied Complexity | Single Point of Management | Distributed System, High-Speed Fabric | Radically Simplified Architecture |
| Performance per Host | 116 GB/s | ~6.7 GB/s | ~17.3x Greater Per-Host Efficiency |
Performance Analysis: From Baseline to Peak
Comparing unoptimized, production-optimized, and theoretical maximum throughput
Key Insights
• The unoptimized baseline represents typical out-of-box performance without specialized tuning
• Our MLPerf submission achieves 116 GB/s while maintaining all production safeguards and compliance requirements
• The hero run demonstrates the raw hardware capability of 322 GB/s, showing significant headroom for future optimizations
• Even with production constraints, we achieve 36% of the theoretical maximum, significantly higher than typical systems
As the data clearly shows, our single 2U server delivered 16% more throughput while supporting 11% more simulated GPUs, and it did so with a 15-fold reduction in server hardware and a potential 93% reduction in rack space. The per-host efficiency of our system is over 17 times greater than that of the scale-out competitor, a testament to the power of holistic system optimization.
Section 5: The Engineering Behind the Numbers: A Holistic Stack Approach
Our record-breaking achievement was not the result of a single breakthrough, but the culmination of a deliberate, end-to-end engineering process that optimized every layer of the software and hardware stack. This holistic approach is the true differentiator behind our performance.
The Hardware Foundation
The foundation of our record-setting system was a single Supermicro® SYS-212H-TN 2U Rack Mount server, chosen for its density and support for the latest-generation components. At its heart was an Intel® Xeon® 6781P processor, whose 80 high-frequency cores provided the immense parallel processing capability required to orchestrate and drive massive I/O operations without creating a CPU bottleneck. The system was equipped with 256 GB of high-speed 6400 MT/s DDR5 DRAM, ensuring that data could be cached and buffered efficiently to feed the workload.
The raw I/O power was supplied by 24 Solidigm® D7-PS1010 7.68 TB PCIe 5.0 NVMe SSDs. These state-of-the-art drives represent the pinnacle of modern flash storage, offering tremendous bandwidth and low latency. We extend our sincere gratitude to our technology partners: Intel®, for generously donating the server platform, and Solidigm®, for providing the exceptional NVMe drives that made this achievement possible.



The Software and Configuration Edge
While the hardware provided a powerful foundation, it was our software configuration and low-level tuning that unlocked its true potential.
First, to maximize throughput, we configured the 24 NVMe SSDs in a RAID 0 array. RAID 0 combines multiple physical drives into a single, large logical volume. Data is written in blocks, or "stripes," that are spread evenly across all drives in the array. This allows the system to read and write to all disks in parallel, effectively aggregating their individual performance. While this configuration offers no data redundancy (the failure of a single drive results in the loss of all data on the array), it is an ideal choice for performance-critical workloads. In many AI training pipelines, datasets are treated as ephemeral or are read-only, with data resilience and backup handled at a higher level in the MLOps workflow. For the singular goal of maximizing performance, RAID 0 was the optimal engineering choice. Depending on the application and environment, alternative RAID configurations may be preferable.
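To illustrate the striping idea, here is a minimal sketch of how a RAID 0 layout maps a logical byte offset onto a drive and an offset within that drive. The 512 KiB chunk size is a hypothetical placeholder rather than the actual parameter of our array; the point is simply that consecutive chunks rotate across all 24 drives, so large sequential reads engage every drive at once.

```c
/* Minimal sketch of RAID 0 striping: consecutive chunks of the logical volume
 * rotate across the drives, so large sequential I/O engages every drive in
 * parallel.  The chunk size below is a hypothetical placeholder. */
#include <stdint.h>
#include <stdio.h>

#define NUM_DRIVES  24
#define CHUNK_BYTES (512u * 1024u)   /* hypothetical stripe-unit (chunk) size */

/* Map a logical byte offset to (drive index, offset within that drive). */
static void map_offset(uint64_t logical, int *drive, uint64_t *drive_off) {
    uint64_t chunk = logical / CHUNK_BYTES;               /* which chunk of the volume       */
    *drive     = (int)(chunk % NUM_DRIVES);               /* chunks rotate across the drives */
    *drive_off = (chunk / NUM_DRIVES) * CHUNK_BYTES + logical % CHUNK_BYTES;
}

int main(void) {
    for (uint64_t off = 0; off < 4ULL * CHUNK_BYTES; off += CHUNK_BYTES) {
        int d; uint64_t doff;
        map_offset(off, &d, &doff);
        printf("logical offset %8llu -> drive %2d, drive offset %llu\n",
               (unsigned long long)off, d, (unsigned long long)doff);
    }
    return 0;
}
```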
Second, before attempting the MLPerf® benchmark, we conducted extensive internal testing to characterize the absolute limits of the system. For this, we utilized industry-standard, low-level performance tools, including the Storage Performance Development Kit (SPDK) and the Flexible I/O Tester (FIO). SPDK is a suite of open-source libraries and tools that enables the creation of high-performance, user-space storage applications, bypassing kernel overhead to get closer to the hardware. FIO is the definitive tool for generating and measuring a wide variety of I/O workloads, allowing for granular simulation of different access patterns. Our internal testing with FIO and SPDK demonstrated the system's astonishing raw capability, achieving a peak performance of 322 GB/s.
This internal result is nearly three times higher than our official MLPerf® submission of 116 GB/s. This gap is expected, as a synthetic "hero run" with FIO measures the theoretical maximum of the hardware, whereas MLPerf® measures performance under the constraints of a complex, real-world application. However, the sheer scale of the FIO result is critically important. It proves that the hardware foundation we selected is immensely powerful, and it demonstrates our expertise in using advanced tools to characterize and understand the absolute performance limits of our systems. Most importantly, it shows that our single-node solution was not even fully saturated by the demanding 3D U-Net workload, suggesting that it possesses significant performance headroom that we can unlock in the future.
Part III: The Tangible Business Impact and Future Outlook
Section 6: Beyond Benchmarks: Translating Performance into Business Value
While benchmark victories are a testament to engineering excellence, their true value lies in the tangible business outcomes they enable. Our system-level optimizations deliver not only record-breaking speed but also a profound improvement in overall workload efficiency, which translates directly to cost savings, increased productivity, and a faster return on investment.
To quantify this impact, we compared the performance of the 3D U-Net training workload on our fully optimized system against the same hardware without our specific tuning. The results, presented below, highlight a dramatic reduction in both the time and energy required to complete the task.
| Metric | Without Our Optimizations | With Our Optimizations | Business Implication |
| --- | --- | --- | --- |
| Time to Complete 3D U-Net | 66.92 minutes | 28.75 minutes | 57% Faster Time-to-Insight |
| Total Energy Consumed | | 0.3891 kWh | 57% Lower Operational Costs |
A 57% reduction in the time required to complete the training workload is a transformative gain. For data science and ML engineering teams, this means more experiments can be run per day, accelerating iterative development cycles and allowing for more rapid model refinement. Ultimately, this faster time-to-insight shortens the path from concept to production, directly accelerating the ROI for critical AI initiatives.
Furthermore, the corresponding 57% reduction in total energy consumed provides a significant operational expenditure (OpEx) benefit. By completing the same amount of work in less than half the time, the system operates at peak power for a much shorter duration, drastically cutting electricity costs. This aligns perfectly with the growing industry trend toward "green AI" and sustainable computing, a key priority for modern data center operators seeking to manage costs and reduce their environmental impact. This efficiency dividend demonstrates that our performance-centric approach is also an economically and environmentally responsible one.
Section 7: Our Vision for the Future: Accessible, Efficient, and Powerful AI Infrastructure
We have clearly proven that a single, well-engineered, and holistically optimized storage server can deliver AI storage performance that not only rivals but, in critical metrics, surpasses the capabilities of complex and costly multi-node solutions. This groundbreaking achievement validates our core philosophy: peak performance does not always require massive scale-out from day one. Instead, a smarter, more efficient approach focused on maximizing the potential of each component can yield superior results.
Our ongoing mission at FarmGPU is to forge a more accessible, cost-effective, and efficient pathway to accelerating AI initiatives. By removing the infrastructure barriers that have traditionally limited access to high-performance computing, we are enabling a broader range of businesses, from agile startups to large enterprises, to build, train, and deploy sophisticated AI models. This approach helps to democratize AI, fostering innovation across diverse industries.
It is crucial to understand that our single-node achievement is not an argument against scaling. Rather, it is an argument for a smarter way to scale. Our optimized server represents a fundamentally superior building block for AI infrastructure. An organization needing to achieve 200 GB/s of throughput could do so with just two of our nodes, whereas a less-optimized approach might require thirty or more servers to achieve the same result. By starting with the most powerful and efficient unit possible, the entire growth trajectory becomes more cost-effective, more power-efficient, and simpler to manage. We offer not just a point solution, but the most powerful and economical building block for constructing the AI factories of the future.
We are committed to continuing our work at the cutting edge of AI infrastructure. We believe that by focusing on system-level efficiency, we can continue to push the boundaries of what is possible, making advanced AI capabilities more powerful and accessible for everyone.
Want to learn more about how this groundbreaking performance was achieved? Read our full MLPerf® Storage v2.0 submission details and results.
Ready to eliminate the storage bottleneck in your AI pipeline and accelerate your time-to-insight? Contact the FarmGPU performance engineering team today to discuss how we can revolutionize your AI infrastructure.