At the OCP Global Summit 2025 in San Jose, CA, Meta detailed its strategy for scaling AI infrastructure to regional data center deployments, emphasizing open, collaborative, and highly scalable designs to support growing AI workloads. In his October 14th keynote, Meta’s VP of Data Center Infrastructure, Dan Rabinovitsj, discussed strategies for deploying and operating AI at scale across Meta’s data center regions. The session highlighted innovations for building AI-ready data centers, focusing on open hardware, power innovation, and the challenges of next-generation AI infrastructure.
Initiatives discussed included: new Ethernet standards for AI clusters, integration of the Ultra Ethernet Consortium standard, Meta’s vision for open networking hardware, AMD’s “Helios” rack-scale AI platform, MSI’s integrated OCP solutions, next-gen liquid cooling, and solutions for distributed and edge AI.
Rabinovitsj highlighted Meta’s contributions to open standards and hardware innovations, including the Open Rack Wide standard and advanced networking concepts for AI clusters.
Meta also announced several new milestones for data center networking:
- The evolution of Disaggregated Scheduled Fabric (DSF) to support scale-out interconnect for large AI clusters that span entire data center buildings.
- A new Non-Scheduled Fabric (NSF) architecture, based entirely on shallow-buffer, disaggregated Ethernet switches, that will support Meta’s largest AI clusters, such as Prometheus.
- The addition of Minipack3N, based on NVIDIA’s Spectrum-4 Ethernet ASIC, to Meta’s portfolio of 51 Tbps OCP switches that use OCP’s SAI and Meta’s FBOSS software stack.
- The launch of the Ethernet for Scale-Up Networking (ESUN) initiative, focused on making Ethernet suitable for connecting high-performance processors, or GPUs, within a single rack by emphasizing requirements like low latency, high bandwidth, and lossless transfers. Meta has been working with other large-scale data center operators and leading Ethernet vendors to advance the use of Ethernet for scale-up networking (specifically, the high-performance interconnects required by next-generation AI accelerator architectures).
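The "lossless transfers" requirement above means a scale-up link must back-pressure a sender rather than drop frames when receive buffers fill. A minimal sketch of one classic way to achieve this, credit-based flow control, is below; it is purely illustrative — the article does not specify which mechanisms ESUN will standardize, and the `LosslessLink` class is a hypothetical toy, not an ESUN API.

```python
from collections import deque

class LosslessLink:
    """Toy credit-based flow control: the sender may transmit only while
    the receiver has advertised free buffer slots (credits), so frames
    are never dropped for lack of buffer space. Illustrative only; not
    an actual ESUN mechanism."""

    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots   # credits = free receive-buffer slots
        self.rx_queue = deque()

    def try_send(self, frame) -> bool:
        if self.credits == 0:            # no buffer at receiver: hold, don't drop
            return False
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self):
        frame = self.rx_queue.popleft()  # receiver consumes a frame...
        self.credits += 1                # ...and returns a credit to the sender
        return frame

link = LosslessLink(rx_buffer_slots=2)
assert link.try_send("f1") and link.try_send("f2")
assert not link.try_send("f3")           # back-pressured, not dropped
link.receiver_drain()
assert link.try_send("f3")               # credit returned, send proceeds
```

The key property for AI traffic is the third assertion: under overload the sender stalls instead of losing a frame, which would otherwise trigger costly retransmissions mid-collective.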
OCP Summit 2025: The Open Future of Networking Hardware for AI
- Open Rack Wide (ORW) standard: Meta introduced the ORW specification, a new open standard for double-wide equipment racks designed to meet the extreme power, cooling, and serviceability demands of next-generation AI systems. AMD, a partner of Meta, showcased its “Helios” rack-scale platform built to be compliant with this new standard.
- Networking fabrics for AI clusters: Meta detailed its networking architecture, revealing the following innovations:
- Disaggregated Scheduled Fabric (DSF): An updated version of DSF was discussed (see below), which now provides non-blocking interconnects for clusters of up to 18,432 XPUs (AI processors).
- Non-Scheduled Fabric (NSF): Meta unveiled NSF, a new fabric for its largest AI clusters, which runs on shallow-buffer, disaggregated Ethernet switches to reduce latency. NSF is planned for Meta’s upcoming multi-gigawatt “Prometheus” clusters. See next section below for details.
- FBNIC: Meta announced FBNIC, a network ASIC of Meta’s own design.
- 51T switches: Meta revealed new 51T network switches, which utilize Broadcom and Cisco ASICs.
- Next-generation optical connections: For faster and higher-capacity optical interconnections, Meta discussed its adoption of 2x400G FR4-LITE and 400G/2x400G DR4 optics for its 400G and 800G connectivity.
- Sustainable hardware: As part of its 2030 net-zero goals, Meta presented a new AI-powered methodology for tracking and estimating the carbon emissions of its IT hardware. The methodology will be open-sourced for the wider industry.
……………………………………………………………………………………………………………………………………………………………………………………………………………………………………
Deep Dive into DSF and NSF:
- Non-blocking scale: An updated, two-stage architecture for DSF can now support a non-blocking fabric for up to 18,432 XPUs (AI processors). This allows all-to-all communication between a significantly larger number of GPUs without performance degradation.
- Proactive congestion avoidance: DSF uses a Virtual Output Queue (VOQ)-based system to manage traffic flow. By scheduling traffic between endpoints, it proactively avoids congestion before it occurs, which improves bandwidth delivery and overall network efficiency.
- Open and standardized: The fabric is built on open standards like the OCP-SAI (Switch Abstraction Interface) and is managed by Meta’s own network operating system, FBOSS. This vendor-agnostic approach allows Meta to use components from different suppliers and avoid vendor lock-in.
- Optimal load balancing: Traffic is “sprayed” across all available links and switches, ensuring an equal load and smooth performance for bandwidth-intensive workloads like AI training.
- Low latency: Unlike DSF, which relies on scheduling, NSF runs on shallow-buffer, disaggregated Ethernet switches. This reduces round-trip latency, making NSF ideal for the most latency-sensitive AI workloads.
- Adaptive routing: The NSF architecture is a three-tier fabric that supports adaptive routing for effective load-balancing. This helps minimize congestion and ensure optimal utilization of GPUs, which is critical for maximizing performance in Meta’s largest AI factories.
- Disaggregated design: Like DSF, NSF is built on a disaggregated design. This allows Meta to scale its network by using interchangeable, industry-standard components instead of a single vendor’s closed system.
- DSF: Provides a high-efficiency, highly scalable network for Meta’s large, but still modular, AI clusters.
- NSF: Optimized for the extreme demands of Meta’s largest, gigawatt-scale “AI factories,” such as Prometheus, where low latency and robust adaptive routing are paramount.
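The VOQ-based congestion avoidance described above can be sketched in miniature: each ingress keeps one virtual output queue per egress, and a central grant step lets each egress accept at most its line rate per cycle, so hotspots are prevented before they form rather than reacted to. This is a deliberately simplified model of a scheduled fabric in general, not Meta's actual DSF scheduler; all class and function names are invented for illustration.

```python
from collections import defaultdict, deque

class VoqIngress:
    """One Virtual Output Queue per egress, so a burst toward a hot
    egress cannot head-of-line-block traffic bound for other egresses."""
    def __init__(self):
        self.voqs = defaultdict(deque)

    def enqueue(self, egress, cell):
        self.voqs[egress].append(cell)

def schedule(ingresses, egress_capacity=1):
    """One toy scheduling round: each egress is granted at most
    `egress_capacity` cells, so no egress is ever oversubscribed.
    A simplification of scheduled-fabric behavior, not Meta's algorithm."""
    grants_left = defaultdict(lambda: egress_capacity)
    delivered = []
    for ing in ingresses:
        for egress, q in ing.voqs.items():
            if q and grants_left[egress] > 0:
                delivered.append((egress, q.popleft()))
                grants_left[egress] -= 1
    return delivered

a, b = VoqIngress(), VoqIngress()
a.enqueue(0, "a0"); b.enqueue(0, "b0"); b.enqueue(1, "b1")
print(schedule([a, b]))   # egress 0 admits one cell; egress 1 is independent
print(schedule([a, b]))   # b's second cell for egress 0 drains next round
```

Note how "b0" waits a round while "b1" goes through immediately: the fabric never lets egress 0 oversubscribe, which is the proactive-avoidance property the bullet describes.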
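The two load-balancing styles contrasted above — DSF's spraying across all links versus NSF's adaptive routing — can be distilled into two one-line policies. Both functions below are illustrative caricatures under stated assumptions (spraying reduced to round-robin by sequence number; adaptive routing reduced to least-queued-link selection), not Meta's production algorithms.

```python
def spray(seq_no: int, links: list) -> str:
    """DSF-style spraying sketch: distribute cells evenly across every
    available link by sequence number, keeping load equal by construction."""
    return links[seq_no % len(links)]

def adaptive_route(links: list, queue_depth: dict) -> str:
    """NSF-style adaptive-routing sketch: steer each packet to the
    currently least-loaded link instead of a fixed choice, routing
    around congestion as it appears."""
    return min(links, key=lambda link: queue_depth[link])

links = ["l0", "l1", "l2", "l3"]
print(spray(5, links))                                   # -> l1
print(adaptive_route(links[:3], {"l0": 7, "l1": 2, "l2": 5}))  # -> l1
```

The design trade-off mirrors the DSF/NSF split in the article: spraying gives perfectly even load but assumes the fabric (via scheduling) handles reordering and congestion, while adaptive routing needs live congestion state but reacts well on shallow-buffer switches where queues must stay short.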
………………………………………………………………………………………………………………………………………………………..
References:
- OCP Summit 2025: The Open Future of Networking Hardware for AI
- Networking at the Heart of AI — @Scale: Networking 2025 Recap
- Big tech spending on AI data centers and infrastructure vs the fiber optic buildout during the dot-com boom (& bust)
- Gartner: AI spending >$2 trillion in 2026 driven by hyperscalers data center investments
- AI Data Center Boom Carries Huge Default and Demand Risks
- Analysis: Cisco, HPE/Juniper, and Nvidia network equipment for AI data centers
- Qualcomm to acquire Alphawave Semi for $2.4 billion; says its high-speed wired tech will accelerate AI data center expansion
- Cisco CEO sees great potential in AI data center connectivity, silicon, optics, and optical systems
- Data Center Networking Market to grow at a CAGR of 6.22% during 2022-2027 to reach $35.6 billion by 2027