Understanding RoCE

This article explores RoCE fundamentals, traffic handling, and congestion management.

Ori Acoca

3/11/2025

Note:
This article assumes NVIDIA devices (NICs and switches).
While vendor-specific implementations may vary, the core concepts are widely applicable.


What is RoCE?

RoCE (RDMA over Converged Ethernet) enables RDMA operations over Ethernet networks, delivering high performance, low latency, and reliability for data-intensive applications.

To understand RoCE, we first need to understand RDMA (Remote Direct Memory Access).

RDMA

Enables direct memory transfers between computers without CPU involvement, bypassing the OS kernel.
This results in:

  • Lower latency: Eliminates copying data from the application into kernel buffers and traversing the kernel's networking stack (zero-copy)

  • Higher throughput: Allows direct memory-to-memory communication

  • Reduced CPU overhead: Frees resources for other tasks


RDMA is sensitive to packet loss and retransmissions, so it requires a reliable, lossless network.

RoCE

Extends RDMA capabilities to traditional Ethernet networks, which are lossy by nature, by implementing protocols that create near-lossless conditions.

RoCE vs. InfiniBand

RDMA is natively supported in InfiniBand, which is inherently lossless and includes built-in congestion management.

InfiniBand

InfiniBand avoids packet loss through credit-based flow control: data is sent only when the receiver has buffer space available. This prevents packet loss before it can occur.
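The credit idea can be illustrated with a small toy model. This is only a sketch of the principle, not the InfiniBand wire protocol; the class name `CreditedLink` and its methods are hypothetical and chosen for illustration.

```python
from collections import deque


class CreditedLink:
    """Toy model of credit-based flow control (hypothetical, for illustration only)."""

    def __init__(self, receiver_buffer_slots):
        # The sender starts with one credit per free receiver buffer slot.
        self.credits = receiver_buffer_slots
        self.receiver_buffer = deque()

    def send(self, packet):
        """Transmit only if a credit is available; otherwise the sender must wait."""
        if self.credits == 0:
            return False  # held back at the sender, so nothing is dropped
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_consume(self):
        """The receiver drains one packet and hands a credit back to the sender."""
        packet = self.receiver_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(receiver_buffer_slots=2)
results = [link.send(f"pkt{i}") for i in range(3)]
print(results)  # [True, True, False]: the third packet waits for a credit

link.receiver_consume()          # frees one buffer slot and returns a credit
retry_ok = link.send("pkt2")
print(retry_ok)  # True
```

The sender never transmits into a full buffer, which is why loss is prevented rather than repaired afterwards.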

RoCE

RoCE takes a reactive approach. It sends traffic and handles congestion as it occurs through the following mechanisms:

  • Explicit Congestion Notification (ECN): Signals senders to slow down

  • Priority Flow Control (PFC): Pauses transmission when buffers are close to full

This means that RoCE relies on congestion management to approximate a lossless network, whereas InfiniBand is a lossless fabric by design. With RoCE, packet loss may still occur.

Why Use RoCE and not InfiniBand?

RoCE allows organizations to run RDMA over existing Ethernet networks, avoiding the cost and vendor lock-in of dedicated InfiniBand infrastructure.

QoS Key Concepts

Traffic Classification

For switches to correctly classify and prioritize RoCE traffic, it is the server's responsibility to mark the traffic. Without this marking, the switch has no way to distinguish RoCE from regular traffic.

Servers can mark traffic at two levels:

  • Layer 2 (L2 Trust): Uses the PCP (Priority Code Point) value in the 802.1Q VLAN tag of Ethernet frames

  • Layer 3 (L3 Trust): Uses a DSCP value in IP packets

Key Consideration
If RoCE traffic spans multiple subnets, L3 marking (DSCP) is required because the Layer 2 priority marking is carried in the Ethernet (VLAN) header, which is replaced at every routed hop. DSCP markings are carried in the IP header and persist across network boundaries, making them the preferred approach for scalable architectures.
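To make the L3 marking concrete: the DSCP value occupies the upper six bits of the IP header's ToS (Traffic Class) byte, and the two ECN bits occupy the lower two. The sketch below only does this bit arithmetic. The helper names are hypothetical, and the DSCP values are the ones used later in this article, not universal defaults.

```python
ROCE_DSCP = 24  # DSCP used for RDMA data in this article
CNP_DSCP = 48   # DSCP used for congestion notification in this article


def dscp_to_tos(dscp, ecn=0):
    """Build the 8-bit ToS byte: DSCP in the upper 6 bits, ECN in the lower 2."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits (0-3)")
    return (dscp << 2) | ecn


def tos_to_dscp(tos):
    """Recover the DSCP value by dropping the two ECN bits."""
    return tos >> 2


print(dscp_to_tos(ROCE_DSCP))        # 96
print(hex(dscp_to_tos(CNP_DSCP)))    # 0xc0
print(dscp_to_tos(ROCE_DSCP, 0b10))  # 98 (DSCP 24 with ECT(0) set)
```

This is why a capture of DSCP 24 traffic shows a ToS byte of 96 (0x60) when the ECN bits are zero.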

Traffic Pools

Modern ASICs in switches use a shared internal buffer for packet queuing across all ports. This buffer can be divided into traffic pools for handling different types of traffic:

  • Lossy Pool (default): Reserved for standard Ethernet traffic, where packet drops are acceptable

  • Lossless Pool: Reserved for RoCE traffic, where packet drops are unacceptable


Packets are mapped to these pools based on their classification (DSCP or Layer 2 priority).

Traffic Classes

A traffic class groups packets of the same type, and all packets in a class share the same traffic pool. RoCE uses two traffic classes, which are mapped to two different traffic pools in the switch's ASIC:

  • Traffic Class 3 (TC3): RDMA data forwarding (DSCP 24)

  • Traffic Class 6 (TC6): Congestion notification (DSCP 48)

All remaining, non-RoCE traffic uses the default Traffic Class 0 (TC0, DSCP 0).
Each traffic class is in turn mapped to a traffic pool.
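The classification chain described above (DSCP to traffic class to traffic pool) can be sketched as two lookup tables. This is an illustrative model, not switch configuration: the table and function names are hypothetical, and because the article does not state which pool TC6 uses, that entry is deliberately left unmapped.

```python
from typing import Optional, Tuple

# DSCP -> traffic class, using the values from this article.
DSCP_TO_TC = {24: 3, 48: 6}
DEFAULT_TC = 0  # everything else lands in TC0

# Traffic class -> pool. TC6's pool is not stated in the article, so it is omitted.
TC_TO_POOL = {0: "lossy", 3: "lossless"}


def classify(dscp: int) -> Tuple[int, Optional[str]]:
    """Return (traffic_class, pool) for a packet's DSCP; pool is None if unknown."""
    tc = DSCP_TO_TC.get(dscp, DEFAULT_TC)
    return tc, TC_TO_POOL.get(tc)


print(classify(24))  # (3, 'lossless')
print(classify(0))   # (0, 'lossy')
print(classify(10))  # (0, 'lossy'): unlisted DSCP values fall back to TC0
```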

Congestion Management

Since a lossless network must avoid packet drops, congestion has to be managed through signaling and flow control mechanisms:

  • ECN (Explicit Congestion Notification): A congestion signaling protocol that announces network congestion. ECN-enabled devices experiencing congestion mark the "ECN bits" in the packets' IP headers to indicate it. When a receiver gets an ECN-marked packet, it replies with an "ECN echo" to the sender (in RoCEv2, a Congestion Notification Packet, or CNP), signaling it to slow down.

  • PFC (Priority Flow Control): Controls congestion by pausing only the affected traffic class (RoCE TC3 in our case), allowing unaffected classes to continue flowing. This improves upon the older global-pause mechanism, which paused all traffic.
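The interplay of the two mechanisms can be sketched with a single queue that marks ECN above one depth and asserts a PFC pause above a higher one. This is a simplified model under stated assumptions: the class name and all thresholds are hypothetical, and marking uses a single hard threshold, whereas real switches typically mark probabilistically between a minimum and a maximum threshold. The ECN codepoints themselves are the standard ones from RFC 3168.

```python
ECN_NOT_ECT = 0b00  # sender does not support ECN
ECN_ECT0 = 0b10     # ECN-capable transport
ECN_CE = 0b11       # congestion experienced


class LosslessQueue:
    """Toy per-priority queue with ECN marking and PFC pause/resume (illustrative)."""

    def __init__(self, ecn_threshold, xoff_threshold, xon_threshold):
        self.depth = 0
        self.paused = False  # True while upstream is asked to pause this priority
        self.ecn_threshold = ecn_threshold
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold

    def enqueue(self, ecn_bits):
        """Admit one packet and return its (possibly re-marked) ECN bits."""
        self.depth += 1
        if self.depth >= self.xoff_threshold:
            self.paused = True  # a switch would send a PFC pause frame here
        if self.depth > self.ecn_threshold and ecn_bits != ECN_NOT_ECT:
            return ECN_CE  # mark instead of dropping
        return ecn_bits

    def dequeue(self):
        """Forward one packet; resume upstream once the queue has drained enough."""
        if self.depth > 0:
            self.depth -= 1
        if self.paused and self.depth <= self.xon_threshold:
            self.paused = False  # a switch would send a PFC resume here


q = LosslessQueue(ecn_threshold=2, xoff_threshold=4, xon_threshold=1)
marks = [q.enqueue(ECN_ECT0) for _ in range(4)]
paused_after_burst = q.paused
print(marks)               # [2, 2, 3, 3]: the last two packets are marked CE
print(paused_after_burst)  # True

for _ in range(3):
    q.dequeue()
print(q.paused)  # False: depth fell to the resume threshold
```

In this model ECN fires first, giving senders a chance to slow down, and PFC acts as the backstop if the queue keeps growing.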



To summarize:

Servers mark RDMA (RoCE) traffic using DSCP or Layer 2 priority values, allowing switches to classify and map packets into the appropriate traffic pools.
Congestion is managed using ECN, which signals senders to slow down, and PFC, which selectively pauses traffic based on priority.
When receivers detect ECN-marked packets, they respond with “ECN echo” messages, prompting senders to adjust their transmission rate accordingly.