net: minor update to Documentation/networking/scaling.txt
Incorporate last comments about hyperthreading, interrupt coalescing
and the definition of cache domains into the network scaling document
scaling.txt.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Committed by: David S. Miller
parent: b88cf73d92
commit: 320f24e482
@@ -52,7 +52,8 @@ module parameter for specifying the number of hardware queues to
 configure. In the bnx2x driver, for instance, this parameter is called
 num_queues. A typical RSS configuration would be to have one receive queue
 for each CPU if the device supports enough queues, or otherwise at least
-one for each cache domain at a particular cache level (L1, L2, etc.).
+one for each memory domain, where a memory domain is a set of CPUs that
+share a particular memory level (L1, L2, NUMA node, etc.).
 
 The indirection table of an RSS device, which resolves a queue by masked
 hash, is usually programmed by the driver at initialization. The
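The hunk above recommends one receive queue per CPU (or per memory
domain) up to the NIC maximum. On drivers that expose their queue count
through the channels interface, this can be inspected and set with
ethtool; the interface name eth0 and combined-channel support are
assumptions here, and the command needs root:

```shell
# Show the NIC's maximum and currently configured queue (channel) counts.
# (eth0 is a placeholder; substitute your interface.)
ethtool -l eth0

# Request one combined queue per online CPU, capped by the NIC maximum;
# the driver may round the request down.
ethtool -L eth0 combined "$(nproc)"
```

Drivers such as bnx2x instead take the count as a module parameter
(num_queues, per the text above), so the ethtool path is not universal.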
@@ -82,11 +83,17 @@ RSS should be enabled when latency is a concern or whenever receive
 interrupt processing forms a bottleneck. Spreading load between CPUs
 decreases queue length. For low latency networking, the optimal setting
 is to allocate as many queues as there are CPUs in the system (or the
-NIC maximum, if lower). Because the aggregate number of interrupts grows
-with each additional queue, the most efficient high-rate configuration
+NIC maximum, if lower). The most efficient high-rate configuration
 is likely the one with the smallest number of receive queues where no
-CPU that processes receive interrupts reaches 100% utilization. Per-cpu
-load can be observed using the mpstat utility.
+receive queue overflows due to a saturated CPU, because in default
+mode with interrupt coalescing enabled, the aggregate number of
+interrupts (and thus work) grows with each additional queue.
+
+Per-cpu load can be observed using the mpstat utility, but note that on
+processors with hyperthreading (HT), each hyperthread is represented as
+a separate CPU. For interrupt handling, HT has shown no benefit in
+initial tests, so limit the number of queues to the number of CPU cores
+in the system.
 
 
 RPS: Receive Packet Steering
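The new text advises sizing the queue count by physical cores rather
than hyperthreads, since tools like mpstat (e.g. `mpstat -P ALL 1 1`)
report each hyperthread as a separate CPU. A hedged sketch of counting
physical cores from "socket core" pairs, as produced for instance by
`lscpu -p=Socket,Core` with its comment lines stripped; `count_cores`
is a hypothetical helper, not an existing utility:

```shell
# Count distinct physical cores from "socket core" pairs, one per line:
# duplicate pairs are hyperthread siblings, so deduplicate and count.
count_cores() {
    sort -u | wc -l
}

# Example: 4 hyperthreads spread over 2 cores of one socket -> 2 cores.
printf '0 0\n0 0\n0 1\n0 1\n' | count_cores
```

In mpstat output, sustained softirq (%soft) time on the interrupt CPUs
is the usual sign that receive processing is the bottleneck.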
@@ -145,7 +152,7 @@ the bitmap.
 == Suggested Configuration
 
 For a single queue device, a typical RPS configuration would be to set
-the rps_cpus to the CPUs in the same cache domain of the interrupting
+the rps_cpus to the CPUs in the same memory domain of the interrupting
 CPU. If NUMA locality is not an issue, this could also be all CPUs in
 the system. At high interrupt rate, it might be wise to exclude the
 interrupting CPU from the map since that already performs much work.
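rps_cpus is a hexadecimal CPU bitmap written through sysfs. A minimal
sketch of building such a mask from a CPU list; `cpus_to_mask` is a
hypothetical helper, and the sysfs path at the end assumes interface
eth0 with a single receive queue:

```shell
# Build a hex bitmap for rps_cpus from a list of CPU ids.
# (Shell arithmetic; valid for CPU ids below the shell's word size.)
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# CPUs 0-3 (e.g. one memory domain) -> prints "f".
cpus_to_mask 0 1 2 3

# Writing the mask takes root privileges:
# echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus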
@@ -154,7 +161,7 @@ For a multi-queue system, if RSS is configured so that a hardware
 receive queue is mapped to each CPU, then RPS is probably redundant
 and unnecessary. If there are fewer hardware queues than CPUs, then
 RPS might be beneficial if the rps_cpus for each queue are the ones that
-share the same cache domain as the interrupting CPU for that queue.
+share the same memory domain as the interrupting CPU for that queue.
 
 
 RFS: Receive Flow Steering
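For the multi-queue case above, each queue gets its own rps_cpus file.
A hedged config fragment, assuming interface eth0, two receive queues
whose interrupts land on CPUs in different memory domains, and that
you have checked /proc/interrupts for the real queue-to-CPU mapping
(the masks below are illustrative, and writing them needs root):

```shell
# Queue rx-0 interrupts on a CPU in the domain covering CPUs 0-3:
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Queue rx-1 interrupts on a CPU in the domain covering CPUs 4-7:
echo f0 > /sys/class/net/eth0/queues/rx-1/rps_cpus
```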
@@ -326,7 +333,7 @@ The queue chosen for transmitting a particular flow is saved in the
 corresponding socket structure for the flow (e.g. a TCP connection).
 This transmit queue is used for subsequent packets sent on the flow to
 prevent out of order (ooo) packets. The choice also amortizes the cost
-of calling get_xps_queues() over all packets in the connection. To avoid
+of calling get_xps_queues() over all packets in the flow. To avoid
 ooo packets, the queue for a flow can subsequently only be changed if
 skb->ooo_okay is set for a packet in the flow. This flag indicates that
 there are no outstanding packets in the flow, so the transmit queue can