linux-kernel-test/net
Tom Herbert 0a9627f264 rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS).  RPS
distributes the load of received packet processing across multiple CPUs.

Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load.  This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.

This solution queues packets early on in the receive path on the backlog queues
of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel.   For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask.  The IPI mechanism is used to raise networking receive
softirqs between CPUs.  This effectively emulates in software what a multi-queue
NIC can provide, but is generic, requiring no device support.
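
The selection step described above can be sketched in user-space C.  This is
a minimal illustration, not the code in net/core/dev.c: struct flow_key,
flow_hash() and pick_cpu() are hypothetical names, the FNV-1a hash is a
stand-in for whatever hash the kernel uses, and the CPU mask is assumed to
have been expanded into an array holding the CPU numbers of its set bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key: the TCP/UDP 4-tuple named in the text. */
struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
};

/* Stand-in hash (32-bit FNV-1a over the key bytes); an assumption of
 * this sketch, not the hash used by the patch. */
uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t words[3] = {
        k->saddr,
        k->daddr,
        ((uint32_t)k->sport << 16) | k->dport,
    };
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/* Scale a 32-bit hash onto the n CPUs allowed by the mask.  cpus[] holds
 * the CPU numbers of the set bits.  Returns -1 if the mask is empty. */
int pick_cpu(uint32_t hash, const int *cpus, size_t n)
{
    if (n == 0)
        return -1;
    /* Multiply-and-shift maps [0, 2^32) evenly onto [0, n). */
    return cpus[((uint64_t)hash * n) >> 32];
}
```

Every packet of a flow produces the same hash, so with an unchanged mask
pick_cpu(flow_hash(&key), cpus, n) always names the same CPU and packets
within a flow stay in order.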

Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
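
A rough sketch of that fallback, under stated assumptions: struct pkt is a
hypothetical stand-in for the skb, the field name rxhash and the convention
that 0 means "no hash reported" are assumptions of this example, and
sw_hash() is a placeholder mixing function rather than the kernel's hash.

```c
#include <stdint.h>

/* Hypothetical stand-in for the packet buffer; only the fields this
 * sketch needs. */
struct pkt {
    uint32_t rxhash;        /* hash reported by the NIC, 0 if none (assumed) */
    uint32_t saddr, daddr;  /* header fields, read only on the slow path */
    uint16_t sport, dport;
};

/* Placeholder software hash.  Reading the header fields here is the step
 * that can miss in the cache when the packet data is cold. */
uint32_t sw_hash(const struct pkt *p)
{
    uint32_t h = p->saddr ^ (p->daddr * 0x9e3779b1u);

    h ^= ((uint32_t)p->sport << 16) | p->dport;
    h *= 0x85ebca6bu;
    h ^= h >> 16;
    return h ? h : 1;       /* keep 0 reserved for "no hash" */
}

/* Prefer the hash supplied by the driver; otherwise compute one in
 * software and store it so later lookups can reuse it. */
uint32_t steering_hash(struct pkt *p)
{
    if (p->rxhash == 0)
        p->rxhash = sw_hash(p);
    return p->rxhash;
}
```

When the driver has filled in the hash, steering_hash() never touches the
header fields, which is where the saving comes from.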

The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
bit maps for receive queues in the device (numbered by <n>).  If a device
does not support multi-queue, a single variable is used for the device (rx-0).
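
As an illustration of setting a mask, the sketch below builds the sysfs path
given above and formats a list of CPUs as a hexadecimal bitmap.  The helper
names (rps_path, rps_mask_format, rps_set_cpus) are hypothetical, the example
only handles CPUs 0 to 31, and writing the file assumes root privileges on a
kernel that carries this patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build the per-queue sysfs path described in the text. */
int rps_path(char *buf, size_t len, const char *dev, unsigned int queue)
{
    int n = snprintf(buf, len, "/sys/class/net/%s/queues/rx-%u/rps_cpus",
                     dev, queue);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format a list of CPU numbers as a hex bitmap string.
 * Limited to CPUs 0..31 in this sketch. */
int rps_mask_format(char *buf, size_t len, const int *cpus, size_t ncpus)
{
    uint32_t mask = 0;
    int n;

    for (size_t i = 0; i < ncpus; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return -1;
        mask |= (uint32_t)1 << cpus[i];
    }
    n = snprintf(buf, len, "%x", (unsigned int)mask);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write the mask for one receive queue.  Returns 0 on success. */
int rps_set_cpus(const char *dev, unsigned int queue,
                 const int *cpus, size_t ncpus)
{
    char path[256], mask[16];
    FILE *f;
    int ok;

    if (rps_path(path, sizeof(path), dev, queue) != 0 ||
        rps_mask_format(mask, sizeof(mask), cpus, ncpus) != 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%s\n", mask) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, steering rx-0 of a device named eth0 (a placeholder name) to
CPUs 0-3 would write "f" to /sys/class/net/eth0/queues/rx-0/rps_cpus.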

Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization.  Optimal settings for the CPU mask
seem to depend on architecture and cache hierarchy.  Below are some results
from running 500 instances of the netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.

e1000e on 8 core Intel
   Without RPS: 108K tps at 33% CPU
   With RPS:    311K tps at 64% CPU

forcedeth on 16 core AMD
   Without RPS: 156K tps at 15% CPU
   With RPS:    404K tps at 49% CPU

bnx2x on 16 core AMD
   Without RPS: 567K tps at 61% CPU (4 HW RX queues)
   Without RPS: 738K tps at 96% CPU (8 HW RX queues)
   With RPS:    854K tps at 76% CPU (4 HW RX queues)

Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet.  In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically; however, whenever a mask is changed
this introduces the possibility of generating out of order packets.  It's
probably best not to change the masks too frequently.

Signed-off-by: Tom Herbert <therbert@google.com>

 include/linux/netdevice.h |   32 ++++-
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
 net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
 net/core/skbuff.c         |    2 +
 5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:23:18 -07:00
..
9p 9p: Change the name of new protocol from 9p2010.L to 9p2000.L 2010-03-13 08:57:29 -06:00
802 sysctl net: Remove unused binary sysctl code 2009-11-12 02:05:06 -08:00
8021q percpu: add __percpu sparse annotations to net 2010-02-16 23:05:38 -08:00
appletalk net: appletalk: use seq_hlist_foo() helpers 2010-02-10 11:12:09 -08:00
atm net: atm: use seq_list_foo() helpers 2010-02-10 12:31:10 -08:00
ax25 net: ax25: use seq_hlist_foo() helpers 2010-02-10 11:12:09 -08:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-13 14:50:18 -08:00
bridge bridge: Make first arg to deliver_clone const. 2010-03-16 14:37:47 -07:00
can can: deny filterlist access on non-CAN interfaces 2010-02-02 07:21:34 -08:00
core rps: Receive Packet Steering 2010-03-16 21:23:18 -07:00
dcb const: struct nla_policy 2010-02-18 14:30:18 -08:00
dccp net-2.6 [Bug-Fix][dccp]: fix oops caused after failed initialisation 2010-03-15 16:00:50 -07:00
decnet net: Add checking to rcu_dereference() primitives 2010-02-25 09:41:03 +01:00
dsa
econet net: use net_eq to compare nets 2009-11-25 15:14:13 -08:00
ethernet llc: use dev_hard_header 2009-12-26 20:38:23 -08:00
ieee802154 net: use net_eq to compare nets 2009-11-25 15:14:13 -08:00
ipv4 route: Fix caught BUG_ON during rt_secret_rebuild_oneshot() 2010-03-16 14:15:47 -07:00
ipv6 ipv6: Send netlink notification when DAD fails 2010-03-13 12:23:29 -08:00
ipx net: ipx: use seq_list_foo() helpers 2010-02-10 12:31:10 -08:00
irda const: struct nla_policy 2010-02-18 14:30:18 -08:00
iucv const: constify remaining dev_pm_ops 2009-12-15 08:53:25 -08:00
key xfrm: SP lookups signature with mark 2010-02-22 16:21:12 -08:00
lapb
llc net: backlog functions rename 2010-03-05 13:34:03 -08:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-13 14:50:18 -08:00
netfilter Merge branch 'for-next' into for-linus 2010-03-08 16:55:37 +01:00
netlabel net: remove INIT_RCU_HEAD() usage 2010-02-17 00:03:27 -08:00
netlink netlink: Adding inode field to /proc/net/netlink 2010-02-28 01:29:49 -08:00
netrom net: netrom: use seq_hlist_foo() helpers 2010-02-10 11:12:08 -08:00
packet af_packet: move strict addr_len check right before dev_[mc/unicast]_[add/del] 2010-03-03 01:04:38 -08:00
phonet phonet: use for_each_set_bit() 2010-03-15 16:00:47 -07:00
rds RDS: Enable per-cpu workqueue threads 2010-03-16 21:17:02 -07:00
rfkill rfkill: Add support for KEY_RFKILL 2010-03-02 14:28:49 -05:00
rose net: rose: use seq_hlist_foo() helpers 2010-02-10 11:12:08 -08:00
rxrpc net: use net_eq to compare nets 2009-11-25 15:14:13 -08:00
sched Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-09 11:44:44 -08:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-13 14:50:18 -08:00
sunrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 2010-03-13 14:50:18 -08:00
tipc tipc: fix lockdep warning on address assignment 2010-03-16 14:15:45 -07:00
unix AF_UNIX: update locking comment 2010-02-18 14:12:06 -08:00
wanrouter
wimax const: struct nla_policy 2010-02-18 14:30:18 -08:00
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 2010-02-25 23:26:21 -08:00
x25 net: backlog functions rename 2010-03-05 13:34:03 -08:00
xfrm ipsec: Fix bogus bundle flowi 2010-03-03 01:04:37 -08:00
compat.c net: use compat helper functions in compat_sys_recvmmsg 2009-12-11 15:07:57 -08:00
Kconfig
Makefile
nonet.c
socket.c fs: no games with DCACHE_UNHASHED 2009-12-17 10:51:40 -05:00
sysctl_net.c net: spread __net_init, __net_exit 2010-01-17 19:16:02 -08:00
TUNABLE