net: Add Open vSwitch kernel components.
Open vSwitch is a multilayer Ethernet switch targeted at virtualized environments. In addition to supporting a variety of features expected in a traditional hardware switch, it enables fine-grained programmatic extension and flow-based control of the network. This control is useful in a wide variety of applications but is particularly important in multi-server virtualization deployments, which are often characterized by highly dynamic endpoints and the need to maintain logical abstractions for multiple tenants. The Open vSwitch datapath provides an in-kernel fast path for packet forwarding. It is complemented by a userspace daemon, ovs-vswitchd, which is able to accept configuration from a variety of sources and translate it into packet processing rules. See http://openvswitch.org for more information and userspace utilities. Signed-off-by: Jesse Gross <jesse@nicira.com>
This commit is contained in:
@@ -144,6 +144,8 @@ nfc.txt
|
||||
- The Linux Near Field Communication (NFS) subsystem.
|
||||
olympic.txt
|
||||
- IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info.
|
||||
openvswitch.txt
|
||||
- Open vSwitch developer documentation.
|
||||
operstates.txt
|
||||
- Overview of network interface operational states.
|
||||
packet_mmap.txt
|
||||
|
195
Documentation/networking/openvswitch.txt
Normal file
195
Documentation/networking/openvswitch.txt
Normal file
@@ -0,0 +1,195 @@
|
||||
Open vSwitch datapath developer documentation
|
||||
=============================================
|
||||
|
||||
The Open vSwitch kernel module allows flexible userspace control over
|
||||
flow-level packet processing on selected network devices. It can be
|
||||
used to implement a plain Ethernet switch, network device bonding,
|
||||
VLAN processing, network access control, flow-based network control,
|
||||
and so on.
|
||||
|
||||
The kernel module implements multiple "datapaths" (analogous to
|
||||
bridges), each of which can have multiple "vports" (analogous to ports
|
||||
within a bridge). Each datapath also has associated with it a "flow
|
||||
table" that userspace populates with "flows" that map from keys based
|
||||
on packet headers and metadata to sets of actions. The most common
|
||||
action forwards the packet to another vport; other actions are also
|
||||
implemented.
|
||||
|
||||
When a packet arrives on a vport, the kernel module processes it by
|
||||
extracting its flow key and looking it up in the flow table. If there
|
||||
is a matching flow, it executes the associated actions. If there is
|
||||
no match, it queues the packet to userspace for processing (as part of
|
||||
its processing, userspace will likely set up a flow to handle further
|
||||
packets of the same type entirely in-kernel).
|
||||
|
||||
|
||||
Flow key compatibility
|
||||
----------------------
|
||||
|
||||
Network protocols evolve over time. New protocols become important
|
||||
and existing protocols lose their prominence. For the Open vSwitch
|
||||
kernel module to remain relevant, it must be possible for newer
|
||||
versions to parse additional protocols as part of the flow key. It
|
||||
might even be desirable, someday, to drop support for parsing
|
||||
protocols that have become obsolete. Therefore, the Netlink interface
|
||||
to Open vSwitch is designed to allow carefully written userspace
|
||||
applications to work with any version of the flow key, past or future.
|
||||
|
||||
To support this forward and backward compatibility, whenever the
|
||||
kernel module passes a packet to userspace, it also passes along the
|
||||
flow key that it parsed from the packet. Userspace then extracts its
|
||||
own notion of a flow key from the packet and compares it against the
|
||||
kernel-provided version:
|
||||
|
||||
- If userspace's notion of the flow key for the packet matches the
|
||||
kernel's, then nothing special is necessary.
|
||||
|
||||
- If the kernel's flow key includes more fields than the userspace
|
||||
version of the flow key, for example if the kernel decoded IPv6
|
||||
headers but userspace stopped at the Ethernet type (because it
|
||||
does not understand IPv6), then again nothing special is
|
||||
necessary. Userspace can still set up a flow in the usual way,
|
||||
as long as it uses the kernel-provided flow key to do it.
|
||||
|
||||
- If the userspace flow key includes more fields than the
|
||||
kernel's, for example if userspace decoded an IPv6 header but
|
||||
the kernel stopped at the Ethernet type, then userspace can
|
||||
forward the packet manually, without setting up a flow in the
|
||||
kernel. This case is bad for performance because every packet
|
||||
that the kernel considers part of the flow must go to userspace,
|
||||
but the forwarding behavior is correct. (If userspace can
|
||||
determine that the values of the extra fields would not affect
|
||||
forwarding behavior, then it could set up a flow anyway.)
|
||||
|
||||
How flow keys evolve over time is important to making this work, so
|
||||
the following sections go into detail.
|
||||
|
||||
|
||||
Flow key format
|
||||
---------------
|
||||
|
||||
A flow key is passed over a Netlink socket as a sequence of Netlink
|
||||
attributes. Some attributes represent packet metadata, defined as any
|
||||
information about a packet that cannot be extracted from the packet
|
||||
itself, e.g. the vport on which the packet was received. Most
|
||||
attributes, however, are extracted from headers within the packet,
|
||||
e.g. source and destination addresses from Ethernet, IP, or TCP
|
||||
headers.
|
||||
|
||||
The <linux/openvswitch.h> header file defines the exact format of the
|
||||
flow key attributes. For informal explanatory purposes here, we write
|
||||
them as comma-separated strings, with parentheses indicating arguments
|
||||
and nesting. For example, the following could represent a flow key
|
||||
corresponding to a TCP packet that arrived on vport 1:
|
||||
|
||||
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
|
||||
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
|
||||
frag=no), tcp(src=49163, dst=80)
|
||||
|
||||
Often we ellipsize arguments not important to the discussion, e.g.:
|
||||
|
||||
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
|
||||
|
||||
|
||||
Basic rule for evolving flow keys
|
||||
---------------------------------
|
||||
|
||||
Some care is needed to really maintain forward and backward
|
||||
compatibility for applications that follow the rules listed under
|
||||
"Flow key compatibility" above.
|
||||
|
||||
The basic rule is obvious:
|
||||
|
||||
------------------------------------------------------------------
|
||||
New network protocol support must only supplement existing flow
|
||||
key attributes. It must not change the meaning of already defined
|
||||
flow key attributes.
|
||||
------------------------------------------------------------------
|
||||
|
||||
This rule does have less-obvious consequences so it is worth working
|
||||
through a few examples. Suppose, for example, that the kernel module
|
||||
did not already implement VLAN parsing. Instead, it just interpreted
|
||||
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
|
||||
packet. The flow key for any packet with an 802.1Q header would look
|
||||
essentially like this, ignoring metadata:
|
||||
|
||||
eth(...), eth_type(0x8100)
|
||||
|
||||
Naively, to add VLAN support, it makes sense to add a new "vlan" flow
|
||||
key attribute to contain the VLAN tag, then continue to decode the
|
||||
encapsulated headers beyond the VLAN tag using the existing field
|
||||
definitions. With this change, an TCP packet in VLAN 10 would have a
|
||||
flow key much like this:
|
||||
|
||||
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
|
||||
|
||||
But this change would negatively affect a userspace application that
|
||||
has not been updated to understand the new "vlan" flow key attribute.
|
||||
The application could, following the flow compatibility rules above,
|
||||
ignore the "vlan" attribute that it does not understand and therefore
|
||||
assume that the flow contained IP packets. This is a bad assumption
|
||||
(the flow only contains IP packets if one parses and skips over the
|
||||
802.1Q header) and it could cause the application's behavior to change
|
||||
across kernel versions even though it follows the compatibility rules.
|
||||
|
||||
The solution is to use a set of nested attributes. This is, for
|
||||
example, why 802.1Q support uses nested attributes. A TCP packet in
|
||||
VLAN 10 is actually expressed as:
|
||||
|
||||
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
|
||||
ip(proto=6, ...), tcp(...)))
|
||||
|
||||
Notice how the "eth_type", "ip", and "tcp" flow key attributes are
|
||||
nested inside the "encap" attribute. Thus, an application that does
|
||||
not understand the "vlan" key will not see either of those attributes
|
||||
and therefore will not misinterpret them. (Also, the outer eth_type
|
||||
is still 0x8100, not changed to 0x0800.)
|
||||
|
||||
Handling malformed packets
|
||||
--------------------------
|
||||
|
||||
Don't drop packets in the kernel for malformed protocol headers, bad
|
||||
checksums, etc. This would prevent userspace from implementing a
|
||||
simple Ethernet switch that forwards every packet.
|
||||
|
||||
Instead, in such a case, include an attribute with "empty" content.
|
||||
It doesn't matter if the empty content could be valid protocol values,
|
||||
as long as those values are rarely seen in practice, because userspace
|
||||
can always forward all packets with those values to userspace and
|
||||
handle them individually.
|
||||
|
||||
For example, consider a packet that contains an IP header that
|
||||
indicates protocol 6 for TCP, but which is truncated just after the IP
|
||||
header, so that the TCP header is missing. The flow key for this
|
||||
packet would include a tcp attribute with all-zero src and dst, like
|
||||
this:
|
||||
|
||||
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
|
||||
|
||||
As another example, consider a packet with an Ethernet type of 0x8100,
|
||||
indicating that a VLAN TCI should follow, but which is truncated just
|
||||
after the Ethernet type. The flow key for this packet would include
|
||||
an all-zero-bits vlan and an empty encap attribute, like this:
|
||||
|
||||
eth(...), eth_type(0x8100), vlan(0), encap()
|
||||
|
||||
Unlike a TCP packet with source and destination ports 0, an
|
||||
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
|
||||
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
|
||||
attribute expressly to allow this situation to be distinguished.
|
||||
Thus, the flow key in this second example unambiguously indicates a
|
||||
missing or malformed VLAN TCI.
|
||||
|
||||
Other rules
|
||||
-----------
|
||||
|
||||
The other rules for flow keys are much less subtle:
|
||||
|
||||
- Duplicate attributes are not allowed at a given nesting level.
|
||||
|
||||
- Ordering of attributes is not significant.
|
||||
|
||||
- When the kernel sends a given flow key to userspace, it always
|
||||
composes it the same way. This allows userspace to hash and
|
||||
compare entire flow keys that it may not be able to fully
|
||||
interpret.
|
Reference in New Issue
Block a user