nfs: new subdir Documentation/filesystems/nfs
We're adding enough nfs documentation that it may as well have its own
subdirectory.

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
12
Documentation/filesystems/nfs/00-INDEX
Normal file
@@ -0,0 +1,12 @@
00-INDEX
	- this file (nfs-related documentation).
Exporting
	- explanation of how to make filesystems exportable.
nfs.txt
	- nfs client, and DNS resolution for fs_locations.
nfs41-server.txt
	- info on the Linux server implementation of NFSv4 minor version 1.
nfs-rdma.txt
	- how to install and set up the Linux NFS/RDMA client and server software
nfsroot.txt
	- short guide on setting up a diskless box with NFS root filesystem.
147
Documentation/filesystems/nfs/Exporting
Normal file
@@ -0,0 +1,147 @@
|
||||
|
||||
Making Filesystems Exportable
|
||||
=============================
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
All filesystem operations require a dentry (or two) as a starting
|
||||
point. Local applications have a reference-counted hold on suitable
|
||||
dentries via open file descriptors or cwd/root. However remote
|
||||
applications that access a filesystem via a remote filesystem protocol
|
||||
such as NFS may not be able to hold such a reference, and so need a
|
||||
different way to refer to a particular dentry. As the alternative
|
||||
form of reference needs to be stable across renames, truncates, and
|
||||
server-reboot (among other things, though these tend to be the most
|
||||
problematic), there is no simple answer like 'filename'.
|
||||
|
||||
The mechanism discussed here allows each filesystem implementation to
|
||||
specify how to generate an opaque (outside of the filesystem) byte
|
||||
string for any dentry, and how to find an appropriate dentry for any
|
||||
given opaque byte string.
|
||||
This byte string will be called a "filehandle fragment" as it
|
||||
corresponds to part of an NFS filehandle.
|
||||
|
||||
A filesystem which supports the mapping between filehandle fragments
|
||||
and dentries will be termed "exportable".
|
||||
|
||||
|
||||
|
||||
Dcache Issues
|
||||
-------------
|
||||
|
||||
The dcache normally contains a proper prefix of any given filesystem
|
||||
tree. This means that if any filesystem object is in the dcache, then
|
||||
all of the ancestors of that filesystem object are also in the dcache.
|
||||
As normal access is by filename this prefix is created naturally and
|
||||
maintained easily (by each object maintaining a reference count on
|
||||
its parent).
|
||||
|
||||
However when objects are included into the dcache by interpreting a
|
||||
filehandle fragment, there is no automatic creation of a path prefix
|
||||
for the object. This leads to two related but distinct features of
|
||||
the dcache that are not needed for normal filesystem access.
|
||||
|
||||
1/ The dcache must sometimes contain objects that are not part of the
|
||||
proper prefix, i.e. that are not connected to the root.
|
||||
2/ The dcache must be prepared for a newly found (via ->lookup) directory
|
||||
to already have a (non-connected) dentry, and must be able to move
|
||||
that dentry into place (based on the parent and name in the
|
||||
->lookup). This is particularly needed for directories as
|
||||
it is a dcache invariant that directories only have one dentry.
|
||||
|
||||
To implement these features, the dcache has:
|
||||
|
||||
a/ A dentry flag DCACHE_DISCONNECTED which is set on
|
||||
any dentry that might not be part of the proper prefix.
|
||||
This is set when anonymous dentries are created, and cleared when a
|
||||
dentry is noticed to be a child of a dentry which is in the proper
|
||||
prefix.
|
||||
|
||||
b/ A per-superblock list "s_anon" of dentries which are the roots of
|
||||
subtrees that are not in the proper prefix. These dentries, as
|
||||
well as the proper prefix, need to be released at unmount time. As
|
||||
these dentries will not be hashed, they are linked together on the
|
||||
d_hash list_head.
|
||||
|
||||
c/ Helper routines to allocate anonymous dentries, and to help attach
|
||||
loose directory dentries at lookup time. They are:
|
||||
d_alloc_anon(inode) will return a dentry for the given inode.
|
||||
If the inode already has a dentry, one of those is returned.
|
||||
If it doesn't, a new anonymous (IS_ROOT and
|
||||
DCACHE_DISCONNECTED) dentry is allocated and attached.
|
||||
In the case of a directory, care is taken that only one dentry
|
||||
can ever be attached.
|
||||
d_splice_alias(inode, dentry) will make sure that there is a
|
||||
dentry with the same name and parent as the given dentry, and
|
||||
which refers to the given inode.
|
||||
If the inode is a directory and already has a dentry, then that
|
||||
dentry is d_moved over the given dentry.
|
||||
If the passed dentry gets attached, care is taken that this is
|
||||
mutually exclusive to a d_alloc_anon operation.
|
||||
If the passed dentry is used, NULL is returned, else the used
|
||||
dentry is returned. This corresponds to the calling pattern of
|
||||
->lookup.
|
||||
|
||||
|
||||
Filesystem Issues
|
||||
-----------------
|
||||
|
||||
For a filesystem to be exportable it must:
|
||||
|
||||
1/ provide the filehandle fragment routines described below.
|
||||
2/ make sure that d_splice_alias is used rather than d_add
|
||||
when ->lookup finds an inode for a given parent and name.
|
||||
Typically the ->lookup routine will end with a:
|
||||
|
||||
return d_splice_alias(inode, dentry);
|
||||
}
|
||||
|
||||
|
||||
|
||||
A file system implementation declares that instances of the filesystem
|
||||
are exportable by setting the s_export_op field in the struct
|
||||
super_block. This field must point to a "struct export_operations"
|
||||
struct which has the following members:
|
||||
|
||||
encode_fh (optional)
|
||||
Takes a dentry and creates a filehandle fragment which can later be used
|
||||
to find or create a dentry for the same object. The default
|
||||
implementation creates a filehandle fragment that encodes a 32bit inode
|
||||
and generation number for the inode encoded, and if necessary the
|
||||
same information for the parent.
|
||||
|
||||
fh_to_dentry (mandatory)
|
||||
Given a filehandle fragment, this should find the implied object and
|
||||
create a dentry for it (possibly with d_alloc_anon).
|
||||
|
||||
fh_to_parent (optional but strongly recommended)
|
||||
Given a filehandle fragment, this should find the parent of the
|
||||
implied object and create a dentry for it (possibly with d_alloc_anon).
|
||||
May fail if the filehandle fragment is too small.
|
||||
|
||||
get_parent (optional but strongly recommended)
|
||||
When given a dentry for a directory, this should return a dentry for
|
||||
the parent. Quite possibly the parent dentry will have been allocated
|
||||
by d_alloc_anon. The default get_parent function just returns an error
|
||||
so any filehandle lookup that requires finding a parent will fail.
|
||||
->lookup("..") is *not* used as a default as it can leave ".." entries
|
||||
in the dcache which are too messy to work with.
|
||||
|
||||
get_name (optional)
|
||||
When given a parent dentry and a child dentry, this should find a name
|
||||
in the directory identified by the parent dentry, which leads to the
|
||||
object identified by the child dentry. If no get_name function is
|
||||
supplied, a default implementation is provided which uses vfs_readdir
|
||||
to find potential names, and matches inode numbers to find the correct
|
||||
match.
|
||||
|
||||
|
||||
A filehandle fragment consists of an array of 1 or more 4-byte words,
|
||||
together with a one byte "type".
|
||||
The decode_fh routine should not depend on the stated size that is
|
||||
passed to it. This size may be larger than the original filehandle
|
||||
generated by encode_fh, in which case it will have been padded with
|
||||
nuls. Rather, the encode_fh routine should choose a "type" which
|
||||
indicates to decode_fh how much of the filehandle is valid, and how
|
||||
it should be interpreted.
|
271
Documentation/filesystems/nfs/nfs-rdma.txt
Normal file
@@ -0,0 +1,271 @@
|
||||
################################################################################
|
||||
# #
|
||||
# NFS/RDMA README #
|
||||
# #
|
||||
################################################################################
|
||||
|
||||
Author: NetApp and Open Grid Computing
|
||||
Date: May 29, 2008
|
||||
|
||||
Table of Contents
|
||||
~~~~~~~~~~~~~~~~~
|
||||
- Overview
|
||||
- Getting Help
|
||||
- Installation
|
||||
- Check RDMA and NFS Setup
|
||||
- NFS/RDMA Setup
|
||||
|
||||
Overview
|
||||
~~~~~~~~
|
||||
|
||||
This document describes how to install and set up the Linux NFS/RDMA client
|
||||
and server software.
|
||||
|
||||
The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
|
||||
was first included in the following release, Linux 2.6.25.
|
||||
|
||||
In our testing, we have obtained excellent performance results (full 10Gbit
|
||||
wire bandwidth at minimal client CPU) under many workloads. The code passes
|
||||
the full Connectathon test suite and operates over both Infiniband and iWARP
|
||||
RDMA adapters.
|
||||
|
||||
Getting Help
|
||||
~~~~~~~~~~~~
|
||||
|
||||
If you get stuck, you can ask questions on the
|
||||
|
||||
nfs-rdma-devel@lists.sourceforge.net
|
||||
|
||||
mailing list.
|
||||
|
||||
Installation
|
||||
~~~~~~~~~~~~
|
||||
|
||||
These instructions are a step by step guide to building a machine for
|
||||
use with NFS/RDMA.
|
||||
|
||||
- Install an RDMA device
|
||||
|
||||
Any device supported by the drivers in drivers/infiniband/hw is acceptable.
|
||||
|
||||
Testing has been performed using several Mellanox-based IB cards, the
|
||||
Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
|
||||
|
||||
- Install a Linux distribution and tools
|
||||
|
||||
The first kernel release to contain both the NFS/RDMA client and server was
|
||||
Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
|
||||
Linux kernel releases should be installed.
|
||||
|
||||
The procedures described in this document have been tested with
|
||||
distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
|
||||
|
||||
- Install nfs-utils-1.1.2 or greater on the client
|
||||
|
||||
An NFS/RDMA mount point can be obtained by using the mount.nfs command in
|
||||
nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
|
||||
version with support for NFS/RDMA mounts, but for various reasons we
|
||||
recommend using nfs-utils-1.1.2 or greater). To see which version of
|
||||
mount.nfs you are using, type:
|
||||
|
||||
$ /sbin/mount.nfs -V
|
||||
|
||||
If the version is less than 1.1.2 or the command does not exist,
|
||||
you should install the latest version of nfs-utils.
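
The version comparison described above can be scripted. The following is a
minimal sketch and not part of nfs-utils: the function name version_at_least
is hypothetical, and it assumes a sort(1) that supports the -V (version sort)
option, as GNU coreutils does. Extracting the version number from the output
of "mount.nfs -V" is left to the caller, since the exact output format is not
specified in this document.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version "$1" is at least version "$2".
# Assumes sort(1) supports -V (version sort), e.g. GNU coreutils.
version_at_least() {
    have=$1
    want=$2
    # Sort the two versions; the wanted version must sort first (or tie).
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

For example, "version_at_least 1.1.1 1.1.2" fails, which would indicate that
a newer nfs-utils should be installed.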
|
||||
|
||||
Download the latest package from:
|
||||
|
||||
http://www.kernel.org/pub/linux/utils/nfs
|
||||
|
||||
Uncompress the package and follow the installation instructions.
|
||||
|
||||
If you will not need the idmapper and gssd executables (you do not need
|
||||
these to create an NFS/RDMA enabled mount command), the installation
|
||||
process can be simplified by disabling these features when running
|
||||
configure:
|
||||
|
||||
$ ./configure --disable-gss --disable-nfsv4
|
||||
|
||||
To build nfs-utils you will need the tcp_wrappers package installed. For
|
||||
more information on this see the package's README and INSTALL files.
|
||||
|
||||
After building the nfs-utils package, there will be a mount.nfs binary in
|
||||
the utils/mount directory. This binary can be used to initiate NFS v2, v3,
|
||||
or v4 mounts. To initiate a v4 mount, the binary must be called
|
||||
mount.nfs4. The standard technique is to create a symlink called
|
||||
mount.nfs4 to mount.nfs.
|
||||
|
||||
This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
|
||||
|
||||
$ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
|
||||
|
||||
In this location, mount.nfs will be invoked automatically for NFS mounts
|
||||
by the system mount command.
|
||||
|
||||
NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
|
||||
on the NFS client machine. You do not need this specific version of
|
||||
nfs-utils on the server. Furthermore, only the mount.nfs command from
|
||||
nfs-utils-1.1.2 is needed on the client.
|
||||
|
||||
- Install a Linux kernel with NFS/RDMA
|
||||
|
||||
The NFS/RDMA client and server are both included in the mainline Linux
|
||||
kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
|
||||
kernel can be found at:
|
||||
|
||||
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
|
||||
|
||||
Download the sources and place them in an appropriate location.
|
||||
|
||||
- Configure the RDMA stack
|
||||
|
||||
Make sure your kernel configuration has RDMA support enabled. Under
|
||||
Device Drivers -> InfiniBand support, update the kernel configuration
|
||||
to enable InfiniBand support [NOTE: the option name is misleading. Enabling
|
||||
InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
|
||||
|
||||
Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
|
||||
iWARP adapter support (amso, cxgb3, etc.).
|
||||
|
||||
If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
|
||||
|
||||
- Configure the NFS client and server
|
||||
|
||||
Your kernel configuration must also have NFS file system support and/or
|
||||
NFS server support enabled. These and other NFS related configuration
|
||||
options can be found under File Systems -> Network File Systems.
|
||||
|
||||
- Build, install, reboot
|
||||
|
||||
The NFS/RDMA code will be enabled automatically if NFS and RDMA
|
||||
are turned on. The NFS/RDMA client and server are configured via the hidden
|
||||
SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
|
||||
value of SUNRPC_XPRT_RDMA will be:
|
||||
|
||||
- N if either SUNRPC or INFINIBAND is N; in this case the NFS/RDMA client
  and server will not be built
- M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M;
  in this case the NFS/RDMA client and server will be built as modules
- Y if both SUNRPC and INFINIBAND are Y; in this case the NFS/RDMA client
  and server will be built into the kernel
|
||||
|
||||
Therefore, if you have followed the steps above and turned on NFS and RDMA,
|
||||
the NFS/RDMA client and server will be built.
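
The three rules above amount to a small truth table. The sketch below only
illustrates that logic; it is not how Kconfig evaluates the option. The
function name xprt_rdma_value is hypothetical, and it assumes the lowercase
y/m/n spelling used in a kernel .config file.

```shell
#!/bin/sh
# Hypothetical helper: print the resulting SUNRPC_XPRT_RDMA value given
# the values of SUNRPC ($1) and INFINIBAND ($2), each "y", "m" or "n".
xprt_rdma_value() {
    sunrpc=$1
    infiniband=$2
    case "$sunrpc$infiniband" in
        yy) echo y ;;           # both built in
        ym|my|mm) echo m ;;     # both on, at least one modular
        *) echo n ;;            # either one off (or unrecognized input)
    esac
}
```

For example, "xprt_rdma_value y m" prints "m": the NFS/RDMA client and server
would be built as modules.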
|
||||
|
||||
Build a new kernel, install it, boot it.
|
||||
|
||||
Check RDMA and NFS Setup
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Before configuring the NFS/RDMA software, it is a good idea to test
|
||||
your new kernel to ensure that the kernel is working correctly.
|
||||
In particular, it is a good idea to verify that the RDMA stack
|
||||
is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
|
||||
is working properly.
|
||||
|
||||
- Check RDMA Setup
|
||||
|
||||
If you built the RDMA components as modules, load them at
|
||||
this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
|
||||
card:
|
||||
|
||||
$ modprobe ib_mthca
|
||||
$ modprobe ib_ipoib
|
||||
|
||||
If you are using InfiniBand, make sure there is a Subnet Manager (SM)
|
||||
running on the network. If your IB switch has an embedded SM, you can
|
||||
use it. Otherwise, you will need to run an SM, such as OpenSM, on one
|
||||
of your end nodes.
|
||||
|
||||
If an SM is running on your network, you should see the following:
|
||||
|
||||
$ cat /sys/class/infiniband/driverX/ports/1/state
|
||||
4: ACTIVE
|
||||
|
||||
where driverX is mthca0, ipath5, ehca3, etc.
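
If this check needs to be automated, for instance in a boot script, a small
helper can compare the file contents against the expected string. This is a
sketch under the assumption that the state file contains exactly the line
shown above; the function name ib_port_active is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given port state file reports
# "4: ACTIVE", the string shown in the example above.
ib_port_active() {
    state_file=$1
    # A missing or unreadable file counts as "not active".
    state=$(cat "$state_file" 2>/dev/null) || return 1
    [ "$state" = "4: ACTIVE" ]
}
```

A call might look like
"ib_port_active /sys/class/infiniband/mthca0/ports/1/state", with mthca0
replaced by the name of your device.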
|
||||
|
||||
To further test the InfiniBand software stack, use IPoIB (this
|
||||
assumes you have two IB hosts named host1 and host2):
|
||||
|
||||
host1$ ifconfig ib0 a.b.c.x
|
||||
host2$ ifconfig ib0 a.b.c.y
|
||||
host1$ ping a.b.c.y
|
||||
host2$ ping a.b.c.x
|
||||
|
||||
For other device types, follow the appropriate procedures.
|
||||
|
||||
- Check NFS Setup
|
||||
|
||||
For the NFS components enabled above (client and/or server),
|
||||
test their functionality over standard Ethernet using TCP/IP or UDP/IP.
|
||||
|
||||
NFS/RDMA Setup
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
We recommend that you use two machines, one to act as the client and
|
||||
one to act as the server.
|
||||
|
||||
One time configuration:
|
||||
|
||||
- On the server system, configure the /etc/exports file and
|
||||
start the NFS/RDMA server.
|
||||
|
||||
Exports entries with the following formats have been tested:
|
||||
|
||||
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
|
||||
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
|
||||
|
||||
The IP address(es) is(are) the client's IPoIB address for an InfiniBand
|
||||
HCA or the client's iWARP address(es) for an RNIC.
|
||||
|
||||
NOTE: The "insecure" option must be used because the NFS/RDMA client does
|
||||
not use a reserved port.
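
Because a missing "insecure" option is easy to overlook, it can be worth
checking an exports entry before starting the server. The following sketch
inspects a single entry of the form shown above (one client specification
followed by a parenthesized option list); the function name export_has_option
is hypothetical, and entries with several client specifications are not
handled.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exports entry "$1" lists option "$2"
# inside its parenthesized, comma-separated option list.
export_has_option() {
    entry=$1
    option=$2
    opts=${entry#*\(}      # drop everything up to the first "("
    opts=${opts%\)*}       # drop the closing ")" and anything after it
    # Wrap in commas so that only whole options match.
    case ",$opts," in
        *",$option,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With the first tested entry above, checking for "insecure" succeeds, while
checking for "secure" fails because only whole option names are matched.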
|
||||
|
||||
Each time a machine boots:
|
||||
|
||||
- Load and configure the RDMA drivers
|
||||
|
||||
For InfiniBand using a Mellanox adapter:
|
||||
|
||||
$ modprobe ib_mthca
|
||||
$ modprobe ib_ipoib
|
||||
$ ifconfig ib0 a.b.c.d
|
||||
|
||||
NOTE: use unique addresses for the client and server
|
||||
|
||||
- Start the NFS server
|
||||
|
||||
If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
|
||||
kernel config), load the RDMA transport module:
|
||||
|
||||
$ modprobe svcrdma
|
||||
|
||||
Regardless of how the server was built (module or built-in), start the
|
||||
server:
|
||||
|
||||
$ /etc/init.d/nfs start
|
||||
|
||||
or
|
||||
|
||||
$ service nfs start
|
||||
|
||||
Instruct the server to listen on the RDMA transport:
|
||||
|
||||
$ echo rdma 20049 > /proc/fs/nfsd/portlist
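
To confirm that the listener was added, the same control file can be read
back. The sketch below assumes that reading the file yields one
"<transport> <port>" pair per line, which is not stated in this document; the
function name nfsd_listens_on_rdma is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the portlist file "$1" contains a line
# "rdma <port>". Defaults: /proc/fs/nfsd/portlist and port 20049.
# Assumes one "<transport> <port>" pair per line.
nfsd_listens_on_rdma() {
    portlist_file=${1:-/proc/fs/nfsd/portlist}
    port=${2:-20049}
    # -x matches the whole line, -q suppresses output.
    grep -qx "rdma $port" "$portlist_file"
}
```

Called without arguments on the server, it checks the live control file for
the port used in the command above.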
|
||||
|
||||
- On the client system
|
||||
|
||||
If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
|
||||
kernel config), load the RDMA client module:
|
||||
|
||||
$ modprobe xprtrdma
|
||||
|
||||
Regardless of how the client was built (module or built-in), use this
|
||||
command to mount the NFS/RDMA server:
|
||||
|
||||
$ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
|
||||
|
||||
To verify that the mount is using RDMA, run "cat /proc/mounts" and check
|
||||
the "proto" field for the given mount.
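
The same check can be done non-interactively. This is a minimal sketch that
assumes the usual /proc/mounts layout (device, mount point, filesystem type,
comma-separated options) and that an RDMA mount shows "proto=rdma" among its
options; the function name mount_uses_rdma is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if mount point "$1" is an NFS mount whose
# options include proto=rdma. "$2" is the mounts file (default /proc/mounts).
mount_uses_rdma() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        # Fields: device, mount point, fs type, options, dump, pass
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "proto=rdma")
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
}
```

For the mount command shown above, "mount_uses_rdma /mnt" would be expected
to succeed once the mount is established over RDMA.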
|
||||
|
||||
Congratulations! You're using NFS/RDMA!
|
98
Documentation/filesystems/nfs/nfs.txt
Normal file
@@ -0,0 +1,98 @@
|
||||
|
||||
The NFS client
|
||||
==============
|
||||
|
||||
The NFS version 2 protocol was first documented in RFC1094 (March 1989).
|
||||
Since then two more major releases of NFS have been published, with NFSv3
|
||||
being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
|
||||
2003).
|
||||
|
||||
The Linux NFS client currently supports all the above published versions,
|
||||
and work is in progress on adding support for minor version 1 of the NFSv4
|
||||
protocol.
|
||||
|
||||
The purpose of this document is to provide information on some of the
|
||||
upcall interfaces that are used in order to provide the NFS client with
|
||||
some of the information that it requires in order to fully comply with
|
||||
the NFS spec.
|
||||
|
||||
The DNS resolver
|
||||
================
|
||||
|
||||
NFSv4 allows for one server to refer the NFS client to data that has been
|
||||
migrated onto another server by means of the special "fs_locations"
|
||||
attribute. See
|
||||
http://tools.ietf.org/html/rfc3530#section-6
|
||||
and
|
||||
http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
|
||||
|
||||
The fs_locations information can take the form of either an ip address and
|
||||
a path, or a DNS hostname and a path. The latter requires the NFS client to
|
||||
do a DNS lookup in order to mount the new volume, and hence the need for an
|
||||
upcall to allow userland to provide this service.
|
||||
|
||||
Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
|
||||
/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
|
||||
|
||||
(1) The process checks the dns_resolve cache to see if it contains a
|
||||
valid entry. If so, it returns that entry and exits.
|
||||
|
||||
(2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
|
||||
(may be changed using the 'nfs.cache_getent' kernel boot parameter)
|
||||
is run, with two arguments:
|
||||
- the cache name, "dns_resolve"
|
||||
- the hostname to resolve
|
||||
|
||||
(3) After looking up the corresponding ip address, the helper script
|
||||
writes the result into the rpc_pipefs pseudo-file
|
||||
'/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
|
||||
in the following (text) format:
|
||||
|
||||
"<ip address> <hostname> <ttl>\n"
|
||||
|
||||
Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6
|
||||
(ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
|
||||
<hostname> is identical to the second argument of the helper
|
||||
script, and <ttl> is the 'time to live' of this cache entry (in
|
||||
units of seconds).
|
||||
|
||||
Note: If <ip address> is invalid, say the string "0", then a negative
|
||||
entry is created, which will cause the kernel to treat the hostname
|
||||
as having no valid DNS translation.
|
||||
|
||||
|
||||
|
||||
|
||||
A basic sample /sbin/nfs_cache_getent
|
||||
=====================================
|
||||
|
||||
#!/bin/bash
#
ttl=600
#
cut=/usr/bin/cut
getent=/usr/bin/getent
rpc_pipefs=/var/lib/nfs/rpc_pipefs
#
die()
{
	echo "Usage: $0 cache_name entry_name"
	exit 1
}

[ $# -lt 2 ] && die
cachename="$1"
cache_path=${rpc_pipefs}/cache/${cachename}/channel

case "${cachename}" in
	dns_resolve)
		name="$2"
		result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
		[ -z "${result}" ] && result="0"
		;;
	*)
		die
		;;
esac
echo "${result} ${name} ${ttl}" >${cache_path}
|
||||
|
222
Documentation/filesystems/nfs/nfs41-server.txt
Normal file
@@ -0,0 +1,222 @@
|
||||
NFSv4.1 Server Implementation
|
||||
|
||||
Server support for minorversion 1 can be controlled using the
|
||||
/proc/fs/nfsd/versions control file. The string output returned
|
||||
by reading this file will contain either "+4.1" or "-4.1"
|
||||
correspondingly.
|
||||
|
||||
Currently, server support for minorversion 1 is disabled by default.
|
||||
It can be enabled at run time by writing the string "+4.1" to
|
||||
the /proc/fs/nfsd/versions control file. Note that to write this
|
||||
control file, the nfsd service must be taken down. Use your user-mode
|
||||
nfs-utils to set this up; see rpc.nfsd(8).
|
||||
|
||||
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
|
||||
"-4", respectively. Therefore, code meant to work on both new and old
|
||||
kernels must turn 4.1 on or off *before* turning support for version 4
|
||||
on or off; rpc.nfsd does this correctly.)
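
A script can check the current setting by looking for the "+4.1" token in the
string read from the control file. The sketch below assumes the string is a
whitespace-separated list of tokens such as "+3 +4 -4.1"; this document only
states that it contains either "+4.1" or "-4.1". The function name
nfsd_minorversion1_enabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the versions string "$1" contains the
# token "+4.1". Assumes whitespace-separated tokens.
nfsd_minorversion1_enabled() {
    versions=$1
    # Unquoted on purpose: split the string into tokens on whitespace.
    for token in $versions; do
        [ "$token" = "+4.1" ] && return 0
    done
    return 1
}
```

On a running server it might be called as
nfsd_minorversion1_enabled "$(cat /proc/fs/nfsd/versions)". Changing the
setting is best left to rpc.nfsd, which handles the ordering issue described
in the warning above.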
|
||||
|
||||
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
|
||||
on the latest NFSv4.1 Internet Draft:
|
||||
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
|
||||
|
||||
From the many new features in NFSv4.1 the current implementation
|
||||
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
|
||||
"exactly once" semantics and better control and throttling of the
|
||||
resources allocated for each client.
|
||||
|
||||
Other NFSv4.1 features, Parallel NFS operations in particular,
|
||||
are still under development out of tree.
|
||||
See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
|
||||
for more information.
|
||||
|
||||
The current implementation is intended for developers only: while it
|
||||
does support ordinary file operations on clients we have tested against
|
||||
(including the linux client), it is incomplete in ways which may limit
|
||||
features unexpectedly, cause known bugs in rare cases, or cause
|
||||
interoperability problems with future clients. Known issues:
|
||||
|
||||
- gss support is questionable: currently mounts with kerberos
|
||||
from a linux client are possible, but we aren't really
|
||||
conformant with the spec (for example, we don't use kerberos
|
||||
on the backchannel correctly).
|
||||
- no trunking support: no clients currently take advantage of
|
||||
trunking, but this is a mandatory feature, and its use is
|
||||
recommended to clients in a number of places. (E.g. to ensure
|
||||
timely renewal in case an existing connection's retry timeouts
|
||||
have gotten too long; see section 8.3 of the draft.)
|
||||
Therefore, lack of this feature may cause future clients to
|
||||
fail.
|
||||
- Incomplete backchannel support: incomplete backchannel gss
|
||||
support and no support for BACKCHANNEL_CTL mean that
|
||||
callbacks (hence delegations and layouts) may not be
|
||||
available and clients confused by the incomplete
|
||||
implementation may fail.
|
||||
- Server reboot recovery is unsupported; if the server reboots,
|
||||
clients may fail.
|
||||
- We do not support SSV, which provides security for shared
|
||||
client-server state (thus preventing unauthorized tampering
|
||||
with locks and opens, for example). It is mandatory for
|
||||
servers to support this, though no clients use it yet.
|
||||
- Mandatory operations which we do not support, such as
|
||||
DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and
|
||||
TEST_STATEID, are not currently used by clients, but will be
|
||||
(and the spec recommends their uses in common cases), and
|
||||
clients should not be expected to know how to recover from the
|
||||
case where they are not supported. This will eventually cause
|
||||
interoperability failures.
|
||||
|
||||
In addition, some limitations are inherited from the current NFSv4
|
||||
implementation:
|
||||
|
||||
- Incomplete delegation enforcement: if a file is renamed or
|
||||
unlinked, a client holding a delegation may continue to
|
||||
indefinitely allow opens of the file under the old name.
|
||||
|
||||
The table below, taken from the NFSv4.1 document, lists
|
||||
the operations that are mandatory to implement (REQ), optional
|
||||
(OPT), and NFSv4.0 operations that must not be implemented (MNI)
|
||||
in minor version 1. The first column indicates the operations that
|
||||
are not supported yet by the linux server implementation.
|
||||
|
||||
The OPTIONAL features identified and their abbreviations are as follows:
|
||||
    pNFS    Parallel NFS
    FDELG   File Delegations
    DDELG   Directory Delegations
|
||||
|
||||
The following abbreviations indicate the linux server implementation status.
|
||||
    I       Implemented NFSv4.1 operations.
    NS      Not Supported.
    NS*     unimplemented optional feature.
    P       pNFS features implemented out of tree.
    PNS     pNFS features that are not supported yet (out of tree).
|
||||
|
||||
Operations
|
||||
|
||||
    +----------------------+------------+--------------+----------------+
    | Operation            | REQ, REC,  | Feature      | Definition     |
    |                      | OPT, or    | (REQ, REC,   |                |
    |                      | MNI        | or OPT)      |                |
    +----------------------+------------+--------------+----------------+
    | ACCESS               | REQ        |              | Section 18.1   |
NS  | BACKCHANNEL_CTL      | REQ        |              | Section 18.33  |
NS  | BIND_CONN_TO_SESSION | REQ        |              | Section 18.34  |
    | CLOSE                | REQ        |              | Section 18.2   |
    | COMMIT               | REQ        |              | Section 18.3   |
    | CREATE               | REQ        |              | Section 18.4   |
I   | CREATE_SESSION       | REQ        |              | Section 18.36  |
NS* | DELEGPURGE           | OPT        | FDELG (REQ)  | Section 18.5   |
    | DELEGRETURN          | OPT        | FDELG,       | Section 18.6   |
    |                      |            | DDELG, pNFS  |                |
    |                      |            | (REQ)        |                |
NS  | DESTROY_CLIENTID     | REQ        |              | Section 18.50  |
I   | DESTROY_SESSION      | REQ        |              | Section 18.37  |
I   | EXCHANGE_ID          | REQ        |              | Section 18.35  |
NS  | FREE_STATEID         | REQ        |              | Section 18.38  |
    | GETATTR              | REQ        |              | Section 18.7   |
P   | GETDEVICEINFO        | OPT        | pNFS (REQ)   | Section 18.40  |
P   | GETDEVICELIST        | OPT        | pNFS (OPT)   | Section 18.41  |
    | GETFH                | REQ        |              | Section 18.8   |
NS* | GET_DIR_DELEGATION   | OPT        | DDELG (REQ)  | Section 18.39  |
P   | LAYOUTCOMMIT         | OPT        | pNFS (REQ)   | Section 18.42  |
P   | LAYOUTGET            | OPT        | pNFS (REQ)   | Section 18.43  |
P   | LAYOUTRETURN         | OPT        | pNFS (REQ)   | Section 18.44  |
    | LINK                 | OPT        |              | Section 18.9   |
    | LOCK                 | REQ        |              | Section 18.10  |
    | LOCKT                | REQ        |              | Section 18.11  |
    | LOCKU                | REQ        |              | Section 18.12  |
    | LOOKUP               | REQ        |              | Section 18.13  |
    | LOOKUPP              | REQ        |              | Section 18.14  |
    | NVERIFY              | REQ        |              | Section 18.15  |
    | OPEN                 | REQ        |              | Section 18.16  |
NS* | OPENATTR             | OPT        |              | Section 18.17  |
    | OPEN_CONFIRM         | MNI        |              | N/A            |
    | OPEN_DOWNGRADE       | REQ        |              | Section 18.18  |
    | PUTFH                | REQ        |              | Section 18.19  |
    | PUTPUBFH             | REQ        |              | Section 18.20  |
    | PUTROOTFH            | REQ        |              | Section 18.21  |
    | READ                 | REQ        |              | Section 18.22  |
    | READDIR              | REQ        |              | Section 18.23  |
    | READLINK             | OPT        |              | Section 18.24  |
NS  | RECLAIM_COMPLETE     | REQ        |              | Section 18.51  |
    | RELEASE_LOCKOWNER    | MNI        |              | N/A            |
    | REMOVE               | REQ        |              | Section 18.25  |
    | RENAME               | REQ        |              | Section 18.26  |
    | RENEW                | MNI        |              | N/A            |
    | RESTOREFH            | REQ        |              | Section 18.27  |
    | SAVEFH               | REQ        |              | Section 18.28  |
    | SECINFO              | REQ        |              | Section 18.29  |
NS  | SECINFO_NO_NAME      | REC        | pNFS files   | Section 18.45, |
    |                      |            | layout (REQ) | Section 13.12  |
I   | SEQUENCE             | REQ        |              | Section 18.46  |
    | SETATTR              | REQ        |              | Section 18.30  |
    | SETCLIENTID          | MNI        |              | N/A            |
    | SETCLIENTID_CONFIRM  | MNI        |              | N/A            |
NS  | SET_SSV              | REQ        |              | Section 18.47  |
NS  | TEST_STATEID         | REQ        |              | Section 18.48  |
    | VERIFY               | REQ        |              | Section 18.31  |
NS* | WANT_DELEGATION      | OPT        | FDELG (OPT)  | Section 18.49  |
    | WRITE                | REQ        |              | Section 18.32  |
    +----------------------+------------+--------------+----------------+
|
||||
|
||||
Callback Operations
|
||||
|
||||
    +-------------------------+-----------+-------------+---------------+
    | Operation               | REQ, REC, | Feature     | Definition    |
    |                         | OPT, or   | (REQ, REC,  |               |
    |                         | MNI       | or OPT)     |               |
    +-------------------------+-----------+-------------+---------------+
    | CB_GETATTR              | OPT       | FDELG (REQ) | Section 20.1  |
P   | CB_LAYOUTRECALL         | OPT       | pNFS (REQ)  | Section 20.3  |
NS* | CB_NOTIFY               | OPT       | DDELG (REQ) | Section 20.4  |
P   | CB_NOTIFY_DEVICEID      | OPT       | pNFS (OPT)  | Section 20.12 |
NS* | CB_NOTIFY_LOCK          | OPT       |             | Section 20.11 |
NS* | CB_PUSH_DELEG           | OPT       | FDELG (OPT) | Section 20.5  |
    | CB_RECALL               | OPT       | FDELG,      | Section 20.2  |
    |                         |           | DDELG, pNFS |               |
    |                         |           | (REQ)       |               |
NS* | CB_RECALL_ANY           | OPT       | FDELG,      | Section 20.6  |
    |                         |           | DDELG, pNFS |               |
    |                         |           | (REQ)       |               |
NS  | CB_RECALL_SLOT          | REQ       |             | Section 20.8  |
NS* | CB_RECALLABLE_OBJ_AVAIL | OPT       | DDELG, pNFS | Section 20.7  |
    |                         |           | (REQ)       |               |
I   | CB_SEQUENCE             | OPT       | FDELG,      | Section 20.9  |
    |                         |           | DDELG, pNFS |               |
    |                         |           | (REQ)       |               |
NS* | CB_WANTS_CANCELLED      | OPT       | FDELG,      | Section 20.10 |
    |                         |           | DDELG, pNFS |               |
    |                         |           | (REQ)       |               |
    +-------------------------+-----------+-------------+---------------+
|
||||
|
||||
Implementation notes:

DELEGPURGE:
* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
  CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
  persist across client reboots). Thus we need not implement this for
  now.

EXCHANGE_ID:
* only SP4_NONE state protection supported
* implementation ids are ignored

CREATE_SESSION:
* backchannel attributes are ignored
* backchannel security parameters are ignored

SEQUENCE:
* no support for dynamic slot table renegotiation (optional)

NFSv4.1 COMPOUND rules:
The following cases aren't supported yet:
* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION,
  DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID.
* DESTROY_SESSION MUST be the final operation in the COMPOUND request.

Nonstandard compound limitations:
* No support for a session's fore channel RPC compound that requires both a
  ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
  fail to live up to the promise we made in CREATE_SESSION fore channel
  negotiation.
* No more than one IO operation (read, write, readdir) allowed per
  compound.

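The COMPOUND rules and limitations listed in the implementation notes can be
modelled roughly as follows. This is only a sketch of one reading of those
rules, not the server code: check_compound() and its message strings are
hypothetical, operations are represented by name only, and session ids are
not compared when checking the DESTROY_SESSION rule.

```python
# Hypothetical checker for illustration; this is not the Linux nfsd code.
SOLO_OPS = {
    "BIND_CONN_TO_SESSION", "CREATE_SESSION", "DESTROY_CLIENTID",
    "DESTROY_SESSION", "EXCHANGE_ID",
}
IO_OPS = {"READ", "WRITE", "READDIR"}


def check_compound(ops):
    """Return a list of rule violations for a COMPOUND's operation names."""
    problems = []
    starts_with_sequence = bool(ops) and ops[0] == "SEQUENCE"

    # Without a leading SEQUENCE, each of these ops must be the only one.
    if not starts_with_sequence and len(ops) > 1 and SOLO_OPS.intersection(ops):
        problems.append("NFS4ERR_NOT_ONLY_OP")

    # Simplification: the session ids of SEQUENCE and DESTROY_SESSION
    # are not compared here.
    if "DESTROY_SESSION" in ops and ops[-1] != "DESTROY_SESSION":
        problems.append("DESTROY_SESSION is not the final operation")

    # Nonstandard limitation: at most one IO operation per compound.
    if sum(op in IO_OPS for op in ops) > 1:
        problems.append("more than one IO operation")

    return problems
```

For example, check_compound(["SEQUENCE", "PUTFH", "READ"]) reports no
problems, while a compound holding both EXCHANGE_ID and CREATE_SESSION is
flagged with NFS4ERR_NOT_ONLY_OP.
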
270
Documentation/filesystems/nfs/nfsroot.txt
Normal file
@@ -0,0 +1,270 @@
Mounting the root filesystem via NFS (nfsroot)
===============================================

Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
Updated 2006 by Horms <horms@verge.net.au>



In order to use a diskless system, such as an X-terminal or printer server,
it is necessary for the root filesystem to be present on a non-disk device.
This may be an initramfs (see
Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see
Documentation/initrd.txt) or a filesystem mounted via NFS. The following
text describes how to use NFS for the root filesystem. For the rest of this
text 'client' means the diskless system, and 'server' means the NFS server.



1.) Enabling nfsroot capabilities
    -----------------------------

In order to use nfsroot, NFS client support needs to be selected as
built-in during configuration. Once this has been selected, the nfsroot
option will become available, which should also be selected.

In the networking options, kernel level autoconfiguration can be selected,
along with the types of autoconfiguration to support. Selecting all of
DHCP, BOOTP and RARP is safe.



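A quick way to confirm that a kernel configuration has these options built
in is to scan its .config file. The sketch below assumes the Kconfig symbols
are CONFIG_NFS_FS, CONFIG_ROOT_NFS and the CONFIG_IP_PNP family; the text
above does not name them, and the function names are made up for this
example.

```python
# Assumed Kconfig symbols; the surrounding text does not name them.
REQUIRED = ("CONFIG_NFS_FS", "CONFIG_ROOT_NFS")
AUTOCONF = ("CONFIG_IP_PNP", "CONFIG_IP_PNP_DHCP",
            "CONFIG_IP_PNP_BOOTP", "CONFIG_IP_PNP_RARP")


def builtin_options(config_text):
    """Return the set of options set to 'y' (built in) in .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set".
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


def check_nfsroot_config(config_text):
    """Return (missing required options, autoconf options built in)."""
    enabled = builtin_options(config_text)
    missing = [opt for opt in REQUIRED if opt not in enabled]
    protocols = [opt for opt in AUTOCONF if opt in enabled]
    return missing, protocols
```

An option set to "m" counts as missing here, since nfsroot needs NFS client
support built in rather than loaded as a module.
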
2.) Kernel command line
    -------------------

When the kernel has been loaded by a boot loader (see below) it needs to be
told what root fs device to use and, in the case of nfsroot, where to find
both the server and the name of the directory on the server to mount as root.
This can be established using the following kernel command line parameters:


root=/dev/nfs

  This is necessary to enable the pseudo-NFS-device. Note that it's not a
  real device but just a synonym to tell the kernel to use NFS instead of
  a real device.


nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]

  If the `nfsroot' parameter is NOT given on the command line,
  the default "/tftpboot/%s" will be used.

  <server-ip>   Specifies the IP address of the NFS server.
                The default address is determined by the `ip' parameter
                (see below). This parameter allows the use of different
                servers for IP autoconfiguration and NFS.

  <root-dir>    Name of the directory on the server to mount as root.
                If there is a "%s" token in the string, it will be
                replaced by the ASCII-representation of the client's
                IP address.

  <nfs-options> Standard NFS options. All options are separated by commas.
                The following defaults are used:
                        port            = as given by server portmap daemon
                        rsize           = 4096
                        wsize           = 4096
                        timeo           = 7
                        retrans         = 3
                        acregmin        = 3
                        acregmax        = 60
                        acdirmin        = 30
                        acdirmax        = 60
                        flags           = hard, nointr, noposix, cto, ac


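How an nfsroot value breaks down into its three parts can be sketched as
below. This is an illustration of the syntax described above, not the
kernel's parser: parse_nfsroot() is a hypothetical helper, and it assumes
that <root-dir> always begins with "/" so that a leading "<server-ip>:" can
be told apart from the path.

```python
DEFAULT_NFSROOT = "/tftpboot/%s"


def parse_nfsroot(value, client_ip, default_server):
    """Split nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] into parts."""
    if not value:
        # Parameter not given: fall back to the documented default.
        value = DEFAULT_NFSROOT
    # Everything after the first comma is the option list.
    path, _, options = value.partition(",")
    server = default_server
    # Assumption: the directory starts with "/", so anything before a
    # colon in a string that does not start with "/" is the server address.
    if ":" in path and not path.startswith("/"):
        server, _, path = path.partition(":")
    # Replace the "%s" token with the client's IP address.
    path = path.replace("%s", client_ip, 1)
    return server, path, (options.split(",") if options else [])
```

With client address 192.168.1.10 and no nfsroot value, the sketch yields the
directory /tftpboot/192.168.1.10 on the default server.
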
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>

  This parameter tells the kernel how to configure IP addresses of devices
  and also how to set up the IP routing table. It was originally called
  `nfsaddrs', but now the boot-time IP configuration works independently of
  NFS, so it was renamed to `ip' and the old name remained as an alias for
  compatibility reasons.

  If this parameter is missing from the kernel command line, all fields are
  assumed to be empty, and the defaults mentioned below apply. In general
  this means that the kernel tries to configure everything using
  autoconfiguration.

  The <autoconf> parameter can appear alone as the value to the `ip'
  parameter (without all the ':' characters before). If the value is
  "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
  autoconfiguration will take place. The most common way to use this
  is "ip=dhcp".

  <client-ip>   IP address of the client.

                Default: Determined using autoconfiguration.

  <server-ip>   IP address of the NFS server. If RARP is used to determine
                the client address and this parameter is NOT empty, only
                replies from the specified server are accepted.

                Only required for NFS root. That is, autoconfiguration
                will not be triggered if it is missing and NFS root is not
                in operation.

                Default: Determined using autoconfiguration.
                         The address of the autoconfiguration server is used.

  <gw-ip>       IP address of a gateway if the server is on a different subnet.

                Default: Determined using autoconfiguration.

  <netmask>     Netmask for local network interface. If unspecified
                the netmask is derived from the client IP address assuming
                classful addressing.

                Default: Determined using autoconfiguration.

  <hostname>    Name of the client. May be supplied by autoconfiguration,
                but its absence will not trigger autoconfiguration.

                Default: Client IP address is used in ASCII notation.

  <device>      Name of network device to use.

                Default: If the host only has one device, it is used.
                         Otherwise the device is determined using
                         autoconfiguration. This is done by sending
                         autoconfiguration requests out of all devices,
                         and using the device that received the first reply.

  <autoconf>    Method to use for autoconfiguration. In the case of options
                which specify multiple autoconfiguration protocols,
                requests are sent using all protocols, and the first one
                to reply is used.

                Only autoconfiguration protocols that have been compiled
                into the kernel will be used, regardless of the value of
                this option.

                  off or none: don't use autoconfiguration
                               (do static IP assignment instead)
                  on or any:   use any protocol available in the kernel
                               (default)
                  dhcp:        use DHCP
                  bootp:       use BOOTP
                  rarp:        use RARP
                  both:        use both BOOTP and RARP but not DHCP
                               (old option kept for backwards compatibility)

                Default: any



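The field layout of the `ip' parameter and the classful netmask fallback can
be illustrated with a short sketch. Both helpers are hypothetical and only
mirror the description above; in particular, rejecting addresses outside
classes A, B and C is a choice made for this example.

```python
FIELDS = ("client_ip", "server_ip", "gw_ip", "netmask",
          "hostname", "device", "autoconf")
AUTOCONF_VALUES = {"off", "none", "on", "any", "dhcp", "bootp", "rarp", "both"}


def parse_ip_param(value):
    """Parse the value of ip= into a dict; missing fields stay empty."""
    fields = dict.fromkeys(FIELDS, "")
    if value in AUTOCONF_VALUES:
        # <autoconf> given alone, for example "ip=dhcp".
        fields["autoconf"] = value
        return fields
    # Otherwise the value is a colon-separated list in FIELDS order.
    for name, part in zip(FIELDS, value.split(":")):
        fields[name] = part
    return fields


def classful_netmask(client_ip):
    """Derive a netmask from the address class of client_ip."""
    first_octet = int(client_ip.split(".")[0])
    if first_octet < 128:
        return "255.0.0.0"        # class A
    if first_octet < 192:
        return "255.255.0.0"      # class B
    if first_octet < 224:
        return "255.255.255.0"    # class C
    raise ValueError("no classful netmask for " + client_ip)
```

For instance, "ip=:::::eth0:dhcp" leaves every address field empty and sets
only the device and the autoconfiguration method.
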
3.) Boot Loader
    -----------

To get the kernel into memory different approaches can be used.
They depend on various facilities being available:


3.1)  Booting from a floppy using syslinux

        When building kernels, an easy way to create a boot floppy that uses
        syslinux is to use the zdisk or bzdisk make targets which use zImage
        and bzImage images respectively. Both targets accept the
        FDARGS parameter which can be used to set the kernel command line.

        e.g.
           make bzdisk FDARGS="root=/dev/nfs"

        Note that the user running this command will need to have
        access to the floppy drive device, /dev/fd0.

        For more information on syslinux, including how to create bootdisks
        for prebuilt kernels, see http://syslinux.zytor.com/

        N.B: Previously it was possible to write a kernel directly to
             a floppy using dd, configure the boot device using rdev, and
             boot using the resulting floppy. Linux no longer supports this
             method of booting.

3.2) Booting from a cdrom using isolinux

        When building kernels, an easy way to create a bootable cdrom that
        uses isolinux is to use the isoimage target which uses a bzImage
        image. Like zdisk and bzdisk, this target accepts the FDARGS
        parameter which can be used to set the kernel command line.

        e.g.
          make isoimage FDARGS="root=/dev/nfs"

        The resulting iso image will be arch/<ARCH>/boot/image.iso.
        This can be written to a cdrom using a variety of tools including
        cdrecord.

        e.g.
          cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso

        For more information on isolinux, including how to create bootdisks
        for prebuilt kernels, see http://syslinux.zytor.com/

3.3) Using LILO
        When using LILO all the necessary command line parameters may be
        specified using the 'append=' directive in the LILO configuration
        file.

        However, to use the 'root=' directive you also need to create
        a dummy root device, which may be removed after LILO is run.

        mknod /dev/boot255 c 0 255

        For information on configuring LILO, please refer to its documentation.

3.4) Using GRUB
        When using GRUB, kernel parameters are simply appended after the kernel
        specification: kernel <kernel> <parameters>

3.5) Using loadlin
        loadlin may be used to boot Linux from a DOS command prompt without
        requiring a local hard disk to mount as root. This has not been
        thoroughly tested by the authors of this document, but in general
        it should be possible to configure the kernel command line similarly
        to the configuration of LILO.

        Please refer to the loadlin documentation for further information.

3.6) Using a boot ROM
        This is probably the most elegant way of booting a diskless client.
        With a boot ROM the kernel is loaded using the TFTP protocol. The
        authors of this document are not aware of any commercial boot
        ROMs that support booting Linux over the network. However, there
        are two free implementations of a boot ROM, netboot-nfs and
        etherboot, both of which are available on sunsite.unc.edu, and both
        of which contain everything you need to boot a diskless Linux client.

3.7) Using pxelinux
        Pxelinux may be used to boot Linux using the PXE boot loader
        which is present on many modern network cards.

        When using pxelinux, the kernel image is specified using
        "kernel <relative-path-below /tftpboot>". The nfsroot parameters
        are passed to the kernel by adding them to the "append" line.
        It is common to use a serial console in conjunction with pxelinux,
        see Documentation/serial-console.txt for more information.

        For more information on pxelinux, including how to create bootdisks
        for prebuilt kernels, see http://syslinux.zytor.com/



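As an illustration of the "kernel" and "append" lines used with pxelinux, the
sketch below renders a single configuration entry that boots with an NFS
root. pxelinux_entry() is a hypothetical helper; the label, kernel path and
server address in the usage note are made-up values, and the exact file name
and location of the configuration depend on the TFTP server setup.

```python
def pxelinux_entry(label, kernel_path, nfsroot, ip="dhcp"):
    """Render one pxelinux config entry that boots with an NFS root."""
    # Kernel command line parameters go on the "append" line.
    append = " ".join(["root=/dev/nfs", "nfsroot=" + nfsroot, "ip=" + ip])
    lines = [
        "default " + label,
        "label " + label,
        "  kernel " + kernel_path,   # path relative to the TFTP root
        "  append " + append,
    ]
    return "\n".join(lines) + "\n"
```

Calling pxelinux_entry("linux", "vmlinuz", "192.168.1.1:/srv/nfsroot")
produces an entry whose append line reads
"root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot ip=dhcp".
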
4.) Credits
    -------

  The nfsroot code in the kernel and the RARP support have been written
  by Gero Kuhlmann <gero@gkminix.han.de>.

  The rest of the IP layer autoconfiguration code has been written
  by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.

  In order to write the initial version of nfsroot I would like to thank
  Jens-Uwe Mager <jum@anubis.han.de> for his help.