nfs: new subdir Documentation/filesystems/nfs

We're adding enough nfs documentation that it may as well have its own
subdirectory.

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>

Documentation/filesystems/nfs/00-INDEX

@@ -0,0 +1,12 @@
00-INDEX
- this file (nfs-related documentation).
Exporting
- explanation of how to make filesystems exportable.
nfs.txt
- nfs client, and DNS resolution for fs_locations.
nfs41-server.txt
- info on the Linux server implementation of NFSv4 minor version 1.
nfs-rdma.txt
- how to install and set up the Linux NFS/RDMA client and server software.
nfsroot.txt
- short guide on setting up a diskless box with NFS root filesystem.

Documentation/filesystems/nfs/Exporting

@@ -0,0 +1,147 @@
Making Filesystems Exportable
=============================
Overview
--------
All filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server reboots (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.
The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.
A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".
Dcache Issues
-------------
The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).
However, when objects are included in the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.
1/ The dcache must sometimes contain objects that are not part of the
proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
to already have a (non-connected) dentry, and must be able to move
that dentry into place (based on the parent and name in the
->lookup). This is particularly needed for directories as
it is a dcache invariant that directories only have one dentry.
To implement these features, the dcache has:
a/ A dentry flag DCACHE_DISCONNECTED which is set on
any dentry that might not be part of the proper prefix.
This is set when anonymous dentries are created, and cleared when a
dentry is noticed to be a child of a dentry which is in the proper
prefix.
b/ A per-superblock list "s_anon" of dentries which are the roots of
subtrees that are not in the proper prefix. These dentries, as
well as the proper prefix, need to be released at unmount time. As
these dentries will not be hashed, they are linked together on the
d_hash list_head.
c/ Helper routines to allocate anonymous dentries, and to help attach
loose directory dentries at lookup time. They are:
d_alloc_anon(inode) will return a dentry for the given inode.
If the inode already has a dentry, one of those is returned.
If it doesn't, a new anonymous (IS_ROOT and
DCACHE_DISCONNECTED) dentry is allocated and attached.
In the case of a directory, care is taken that only one dentry
can ever be attached.
d_splice_alias(inode, dentry) will make sure that there is a
dentry with the same name and parent as the given dentry, and
which refers to the given inode.
If the inode is a directory and already has a dentry, then that
dentry is d_moved over the given dentry.
If the passed dentry gets attached, care is taken that this is
mutually exclusive to a d_alloc_anon operation.
If the passed dentry is used, NULL is returned, else the used
dentry is returned. This corresponds to the calling pattern of
->lookup.
Filesystem Issues
-----------------
For a filesystem to be exportable it must:
1/ provide the filehandle fragment routines described below.
2/ make sure that d_splice_alias is used rather than d_add
when ->lookup finds an inode for a given parent and name.
Typically the ->lookup routine will end with a:
return d_splice_alias(inode, dentry);
}
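
For concreteness, here is a minimal sketch of such a ->lookup method for a
hypothetical "examplefs" (the examplefs_* helpers are invented for
illustration, and the usual <linux/fs.h> and <linux/err.h> declarations are
assumed); the point is the final d_splice_alias call:

static struct dentry *examplefs_lookup(struct inode *dir,
                                       struct dentry *dentry,
                                       struct nameidata *nd)
{
        struct inode *inode = NULL;
        ino_t ino;

        /* examplefs_inode_by_name() is a hypothetical helper that searches
         * the directory for the given name and returns its inode number,
         * or 0 if the name does not exist. */
        ino = examplefs_inode_by_name(dir, &dentry->d_name);
        if (ino) {
                inode = examplefs_iget(dir->i_sb, ino);
                if (IS_ERR(inode))
                        return ERR_CAST(inode);
        }

        /* d_splice_alias() rather than d_add(), so that an existing
         * (possibly disconnected) dentry for a directory inode is reused
         * instead of a second dentry being created for it. */
        return d_splice_alias(inode, dentry);
}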
A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
encode_fh (optional)
Takes a dentry and creates a filehandle fragment which can later be used
to find or create a dentry for the same object. The default
implementation creates a filehandle fragment that encodes a 32-bit inode
number and generation number for the inode being encoded, and if necessary
the same information for the parent.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
create a dentry for it (possibly with d_alloc_anon).
fh_to_parent (optional but strongly recommended)
Given a filehandle fragment, this should find the parent of the
implied object and create a dentry for it (possibly with d_alloc_anon).
May fail if the filehandle fragment is too small.
get_parent (optional but strongly recommended)
When given a dentry for a directory, this should return a dentry for
the parent. Quite possibly the parent dentry will have been allocated
by d_alloc_anon. The default get_parent function just returns an error
so any filehandle lookup that requires finding a parent will fail.
->lookup("..") is *not* used as a default as it can leave ".." entries
in the dcache which are too messy to work with.
get_name (optional)
When given a parent dentry and a child dentry, this should find a name
in the directory identified by the parent dentry, which leads to the
object identified by the child dentry. If no get_name function is
supplied, a default implementation is provided which uses vfs_readdir
to find potential names, and matches inode numbers to find the correct
match.
A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decoding routines (fh_to_dentry and fh_to_parent) should not depend on
the stated size that is passed to them. This size may be larger than the
original filehandle generated by encode_fh, in which case it will have been
padded with nuls. Rather, the encode_fh routine should choose a "type" which
indicates to the decoding routines how much of the filehandle is valid, and
how it should be interpreted.
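
As a hedged illustration of how these pieces fit together, here is a sketch
for the same hypothetical "examplefs" (all examplefs_* names are invented;
the fragment layout, two words holding the inode number and generation,
mirrors the default encoding described above):

#include <linux/fs.h>
#include <linux/err.h>
#include <linux/exportfs.h>

#define EXAMPLEFS_FH_TYPE       1       /* our only filehandle "type" */

static int examplefs_encode_fh(struct dentry *dentry, __u32 *fh,
                               int *max_len, int connectable)
{
        struct inode *inode = dentry->d_inode;

        if (*max_len < 2)
                return 255;     /* fragment does not fit in the buffer */

        /* Parent information is omitted for brevity; a full implementation
         * would also honour the 'connectable' argument. */
        fh[0] = inode->i_ino;
        fh[1] = inode->i_generation;
        *max_len = 2;
        return EXAMPLEFS_FH_TYPE;
}

static struct dentry *examplefs_fh_to_dentry(struct super_block *sb,
                                             struct fid *fid,
                                             int fh_len, int fh_type)
{
        struct inode *inode;

        if (fh_type != EXAMPLEFS_FH_TYPE || fh_len < 2)
                return NULL;

        /* examplefs_iget() is a hypothetical helper that returns the inode
         * for a given (inode number, generation) pair. */
        inode = examplefs_iget(sb, fid->raw[0], fid->raw[1]);
        if (IS_ERR(inode))
                return ERR_CAST(inode);

        /* d_obtain_alias() is the in-tree successor of the d_alloc_anon()
         * helper described above. */
        return d_obtain_alias(inode);
}

static const struct export_operations examplefs_export_ops = {
        .encode_fh      = examplefs_encode_fh,
        .fh_to_dentry   = examplefs_fh_to_dentry,
};

The filesystem would then point s_export_op at examplefs_export_ops when
filling in its struct super_block.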

Documentation/filesystems/nfs/nfs-rdma.txt

@@ -0,0 +1,271 @@
################################################################################
# #
# NFS/RDMA README #
# #
################################################################################
Author: NetApp and Open Grid Computing
Date: May 29, 2008
Table of Contents
~~~~~~~~~~~~~~~~~
- Overview
- Getting Help
- Installation
- Check RDMA and NFS Setup
- NFS/RDMA Setup
Overview
~~~~~~~~
This document describes how to install and set up the Linux NFS/RDMA client
and server software.
The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
was first included in the following release, Linux 2.6.25.
In our testing, we have obtained excellent performance results (full 10Gbit
wire bandwidth at minimal client CPU) under many workloads. The code passes
the full Connectathon test suite and operates over both Infiniband and iWARP
RDMA adapters.
Getting Help
~~~~~~~~~~~~
If you get stuck, you can ask questions on the
nfs-rdma-devel@lists.sourceforge.net
mailing list.
Installation
~~~~~~~~~~~~
These instructions are a step by step guide to building a machine for
use with NFS/RDMA.
- Install an RDMA device
Any device supported by the drivers in drivers/infiniband/hw is acceptable.
Testing has been performed using several Mellanox-based IB cards, the
Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
- Install a Linux distribution and tools
The first kernel release to contain both the NFS/RDMA client and server was
Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
Linux kernel releases should be installed.
The procedures described in this document have been tested with
distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
- Install nfs-utils-1.1.2 or greater on the client
An NFS/RDMA mount point can be obtained by using the mount.nfs command in
nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
version with support for NFS/RDMA mounts, but for various reasons we
recommend using nfs-utils-1.1.2 or greater). To see which version of
mount.nfs you are using, type:
$ /sbin/mount.nfs -V
If the version is less than 1.1.2 or the command does not exist,
you should install the latest version of nfs-utils.
Download the latest package from:
http://www.kernel.org/pub/linux/utils/nfs
Uncompress the package and follow the installation instructions.
If you will not need the idmapper and gssd executables (you do not need
these to create an NFS/RDMA enabled mount command), the installation
process can be simplified by disabling these features when running
configure:
$ ./configure --disable-gss --disable-nfsv4
To build nfs-utils you will need the tcp_wrappers package installed. For
more information on this see the package's README and INSTALL files.
After building the nfs-utils package, there will be a mount.nfs binary in
the utils/mount directory. This binary can be used to initiate NFS v2, v3,
or v4 mounts. To initiate a v4 mount, the binary must be called
mount.nfs4. The standard technique is to create a symlink called
mount.nfs4 to mount.nfs.
This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
$ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
In this location, mount.nfs will be invoked automatically for NFS mounts
by the system mount command.
NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
on the NFS client machine. You do not need this specific version of
nfs-utils on the server. Furthermore, only the mount.nfs command from
nfs-utils-1.1.2 is needed on the client.
- Install a Linux kernel with NFS/RDMA
The NFS/RDMA client and server are both included in the mainline Linux
kernel version 2.6.25 and later. This and other versions of the 2.6 Linux
kernel can be found at:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/
Download the sources and place them in an appropriate location.
- Configure the RDMA stack
Make sure your kernel configuration has RDMA support enabled. Under
Device Drivers -> InfiniBand support, update the kernel configuration
to enable InfiniBand support [NOTE: the option name is misleading. Enabling
InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
iWARP adapter support (amso, cxgb3, etc.).
If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
- Configure the NFS client and server
Your kernel configuration must also have NFS file system support and/or
NFS server support enabled. These and other NFS related configuration
options can be found under File Systems -> Network File Systems.
- Build, install, reboot
The NFS/RDMA code will be enabled automatically if NFS and RDMA
are turned on. The NFS/RDMA client and server are configured via the hidden
SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
value of SUNRPC_XPRT_RDMA will be:
- N if either SUNRPC or INFINIBAND is N; in this case the NFS/RDMA client
and server will not be built
- M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M;
in this case the NFS/RDMA client and server will be built as modules
- Y if both SUNRPC and INFINIBAND are Y; in this case the NFS/RDMA client
and server will be built into the kernel
Therefore, if you have followed the steps above and turned on NFS and RDMA,
the NFS/RDMA client and server will be built.
Build a new kernel, install it, boot it.
Check RDMA and NFS Setup
~~~~~~~~~~~~~~~~~~~~~~~~
Before configuring the NFS/RDMA software, it is a good idea to test
your new kernel to ensure that the kernel is working correctly.
In particular, it is a good idea to verify that the RDMA stack
is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
is working properly.
- Check RDMA Setup
If you built the RDMA components as modules, load them at
this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
card:
$ modprobe ib_mthca
$ modprobe ib_ipoib
If you are using InfiniBand, make sure there is a Subnet Manager (SM)
running on the network. If your IB switch has an embedded SM, you can
use it. Otherwise, you will need to run an SM, such as OpenSM, on one
of your end nodes.
If an SM is running on your network, you should see the following:
$ cat /sys/class/infiniband/driverX/ports/1/state
4: ACTIVE
where driverX is mthca0, ipath5, ehca3, etc.
To further test the InfiniBand software stack, use IPoIB (this
assumes you have two IB hosts named host1 and host2):
host1$ ifconfig ib0 a.b.c.x
host2$ ifconfig ib0 a.b.c.y
host1$ ping a.b.c.y
host2$ ping a.b.c.x
For other device types, follow the appropriate procedures.
- Check NFS Setup
For the NFS components enabled above (client and/or server),
test their functionality over standard Ethernet using TCP/IP or UDP/IP.
NFS/RDMA Setup
~~~~~~~~~~~~~~
We recommend that you use two machines, one to act as the client and
one to act as the server.
One time configuration:
- On the server system, configure the /etc/exports file and
start the NFS/RDMA server.
Exports entries with the following formats have been tested:
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
The IP address(es) listed are the client's IPoIB address(es) for an InfiniBand
HCA or the client's iWARP address(es) for an RNIC.
NOTE: The "insecure" option must be used because the NFS/RDMA client does
not use a reserved port.
Each time a machine boots:
- Load and configure the RDMA drivers
For InfiniBand using a Mellanox adapter:
$ modprobe ib_mthca
$ modprobe ib_ipoib
$ ifconfig ib0 a.b.c.d
NOTE: use unique addresses for the client and server
- Start the NFS server
If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
kernel config), load the RDMA transport module:
$ modprobe svcrdma
Regardless of how the server was built (module or built-in), start the
server:
$ /etc/init.d/nfs start
or
$ service nfs start
Instruct the server to listen on the RDMA transport:
$ echo rdma 20049 > /proc/fs/nfsd/portlist
- On the client system
If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
kernel config), load the RDMA client module:
$ modprobe xprtrdma
Regardless of how the client was built (module or built-in), use this
command to mount the NFS/RDMA server:
$ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
To verify that the mount is using RDMA, run "cat /proc/mounts" and check
the "proto" field for the given mount.
Congratulations! You're using NFS/RDMA!

Documentation/filesystems/nfs/nfs.txt

@@ -0,0 +1,98 @@
The NFS client
==============
The NFS version 2 protocol was first documented in RFC1094 (March 1989).
Since then two more major releases of NFS have been published, with NFSv3
being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
2003).
The Linux NFS client currently supports all the above published versions,
and work is in progress on adding support for minor version 1 of the NFSv4
protocol.
The purpose of this document is to provide information on some of the
upcall interfaces that are used in order to provide the NFS client with
some of the information that it requires in order to fully comply with
the NFS spec.
The DNS resolver
================
NFSv4 allows for one server to refer the NFS client to data that has been
migrated onto another server by means of the special "fs_locations"
attribute. See
http://tools.ietf.org/html/rfc3530#section-6
and
http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
The fs_locations information can take the form of either an ip address and
a path, or a DNS hostname and a path. The latter requires the NFS client to
do a DNS lookup in order to mount the new volume, and hence the need for an
upcall to allow userland to provide this service.
Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
(1) The process checks the dns_resolve cache to see if it contains a
valid entry. If so, it returns that entry and exits.
(2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
(may be changed using the 'nfs.cache_getent' kernel boot parameter)
is run, with two arguments:
- the cache name, "dns_resolve"
- the hostname to resolve
(3) After looking up the corresponding ip address, the helper script
writes the result into the rpc_pipefs pseudo-file
'/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
in the following (text) format:
"<ip address> <hostname> <ttl>\n"
Where <ip address> is in the usual IPv4 (e.g. 192.168.78.90) or IPv6
(ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
<hostname> is identical to the second argument of the helper
script, and <ttl> is the 'time to live' of this cache entry (in
units of seconds).
Note: If <ip address> is invalid, say the string "0", then a negative
entry is created, which will cause the kernel to treat the hostname
as having no valid DNS translation.
A basic sample /sbin/nfs_cache_getent
=====================================
#!/bin/bash
#
ttl=600
#
cut=/usr/bin/cut
getent=/usr/bin/getent
rpc_pipefs=/var/lib/nfs/rpc_pipefs
#
die()
{
echo "Usage: $0 cache_name entry_name"
exit 1
}
[ $# -lt 2 ] && die
cachename="$1"
cache_path=${rpc_pipefs}/cache/${cachename}/channel
case "${cachename}" in
dns_resolve)
name="$2"
result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
[ -z "${result}" ] && result="0"
;;
*)
die
;;
esac
echo "${result} ${name} ${ttl}" >${cache_path}

Documentation/filesystems/nfs/nfs41-server.txt

@@ -0,0 +1,222 @@
NFSv4.1 Server Implementation
Server support for minorversion 1 can be controlled using the
/proc/fs/nfsd/versions control file. The string output returned
by reading this file will contain either "+4.1" or "-4.1"
correspondingly.
Currently, server support for minorversion 1 is disabled by default.
It can be enabled at run time by writing the string "+4.1" to
the /proc/fs/nfsd/versions control file. Note that to write this
control file, the nfsd service must be taken down. Use your user-mode
nfs-utils to set this up; see rpc.nfsd(8)
(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and
"-4", respectively. Therefore, code meant to work on both new and old
kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.)
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on the latest NFSv4.1 Internet Draft:
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
Of the many new features in NFSv4.1, the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
"exactly once" semantics and better control and throttling of the
resources allocated for each client.
Other NFSv4.1 features, Parallel NFS operations in particular,
are still under development out of tree.
See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design
for more information.
The current implementation is intended for developers only: while it
does support ordinary file operations on clients we have tested against
(including the linux client), it is incomplete in ways which may limit
features unexpectedly, cause known bugs in rare cases, or cause
interoperability problems with future clients. Known issues:
- gss support is questionable: currently mounts with kerberos
from a linux client are possible, but we aren't really
conformant with the spec (for example, we don't use kerberos
on the backchannel correctly).
- no trunking support: no clients currently take advantage of
trunking, but this is a mandatory feature, and its use is
recommended to clients in a number of places. (E.g. to ensure
timely renewal in case an existing connection's retry timeouts
have gotten too long; see section 8.3 of the draft.)
Therefore, lack of this feature may cause future clients to
fail.
- Incomplete backchannel support: incomplete backchannel gss
support and no support for BACKCHANNEL_CTL mean that
callbacks (hence delegations and layouts) may not be
available and clients confused by the incomplete
implementation may fail.
- Server reboot recovery is unsupported; if the server reboots,
clients may fail.
- We do not support SSV, which provides security for shared
client-server state (thus preventing unauthorized tampering
with locks and opens, for example). It is mandatory for
servers to support this, though no clients use it yet.
- Mandatory operations which we do not support, such as
DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and
TEST_STATEID, are not currently used by clients, but will be
(and the spec recommends their use in common cases), and
clients should not be expected to know how to recover from the
case where they are not supported. This will eventually cause
interoperability failures.
In addition, some limitations are inherited from the current NFSv4
implementation:
- Incomplete delegation enforcement: if a file is renamed or
unlinked, a client holding a delegation may continue to
indefinitely allow opens of the file under the old name.
The table below, taken from the NFSv4.1 document, lists
the operations that are mandatory to implement (REQ), optional
(OPT), and the NFSv4.0 operations that must not be implemented (MNI)
in minor version 1. The first column indicates the operations that
are not supported yet by the linux server implementation.
The OPTIONAL features identified and their abbreviations are as follows:
pNFS Parallel NFS
FDELG File Delegations
DDELG Directory Delegations
The following abbreviations indicate the linux server implementation status.
I Implemented NFSv4.1 operations.
NS Not Supported.
NS* unimplemented optional feature.
P pNFS features implemented out of tree.
PNS pNFS features that are not supported yet (out of tree).
Operations
+----------------------+------------+--------------+----------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+----------------------+------------+--------------+----------------+
| ACCESS | REQ | | Section 18.1 |
NS | BACKCHANNEL_CTL | REQ | | Section 18.33 |
NS | BIND_CONN_TO_SESSION | REQ | | Section 18.34 |
| CLOSE | REQ | | Section 18.2 |
| COMMIT | REQ | | Section 18.3 |
| CREATE | REQ | | Section 18.4 |
I | CREATE_SESSION | REQ | | Section 18.36 |
NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 |
| DELEGRETURN | OPT | FDELG, | Section 18.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | DESTROY_CLIENTID | REQ | | Section 18.50 |
I | DESTROY_SESSION | REQ | | Section 18.37 |
I | EXCHANGE_ID | REQ | | Section 18.35 |
NS | FREE_STATEID | REQ | | Section 18.38 |
| GETATTR | REQ | | Section 18.7 |
P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 |
P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 |
| GETFH | REQ | | Section 18.8 |
NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 |
P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 |
P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 |
P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 |
| LINK | OPT | | Section 18.9 |
| LOCK | REQ | | Section 18.10 |
| LOCKT | REQ | | Section 18.11 |
| LOCKU | REQ | | Section 18.12 |
| LOOKUP | REQ | | Section 18.13 |
| LOOKUPP | REQ | | Section 18.14 |
| NVERIFY | REQ | | Section 18.15 |
| OPEN | REQ | | Section 18.16 |
NS*| OPENATTR | OPT | | Section 18.17 |
| OPEN_CONFIRM | MNI | | N/A |
| OPEN_DOWNGRADE | REQ | | Section 18.18 |
| PUTFH | REQ | | Section 18.19 |
| PUTPUBFH | REQ | | Section 18.20 |
| PUTROOTFH | REQ | | Section 18.21 |
| READ | REQ | | Section 18.22 |
| READDIR | REQ | | Section 18.23 |
| READLINK | OPT | | Section 18.24 |
NS | RECLAIM_COMPLETE | REQ | | Section 18.51 |
| RELEASE_LOCKOWNER | MNI | | N/A |
| REMOVE | REQ | | Section 18.25 |
| RENAME | REQ | | Section 18.26 |
| RENEW | MNI | | N/A |
| RESTOREFH | REQ | | Section 18.27 |
| SAVEFH | REQ | | Section 18.28 |
| SECINFO | REQ | | Section 18.29 |
NS | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, |
| | | layout (REQ) | Section 13.12 |
I | SEQUENCE | REQ | | Section 18.46 |
| SETATTR | REQ | | Section 18.30 |
| SETCLIENTID | MNI | | N/A |
| SETCLIENTID_CONFIRM | MNI | | N/A |
NS | SET_SSV | REQ | | Section 18.47 |
NS | TEST_STATEID | REQ | | Section 18.48 |
| VERIFY | REQ | | Section 18.31 |
NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 |
| WRITE | REQ | | Section 18.32 |
Callback Operations
+-------------------------+-----------+-------------+---------------+
| Operation | REQ, REC, | Feature | Definition |
| | OPT, or | (REQ, REC, | |
| | MNI | or OPT) | |
+-------------------------+-----------+-------------+---------------+
| CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 |
P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 |
NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 |
P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 |
NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 |
NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 |
| CB_RECALL | OPT | FDELG, | Section 20.2 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS | CB_RECALL_SLOT | REQ | | Section 20.8 |
NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 |
| | | (REQ) | |
I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 |
| | | DDELG, pNFS | |
| | | (REQ) | |
NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 |
| | | DDELG, pNFS | |
| | | (REQ) | |
+-------------------------+-----------+-------------+---------------+
Implementation notes:
DELEGPURGE:
* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or
CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that
persist across client reboots). Thus we need not implement this for
now.
EXCHANGE_ID:
* only SP4_NONE state protection supported
* implementation ids are ignored
CREATE_SESSION:
* backchannel attributes are ignored
* backchannel security parameters are ignored
SEQUENCE:
* no support for dynamic slot table renegotiation (optional)
nfsv4.1 COMPOUND rules:
The following cases aren't supported yet:
* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION,
DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID.
* DESTROY_SESSION MUST be the final operation in the COMPOUND request.
Nonstandard compound limitations:
* No support for a sessions fore channel RPC compound that requires both a
ca_maxrequestsize request and a ca_maxresponsesize reply, so we may
fail to live up to the promise we made in CREATE_SESSION fore channel
negotiation.
* No more than one IO operation (read, write, readdir) allowed per
compound.

Documentation/filesystems/nfs/nfsroot.txt

@@ -0,0 +1,270 @@
Mounting the root filesystem via NFS (nfsroot)
===============================================
Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
Updated 2006 by Horms <horms@verge.net.au>
In order to use a diskless system, such as an X-terminal or printer server
for example, it is necessary for the root filesystem to be present on a
non-disk device. This may be an initramfs (see Documentation/filesystems/
ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a
filesystem mounted via NFS. The following text describes how to use NFS
for the root filesystem. For the rest of this text 'client' means the
diskless system, and 'server' means the NFS server.
1.) Enabling nfsroot capabilities
-----------------------------
In order to use nfsroot, NFS client support needs to be selected as
built-in during configuration. Once this has been selected, the nfsroot
option will become available, which should also be selected.
In the networking options, kernel level autoconfiguration can be selected,
along with the types of autoconfiguration to support. Selecting all of
DHCP, BOOTP and RARP is safe.
2.) Kernel command line
-------------------
When the kernel has been loaded by a boot loader (see below) it needs to be
told what root fs device to use, and, in the case of nfsroot, where to find
both the server and the name of the directory on the server to mount as root.
This can be established using the following kernel command line parameters:
root=/dev/nfs
This is necessary to enable the pseudo-NFS-device. Note that it's not a
real device but just a synonym to tell the kernel to use NFS instead of
a real device.
nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
If the `nfsroot' parameter is NOT given on the command line,
the default "/tftpboot/%s" will be used.
<server-ip> Specifies the IP address of the NFS server.
The default address is determined by the `ip' parameter
(see below). This parameter allows the use of different
servers for IP autoconfiguration and NFS.
<root-dir> Name of the directory on the server to mount as root.
If there is a "%s" token in the string, it will be
replaced by the ASCII-representation of the client's
IP address.
<nfs-options> Standard NFS options. All options are separated by commas.
The following defaults are used:
port = as given by server portmap daemon
rsize = 4096
wsize = 4096
timeo = 7
retrans = 3
acregmin = 3
acregmax = 60
acdirmin = 30
acdirmax = 60
flags = hard, nointr, noposix, cto, ac
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
This parameter tells the kernel how to configure IP addresses of devices
and also how to set up the IP routing table. It was originally called
`nfsaddrs', but now the boot-time IP configuration works independently of
NFS, so it was renamed to `ip' and the old name remained as an alias for
compatibility reasons.
If this parameter is missing from the kernel command line, all fields are
assumed to be empty, and the defaults mentioned below apply. In general
this means that the kernel tries to configure everything using
autoconfiguration.
The <autoconf> parameter can appear alone as the value to the `ip'
parameter (without all the ':' characters before). If the value is
"ip=off" or "ip=none", no autoconfiguration will take place, otherwise
autoconfiguration will take place. The most common way to use this
is "ip=dhcp".
<client-ip> IP address of the client.
Default: Determined using autoconfiguration.
<server-ip> Specifies the IP address of the NFS server. If RARP is used
to determine the client address and this parameter is NOT
empty, only replies from the specified server are accepted.
Only required for NFS root. That is, autoconfiguration
will not be triggered if it is missing and NFS root is not
in operation.
Default: Determined using autoconfiguration.
The address of the autoconfiguration server is used.
<gw-ip> IP address of a gateway if the server is on a different subnet.
Default: Determined using autoconfiguration.
<netmask> Netmask for local network interface. If unspecified
the netmask is derived from the client IP address assuming
classful addressing.
Default: Determined using autoconfiguration.
<hostname> Name of the client. May be supplied by autoconfiguration,
but its absence will not trigger autoconfiguration.
Default: Client IP address is used in ASCII notation.
<device> Name of network device to use.
Default: If the host only has one device, it is used.
Otherwise the device is determined using
autoconfiguration. This is done by sending
autoconfiguration requests out of all devices,
and using the device that received the first reply.
<autoconf> Method to use for autoconfiguration. In the case of options
which specify multiple autoconfiguration protocols,
requests are sent using all protocols, and the first one
to reply is used.
Only autoconfiguration protocols that have been compiled
into the kernel will be used, regardless of the value of
this option.
off or none: don't use autoconfiguration
(do static IP assignment instead)
on or any: use any protocol available in the kernel
(default)
dhcp: use DHCP
bootp: use BOOTP
rarp: use RARP
both: use both BOOTP and RARP but not DHCP
(old option kept for backwards compatibility)
Default: any
3.) Boot Loader
----------
To get the kernel into memory different approaches can be used.
They depend on various facilities being available:
3.1) Booting from a floppy using syslinux
When building kernels, an easy way to create a boot floppy that uses
syslinux is to use the zdisk or bzdisk make targets which use zImage
and bzImage images respectively. Both targets accept the
FDARGS parameter which can be used to set the kernel command line.
e.g.
make bzdisk FDARGS="root=/dev/nfs"
Note that the user running this command will need to have
access to the floppy drive device, /dev/fd0
For more information on syslinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
N.B: Previously it was possible to write a kernel directly to
a floppy using dd, configure the boot device using rdev, and
boot using the resulting floppy. Linux no longer supports this
method of booting.
3.2) Booting from a cdrom using isolinux
When building kernels, an easy way to create a bootable cdrom that
uses isolinux is to use the isoimage target which uses a bzimage
image. Like zdisk and bzdisk, this target accepts the FDARGS
parameter which can be used to set the kernel command line.
e.g.
make isoimage FDARGS="root=/dev/nfs"
The resulting iso image will be arch/<ARCH>/boot/image.iso
This can be written to a cdrom using a variety of tools including
cdrecord.
e.g.
cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso
For more information on isolinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
3.3) Using LILO
When using LILO all the necessary command line parameters may be
specified using the 'append=' directive in the LILO configuration
file.
However, to use the 'root=' directive you also need to create
a dummy root device, which may be removed after LILO is run.
mknod /dev/boot255 c 0 255
For information on configuring LILO, please refer to its documentation.
3.4) Using GRUB
When using GRUB, kernel parameters are simply appended after the kernel
specification: kernel <kernel> <parameters>
3.5) Using loadlin
loadlin may be used to boot Linux from a DOS command prompt without
requiring a local hard disk to mount as root. This has not been
thoroughly tested by the authors of this document, but in general
it should be possible to configure the kernel command line similarly
to the configuration of LILO.
Please refer to the loadlin documentation for further information.
3.6) Using a boot ROM
This is probably the most elegant way of booting a diskless client.
With a boot ROM the kernel is loaded using the TFTP protocol. The
authors of this document are not aware of any commercial boot
ROMs that support booting Linux over the network. However, there
are two free implementations of a boot ROM, netboot-nfs and
etherboot, both of which are available on sunsite.unc.edu, and both
of which contain everything you need to boot a diskless Linux client.
3.7) Using pxelinux
Pxelinux may be used to boot linux using the PXE boot loader
which is present on many modern network cards.
When using pxelinux, the kernel image is specified using
"kernel <relative-path-below /tftpboot>". The nfsroot parameters
are passed to the kernel by adding them to the "append" line.
It is common to use a serial console in conjunction with pxelinux;
see Documentation/serial-console.txt for more information.
For more information on pxelinux, including how to create bootdisks
for prebuilt kernels, see http://syslinux.zytor.com/
4.) Credits
-------
The nfsroot code in the kernel and the RARP support have been written
by Gero Kuhlmann <gero@gkminix.han.de>.
The rest of the IP layer autoconfiguration code has been written
by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
In order to write the initial version of nfsroot I would like to thank
Jens-Uwe Mager <jum@anubis.han.de> for his help.