Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (41 commits)
  NFS: Add mount options to enable local caching on NFS
  NFS: Display local caching state
  NFS: Store pages from an NFS inode into a local cache
  NFS: Read pages from FS-Cache into an NFS inode
  NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
  NFS: Add read context retention for FS-Cache to call back with
  NFS: FS-Cache page management
  NFS: Add some new I/O counters for FS-Cache doing things for NFS
  NFS: Invalidate FsCache page flags when cache removed
  NFS: Use local disk inode cache
  NFS: Define and create inode-level cache objects
  NFS: Define and create superblock-level objects
  NFS: Define and create server-level objects
  NFS: Register NFS for caching and retrieve the top-level index
  NFS: Permit local filesystem caching to be enabled for NFS
  NFS: Add FS-Cache option bit and debug bit
  NFS: Add comment banners to some NFS functions
  FS-Cache: Make kAFS use FS-Cache
  CacheFiles: A cache that backs onto a mounted filesystem
  CacheFiles: Export things for CacheFiles
  ...
658
Documentation/filesystems/caching/backend-api.txt
Normal file
@@ -0,0 +1,658 @@
|
||||
==========================
|
||||
FS-CACHE CACHE BACKEND API
|
||||
==========================
|
||||
|
||||
The FS-Cache system provides an API by which actual caches can be supplied to
|
||||
FS-Cache for it to then serve out to network filesystems and other interested
|
||||
parties.
|
||||
|
||||
This API is declared in <linux/fscache-cache.h>.
|
||||
|
||||
|
||||
====================================
|
||||
INITIALISING AND REGISTERING A CACHE
|
||||
====================================
|
||||
|
||||
To start off, a cache definition must be initialised and registered for each
|
||||
cache the backend wants to make available. For instance, CacheFS does this in
|
||||
the fill_super() operation on mounting.
|
||||
|
||||
The cache definition (struct fscache_cache) should be initialised by calling:
|
||||
|
||||
	void fscache_init_cache(struct fscache_cache *cache,
				struct fscache_cache_ops *ops,
				const char *idfmt,
				...);
|
||||
|
||||
Where:
|
||||
|
||||
(*) "cache" is a pointer to the cache definition;
|
||||
|
||||
(*) "ops" is a pointer to the table of operations that the backend supports on
|
||||
this cache; and
|
||||
|
||||
(*) "idfmt" is a format and printf-style arguments for constructing a label
|
||||
for the cache.
|
||||
|
||||
|
||||
The cache should then be registered with FS-Cache by passing a pointer to the
|
||||
previously initialised cache definition to:
|
||||
|
||||
	int fscache_add_cache(struct fscache_cache *cache,
			      struct fscache_object *fsdef,
			      const char *tagname);
|
||||
|
||||
Two extra arguments should also be supplied:
|
||||
|
||||
(*) "fsdef" which should point to the object representation for the FS-Cache
|
||||
master index in this cache. Netfs primary index entries will be created
|
||||
here. FS-Cache keeps the caller's reference to the index object if
|
||||
successful and will release it upon withdrawal of the cache.
|
||||
|
||||
(*) "tagname" which, if given, should be a text string naming this cache. If
|
||||
this is NULL, the identifier will be used instead. For CacheFS, the
|
||||
identifier is set to name the underlying block device and the tag can be
|
||||
supplied by mount.
|
||||
|
||||
This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
|
||||
is already in use. 0 will be returned on success.
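As a rough illustration of the return-value contract above, the following
userspace sketch models a tag registry. It is not kernel code: struct
demo_cache, demo_add_cache() and the fixed-size table are hypothetical
stand-ins, and only the 0 / -EEXIST / -ENOMEM behaviour and the fallback from
a NULL tag to the identifier are taken from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_CACHES 4	/* hypothetical limit standing in for memory */

struct demo_cache {
	const char *identifier;	/* label built when the cache was initialised */
	const char *tag;	/* tag under which the cache was registered */
};

static struct demo_cache *demo_registry[DEMO_MAX_CACHES];
static size_t demo_nr_caches;

/* Model of registration: a NULL tagname falls back to the identifier. */
int demo_add_cache(struct demo_cache *cache, const char *tagname)
{
	const char *tag;
	size_t i;

	assert(cache != NULL && cache->identifier != NULL);
	tag = tagname ? tagname : cache->identifier;

	for (i = 0; i < demo_nr_caches; i++)
		if (strcmp(demo_registry[i]->tag, tag) == 0)
			return -EEXIST;	/* tag already in use */

	if (demo_nr_caches == DEMO_MAX_CACHES)
		return -ENOMEM;		/* stand-in for an allocation failure */

	cache->tag = tag;
	demo_registry[demo_nr_caches++] = cache;
	return 0;
}
```

A caller would normally treat any non-zero return as a failed registration and
tear down whatever it had set up for the cache.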
|
||||
|
||||
|
||||
=====================
|
||||
UNREGISTERING A CACHE
|
||||
=====================
|
||||
|
||||
A cache can be withdrawn from the system by calling this function with a
|
||||
pointer to the cache definition:
|
||||
|
||||
void fscache_withdraw_cache(struct fscache_cache *cache);
|
||||
|
||||
In CacheFS's case, this is called by put_super().
|
||||
|
||||
|
||||
========
|
||||
SECURITY
|
||||
========
|
||||
|
||||
The cache methods are executed in one of two contexts:
|
||||
|
||||
(1) that of the userspace process that issued the netfs operation that caused
|
||||
the cache method to be invoked, or
|
||||
|
||||
(2) that of one of the processes in the FS-Cache thread pool.
|
||||
|
||||
In either case, this may not be an appropriate context in which to access the
|
||||
cache.
|
||||
|
||||
The calling process's fsuid, fsgid and SELinux security identities may need to
|
||||
be masqueraded for the duration of the cache driver's access to the cache.
|
||||
This is left to the cache to handle; FS-Cache makes no effort in this regard.
|
||||
|
||||
|
||||
===================================
|
||||
CONTROL AND STATISTICS PRESENTATION
|
||||
===================================
|
||||
|
||||
The cache may present data to the outside world through FS-Cache's interfaces
|
||||
in sysfs and procfs - the former for control and the latter for statistics.
|
||||
|
||||
A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
|
||||
is enabled. This is accessible through the kobject struct fscache_cache::kobj
|
||||
and is for use by the cache as it sees fit.
|
||||
|
||||
|
||||
========================
|
||||
RELEVANT DATA STRUCTURES
|
||||
========================
|
||||
|
||||
(*) Index/Data file FS-Cache representation cookie:
|
||||
|
||||
	struct fscache_cookie {
		struct fscache_object_def	*def;
		struct fscache_netfs		*netfs;
		void				*netfs_data;
		...
	};
|
||||
|
||||
The fields that might be of use to the backend describe the object
|
||||
definition, the netfs definition and the netfs's data for this cookie.
|
||||
The object definition contains functions supplied by the netfs for loading
|
||||
and matching index entries; these are required to provide some of the
|
||||
cache operations.
|
||||
|
||||
|
||||
(*) In-cache object representation:
|
||||
|
||||
	struct fscache_object {
		int				debug_id;
		enum {
			FSCACHE_OBJECT_RECYCLING,
			...
		}				state;
		spinlock_t			lock;
		struct fscache_cache		*cache;
		struct fscache_cookie		*cookie;
		...
	};
|
||||
|
||||
Structures of this type should be allocated by the cache backend and
|
||||
passed to FS-Cache when requested by the appropriate cache operation. In
|
||||
the case of CacheFS, they're embedded in CacheFS's internal object
|
||||
structures.
|
||||
|
||||
The debug_id is a simple integer that can be used in debugging messages
|
||||
that refer to a particular object. In such a case it should be printed
|
||||
using "OBJ%x" to be consistent with FS-Cache.
|
||||
|
||||
Each object contains a pointer to the cookie that represents the object it
|
||||
is backing. An object should be retired when put_object() is called if it is
|
||||
in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
|
||||
initialised by calling fscache_object_init(object).
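As a small illustration of the debug_id convention described above, the
following userspace sketch formats an object label. demo_format_object_label()
is a hypothetical helper and not part of FS-Cache; only the "OBJ%x" format is
taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the debugging label for an object into buf.
 * Returns the number of characters the full label needs, as snprintf() does.
 */
int demo_format_object_label(char *buf, size_t len, int debug_id)
{
	assert(buf != NULL && len > 0);

	/* %x expects an unsigned value, so convert explicitly. */
	return snprintf(buf, len, "OBJ%x", (unsigned int)debug_id);
}
```

Printing every object the same way makes it easier to correlate backend
messages with those emitted by FS-Cache itself.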
|
||||
|
||||
|
||||
(*) FS-Cache operation record:
|
||||
|
||||
	struct fscache_operation {
		atomic_t		usage;
		struct fscache_object	*object;
		unsigned long		flags;
	#define FSCACHE_OP_EXCLUSIVE
		void (*processor)(struct fscache_operation *op);
		void (*release)(struct fscache_operation *op);
		...
	};
|
||||
|
||||
FS-Cache has a pool of threads that it uses to give CPU time to the
|
||||
various asynchronous operations that need to be done as part of driving
|
||||
the cache. These are represented by the above structure. The processor
|
||||
method is called to give the op CPU time, and the release method to get
|
||||
rid of it when its usage count reaches 0.
|
||||
|
||||
An operation can be made exclusive upon an object by setting the
|
||||
appropriate flag before enqueuing it with fscache_enqueue_operation(). If
|
||||
an operation needs more processing time, it should be enqueued again.
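The life cycle described above can be modelled in userspace roughly as
follows. This is only a sketch: struct demo_operation, its work_left and
requeue fields, and demo_run_operation() are hypothetical, and the loop stands
in for a pool thread picking the operation up again after it has been
re-enqueued.

```c
#include <assert.h>
#include <stddef.h>

struct demo_operation {
	int usage;		/* reference count */
	int work_left;		/* hypothetical units of work outstanding */
	int requeue;		/* set by the processor to ask for more time */
	int released;		/* set once the release method has run */
	void (*processor)(struct demo_operation *op);
	void (*release)(struct demo_operation *op);
};

/* Processor: do one unit of work, then ask to be run again if needed. */
void demo_processor(struct demo_operation *op)
{
	if (op->work_left > 0)
		op->work_left--;
	op->requeue = op->work_left > 0;
}

void demo_release(struct demo_operation *op)
{
	op->released = 1;
}

/* Drop a reference; the release method runs when the count reaches 0. */
void demo_put_operation(struct demo_operation *op)
{
	assert(op->usage > 0);
	if (--op->usage == 0)
		op->release(op);
}

/* Stand-in for a pool thread: returns how many times the op was run. */
int demo_run_operation(struct demo_operation *op)
{
	int passes = 0;

	do {
		op->requeue = 0;
		op->processor(op);
		passes++;
	} while (op->requeue);

	demo_put_operation(op);
	return passes;
}
```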
|
||||
|
||||
|
||||
(*) FS-Cache retrieval operation record:
|
||||
|
||||
	struct fscache_retrieval {
		struct fscache_operation op;
		struct address_space	*mapping;
		struct list_head	*to_do;
		...
	};
|
||||
|
||||
A structure of this type is allocated by FS-Cache to record retrieval and
|
||||
allocation requests made by the netfs. This struct is then passed to the
|
||||
backend to do the operation. The backend may get extra refs to it by
|
||||
calling fscache_get_retrieval() and refs may be discarded by calling
|
||||
fscache_put_retrieval().
|
||||
|
||||
A retrieval operation can be used by the backend to do retrieval work. To
|
||||
do this, the retrieval->op.processor method pointer should be set
|
||||
appropriately by the backend and fscache_enqueue_retrieval() called to
|
||||
submit it to the thread pool. CacheFiles, for example, uses this to queue
|
||||
page examination when it detects PG_lock being cleared.
|
||||
|
||||
The to_do field is an empty list available for the cache backend to use as
|
||||
it sees fit.
|
||||
|
||||
|
||||
(*) FS-Cache storage operation record:
|
||||
|
||||
	struct fscache_storage {
		struct fscache_operation op;
		pgoff_t			store_limit;
		...
	};
|
||||
|
||||
A structure of this type is allocated by FS-Cache to record outstanding
|
||||
writes to be made. FS-Cache itself enqueues this operation and invokes
|
||||
the write_page() method on the object at appropriate times to effect
|
||||
storage.
|
||||
|
||||
|
||||
================
|
||||
CACHE OPERATIONS
|
||||
================
|
||||
|
||||
The cache backend provides FS-Cache with a table of operations that can be
|
||||
performed on the denizens of the cache. These are held in a structure of type:
|
||||
|
||||
struct fscache_cache_ops
|
||||
|
||||
(*) Name of cache provider [mandatory]:
|
||||
|
||||
const char *name
|
||||
|
||||
This isn't strictly an operation, but should be pointed at a string naming
|
||||
the backend.
|
||||
|
||||
|
||||
(*) Allocate a new object [mandatory]:
|
||||
|
||||
	struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
					       struct fscache_cookie *cookie)
|
||||
|
||||
This method is used to allocate a cache object representation to back a
|
||||
cookie in a particular cache. fscache_object_init() should be called on
|
||||
the object to initialise it prior to returning.
|
||||
|
||||
This function may also be used to parse the index key to be used for
|
||||
multiple lookup calls to turn it into a more convenient form. FS-Cache
|
||||
will call the lookup_complete() method to allow the cache to release the
|
||||
form once lookup is complete or aborted.
|
||||
|
||||
|
||||
(*) Look up and create object [mandatory]:
|
||||
|
||||
void (*lookup_object)(struct fscache_object *object)
|
||||
|
||||
This method is used to look up an object, given that the object is already
|
||||
allocated and attached to the cookie. This should instantiate that object
|
||||
in the cache if it can.
|
||||
|
||||
The method should call fscache_object_lookup_negative() as soon as
|
||||
possible if it determines the object doesn't exist in the cache. If the
|
||||
object is found to exist and the netfs indicates that it is valid then
|
||||
fscache_obtained_object() should be called once the object is in a
|
||||
position to have data stored in it. Similarly, fscache_obtained_object()
|
||||
should also be called once a non-present object has been created.
|
||||
|
||||
If a lookup error occurs, fscache_object_lookup_error() should be called
|
||||
to abort the lookup of that object.
|
||||
|
||||
|
||||
(*) Release lookup data [mandatory]:
|
||||
|
||||
void (*lookup_complete)(struct fscache_object *object)
|
||||
|
||||
This method is called to ask the cache to release any resources it was
|
||||
using to perform a lookup.
|
||||
|
||||
|
||||
(*) Increment object refcount [mandatory]:
|
||||
|
||||
struct fscache_object *(*grab_object)(struct fscache_object *object)
|
||||
|
||||
This method is called to increment the reference count on an object. It
|
||||
may fail (for instance if the cache is being withdrawn) by returning NULL.
|
||||
It should return the object pointer if successful.
|
||||
|
||||
|
||||
(*) Lock/Unlock object [mandatory]:
|
||||
|
||||
void (*lock_object)(struct fscache_object *object)
|
||||
void (*unlock_object)(struct fscache_object *object)
|
||||
|
||||
These methods are used to exclusively lock an object. It must be possible
|
||||
to schedule with the lock held, so a spinlock isn't sufficient.
|
||||
|
||||
|
||||
(*) Pin/Unpin object [optional]:
|
||||
|
||||
int (*pin_object)(struct fscache_object *object)
|
||||
void (*unpin_object)(struct fscache_object *object)
|
||||
|
||||
These methods are used to pin an object into the cache. Once pinned an
|
||||
object cannot be reclaimed to make space. Return -ENOSPC if there's not
|
||||
enough space in the cache to permit this.
|
||||
|
||||
|
||||
(*) Update object [mandatory]:
|
||||
|
||||
int (*update_object)(struct fscache_object *object)
|
||||
|
||||
This is called to update the index entry for the specified object. The
|
||||
new information should be in object->cookie->netfs_data. This can be
|
||||
obtained by calling object->cookie->def->get_aux()/get_attr().
|
||||
|
||||
|
||||
(*) Discard object [mandatory]:
|
||||
|
||||
void (*drop_object)(struct fscache_object *object)
|
||||
|
||||
This method is called to indicate that an object has been unbound from its
|
||||
cookie, and that the cache should release the object's resources and
|
||||
retire it if it's in state FSCACHE_OBJECT_RECYCLING.
|
||||
|
||||
This method should not attempt to release any references held by the
|
||||
caller. The caller will invoke the put_object() method as appropriate.
|
||||
|
||||
|
||||
(*) Release object reference [mandatory]:
|
||||
|
||||
void (*put_object)(struct fscache_object *object)
|
||||
|
||||
This method is used to discard a reference to an object. The object may
|
||||
be freed when all the references to it are released.
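The reference counting rules for grab_object(), drop_object() and put_object()
can be sketched in userspace as below. The structure and function names are
hypothetical, and retiring the object on the final put is one possible reading
of the text rather than a statement of what any particular backend does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state { DEMO_OBJECT_ACTIVE, DEMO_OBJECT_RECYCLING };

struct demo_object {
	enum demo_state state;
	int usage;		/* reference count */
	bool cache_withdrawn;	/* the owning cache is being withdrawn */
	bool retired;		/* backing storage has been discarded */
	bool freed;		/* in-memory representation released */
};

/* Take a reference, or return NULL if the cache is going away. */
struct demo_object *demo_grab_object(struct demo_object *obj)
{
	if (obj->cache_withdrawn)
		return NULL;
	obj->usage++;
	return obj;
}

/* Drop a reference; on the last put, retire the object if recycling. */
void demo_put_object(struct demo_object *obj)
{
	assert(obj->usage > 0);
	if (--obj->usage > 0)
		return;
	if (obj->state == DEMO_OBJECT_RECYCLING)
		obj->retired = true;
	obj->freed = true;
}
```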
|
||||
|
||||
|
||||
(*) Synchronise a cache [mandatory]:
|
||||
|
||||
void (*sync)(struct fscache_cache *cache)
|
||||
|
||||
This is called to ask the backend to synchronise a cache with its backing
|
||||
device.
|
||||
|
||||
|
||||
(*) Dissociate a cache [mandatory]:
|
||||
|
||||
void (*dissociate_pages)(struct fscache_cache *cache)
|
||||
|
||||
This is called to ask a cache to perform any page dissociations as part of
|
||||
cache withdrawal.
|
||||
|
||||
|
||||
(*) Notification that the attributes on a netfs file changed [mandatory]:
|
||||
|
||||
int (*attr_changed)(struct fscache_object *object);
|
||||
|
||||
This is called to indicate to the cache that certain attributes on a netfs
|
||||
file have changed (for example the maximum size a file may reach). The
|
||||
cache can read these from the netfs by calling the cookie's get_attr()
|
||||
method.
|
||||
|
||||
The cache may use the file size information to reserve space on the cache.
|
||||
It should also call fscache_set_store_limit() to indicate to FS-Cache the
|
||||
highest byte it's willing to store for an object.
|
||||
|
||||
This method may return a negative error code if an error occurred or the cache object cannot
|
||||
be expanded. In such a case, the object will be withdrawn from service.
|
||||
|
||||
This operation is run asynchronously from FS-Cache's thread pool, and
|
||||
storage and retrieval operations from the netfs are excluded during the
|
||||
execution of this operation.
|
||||
|
||||
|
||||
(*) Reserve cache space for an object's data [optional]:
|
||||
|
||||
int (*reserve_space)(struct fscache_object *object, loff_t size);
|
||||
|
||||
This is called to request that cache space be reserved to hold the data
|
||||
for an object and the metadata used to track it. Zero size should be
|
||||
taken as a request to cancel a reservation.
|
||||
|
||||
This should return 0 if successful, -ENOSPC if there isn't enough space
|
||||
available, or -ENOMEM or -EIO on other errors.
|
||||
|
||||
The reservation may exceed the current size of the object, thus permitting
|
||||
future expansion. If the amount of space consumed by an object would
|
||||
exceed the reservation, it's permitted to refuse requests to allocate
|
||||
pages, but not required. An object may be pruned down to its reservation
|
||||
size if larger than that already.
|
||||
|
||||
|
||||
(*) Request page be read from cache [mandatory]:
|
||||
|
||||
	int (*read_or_alloc_page)(struct fscache_retrieval *op,
				  struct page *page,
				  gfp_t gfp)
|
||||
|
||||
This is called to attempt to read a netfs page from the cache, or to
|
||||
reserve a backing block if not. FS-Cache will have done as much checking
|
||||
as it can before calling, but most of the work belongs to the backend.
|
||||
|
||||
If there's no page in the cache, then -ENODATA should be returned if the
|
||||
backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
|
||||
didn't.
|
||||
|
||||
If there is suitable data in the cache, then a read operation should be
|
||||
queued and 0 returned. When the read finishes, fscache_end_io() should be
|
||||
called.
|
||||
|
||||
fscache_mark_pages_cached() should be called for the page if any cache
|
||||
metadata is retained. This will indicate to the netfs that the page needs
|
||||
explicit uncaching. This operation takes a pagevec, thus allowing several
|
||||
pages to be marked at once.
|
||||
|
||||
The retrieval record pointed to by op should be retained for each page
|
||||
queued and released when I/O on the page has been formally ended.
|
||||
fscache_get/put_retrieval() are available for this purpose.
|
||||
|
||||
The retrieval record may be used to get CPU time via the FS-Cache thread
|
||||
pool. If this is desired, the op->op.processor should be set to point to
|
||||
the appropriate processing routine, and fscache_enqueue_retrieval() should
|
||||
be called at an appropriate point to request CPU time. For instance, the
|
||||
retrieval routine could be enqueued upon the completion of a disk read.
|
||||
The to_do field in the retrieval record is provided to aid in this.
|
||||
|
||||
If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
|
||||
returned if possible or fscache_end_io() called with a suitable error
|
||||
code.
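The return codes described for read_or_alloc_page() can be summarised by the
following userspace sketch. struct demo_page_status and its fields are
hypothetical; a real backend would derive the same information from its own
metadata.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the backend found for one page. */
struct demo_page_status {
	bool data_present;	/* suitable data is already in the cache */
	bool block_reserved;	/* a backing block could be reserved */
	bool out_of_memory;	/* the reservation failed for lack of memory */
};

/* Map the backend's findings onto the documented return values. */
int demo_read_or_alloc_result(const struct demo_page_status *st)
{
	assert(st);

	if (st->data_present)
		return 0;		/* read queued; completion comes later */
	if (st->block_reserved)
		return -ENODATA;	/* nothing to read, but space set aside */
	return st->out_of_memory ? -ENOMEM : -ENOBUFS;
}
```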
|
||||
|
||||
|
||||
(*) Request pages be read from cache [mandatory]:
|
||||
|
||||
	int (*read_or_alloc_pages)(struct fscache_retrieval *op,
				   struct list_head *pages,
				   unsigned *nr_pages,
				   gfp_t gfp)
|
||||
|
||||
This is like the read_or_alloc_page() method, except it is handed a list
|
||||
of pages instead of one page. Any pages on which a read operation is
|
||||
started must be added to the page cache for the specified mapping and also
|
||||
to the LRU. Such pages must also be removed from the pages list and
|
||||
*nr_pages decremented per page.
|
||||
|
||||
If there was an error such as -ENOMEM, then that should be returned; else
|
||||
if one or more pages couldn't be read or allocated, then -ENOBUFS should
|
||||
be returned; else if one or more pages couldn't be read, then -ENODATA
|
||||
should be returned. If all the pages are dispatched then 0 should be
|
||||
returned.
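The precedence of return values for the multi-page case can be sketched as
follows. The per-page outcome enum and demo_pages_result() are hypothetical;
only the ordering of the checks is taken from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_page_outcome {
	DEMO_PAGE_DISPATCHED,	/* a read was started for this page */
	DEMO_PAGE_NODATA,	/* not read, but a block was allocated */
	DEMO_PAGE_NOBUFS,	/* neither read nor allocated */
};

/*
 * Combine per-page outcomes into one return value.  hard_error is 0 or a
 * negative error such as -ENOMEM, which takes precedence over everything.
 */
int demo_pages_result(const enum demo_page_outcome *outcomes, size_t n,
		      int hard_error)
{
	bool nobufs = false, nodata = false;
	size_t i;

	assert(outcomes != NULL || n == 0);

	if (hard_error)
		return hard_error;

	for (i = 0; i < n; i++) {
		if (outcomes[i] == DEMO_PAGE_NOBUFS)
			nobufs = true;
		else if (outcomes[i] == DEMO_PAGE_NODATA)
			nodata = true;
	}

	if (nobufs)
		return -ENOBUFS;
	if (nodata)
		return -ENODATA;
	return 0;
}
```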
|
||||
|
||||
|
||||
(*) Request page be allocated in the cache [mandatory]:
|
||||
|
||||
	int (*allocate_page)(struct fscache_retrieval *op,
			     struct page *page,
			     gfp_t gfp)
|
||||
|
||||
This is like the read_or_alloc_page() method, except that it shouldn't
|
||||
read from the cache, even if there's data there that could be retrieved.
|
||||
It should, however, set up any internal metadata required such that
|
||||
the write_page() method can write to the cache.
|
||||
|
||||
If there's no backing block available, then -ENOBUFS should be returned
|
||||
(or -ENOMEM if there were other problems). If a block is successfully
|
||||
allocated, then the netfs page should be marked and 0 returned.
|
||||
|
||||
|
||||
(*) Request pages be allocated in the cache [mandatory]:
|
||||
|
||||
	int (*allocate_pages)(struct fscache_retrieval *op,
			      struct list_head *pages,
			      unsigned *nr_pages,
			      gfp_t gfp)
|
||||
|
||||
This is a multiple-page version of the allocate_page() method. pages and
|
||||
nr_pages should be treated as for the read_or_alloc_pages() method.
|
||||
|
||||
|
||||
(*) Request page be written to cache [mandatory]:
|
||||
|
||||
	int (*write_page)(struct fscache_storage *op,
			  struct page *page);
|
||||
|
||||
This is called to write from a page on which there was a previously
|
||||
successful read_or_alloc_page() call or similar. FS-Cache filters out
|
||||
pages that don't have mappings.
|
||||
|
||||
This method is called asynchronously from the FS-Cache thread pool. It is
|
||||
not required to actually store anything, provided -ENODATA is then
|
||||
returned to the next read of this page.
|
||||
|
||||
If an error occurred, then a negative error code should be returned,
|
||||
otherwise zero should be returned. FS-Cache will take appropriate action
|
||||
in response to an error, such as withdrawing this object.
|
||||
|
||||
If this method returns success then FS-Cache will inform the netfs
|
||||
appropriately.
|
||||
|
||||
|
||||
(*) Discard retained per-page metadata [mandatory]:
|
||||
|
||||
void (*uncache_page)(struct fscache_object *object, struct page *page)
|
||||
|
||||
This is called when a netfs page is being evicted from the pagecache. The
|
||||
cache backend should tear down any internal representation or tracking it
|
||||
maintains for this page.
|
||||
|
||||
|
||||
==================
|
||||
FS-CACHE UTILITIES
|
||||
==================
|
||||
|
||||
FS-Cache provides some utilities that a cache backend may make use of:
|
||||
|
||||
(*) Note occurrence of an I/O error in a cache:
|
||||
|
||||
void fscache_io_error(struct fscache_cache *cache)
|
||||
|
||||
This tells FS-Cache that an I/O error occurred in the cache. After this
|
||||
has been called, only resource dissociation operations (object and page
|
||||
release) will be passed from the netfs to the cache backend for the
|
||||
specified cache.
|
||||
|
||||
This does not actually withdraw the cache. That must be done separately.
|
||||
|
||||
|
||||
(*) Invoke the retrieval I/O completion function:
|
||||
|
||||
	void fscache_end_io(struct fscache_retrieval *op, struct page *page,
			    int error);
|
||||
|
||||
This is called to note the end of an attempt to retrieve a page. The
|
||||
error value should be 0 if successful and an error otherwise.
|
||||
|
||||
|
||||
(*) Set highest store limit:
|
||||
|
||||
	void fscache_set_store_limit(struct fscache_object *object,
				     loff_t i_size);
|
||||
|
||||
This sets the limit FS-Cache imposes on the highest byte it's willing to
|
||||
try and store for a netfs. Any page over this limit is automatically
|
||||
rejected by fscache_read_or_alloc_page() and related functions with -ENOBUFS.
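One way to picture the store limit is as a page count derived from the file
size. The userspace sketch below assumes a 4096-byte page and assumes that a
partial final page is rounded up; both are illustrative assumptions, and the
demo_ names are hypothetical rather than FS-Cache functions.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_PAGE_SHIFT	12			/* assumed 4096-byte pages */
#define DEMO_PAGE_SIZE	(1ULL << DEMO_PAGE_SHIFT)

struct demo_store_object {
	unsigned long long store_limit;		/* in pages */
};

/* Convert a size in bytes into a page limit, rounding a partial page up. */
void demo_set_store_limit(struct demo_store_object *obj, long long i_size)
{
	unsigned long long size;

	assert(i_size >= 0);
	size = (unsigned long long)i_size;
	obj->store_limit = (size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Pages at or beyond the limit are refused with -ENOBUFS. */
int demo_check_page(const struct demo_store_object *obj,
		    unsigned long long index)
{
	return index < obj->store_limit ? 0 : -ENOBUFS;
}
```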
|
||||
|
||||
|
||||
(*) Mark pages as being cached:
|
||||
|
||||
	void fscache_mark_pages_cached(struct fscache_retrieval *op,
				       struct pagevec *pagevec);
|
||||
|
||||
This marks a set of pages as being cached. After this has been called,
|
||||
the netfs must call fscache_uncache_page() to unmark the pages.
|
||||
|
||||
|
||||
(*) Perform coherency check on an object:
|
||||
|
||||
	enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
						const void *data,
						uint16_t datalen);
|
||||
|
||||
This asks the netfs to perform a coherency check on an object that has
|
||||
just been looked up. The cookie attached to the object will determine the
|
||||
netfs to use. data and datalen should specify where the auxiliary data
|
||||
retrieved from the cache can be found.
|
||||
|
||||
One of three values will be returned:
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OKAY
|
||||
|
||||
The coherency data indicates the object is valid as is.
|
||||
|
||||
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE
|
||||
|
||||
The coherency data needs updating, but otherwise the object is
|
||||
valid.
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OBSOLETE
|
||||
|
||||
The coherency data indicates that the object is obsolete and should
|
||||
be discarded.
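A backend's reaction to the three results might be organised as in the
following userspace sketch. The demo_ enums mirror the values listed above but
are hypothetical stand-ins, as is the choice of actions; what a backend
actually does in each case is up to the backend.

```c
#include <assert.h>

/* Stand-ins for the three documented coherency results. */
enum demo_checkaux {
	DEMO_CHECKAUX_OKAY,
	DEMO_CHECKAUX_NEEDS_UPDATE,
	DEMO_CHECKAUX_OBSOLETE,
};

/* Hypothetical actions a backend might take after a lookup. */
enum demo_action {
	DEMO_USE_OBJECT,	/* object is valid as is */
	DEMO_UPDATE_THEN_USE,	/* rewrite the coherency data, then use it */
	DEMO_DISCARD_OBJECT,	/* delete the object and start afresh */
};

enum demo_action demo_handle_checkaux(enum demo_checkaux result)
{
	assert(result <= DEMO_CHECKAUX_OBSOLETE);

	switch (result) {
	case DEMO_CHECKAUX_OKAY:
		return DEMO_USE_OBJECT;
	case DEMO_CHECKAUX_NEEDS_UPDATE:
		return DEMO_UPDATE_THEN_USE;
	case DEMO_CHECKAUX_OBSOLETE:
	default:
		return DEMO_DISCARD_OBJECT;
	}
}
```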
|
||||
|
||||
|
||||
(*) Initialise a freshly allocated object:
|
||||
|
||||
void fscache_object_init(struct fscache_object *object);
|
||||
|
||||
This initialises all the fields in an object representation.
|
||||
|
||||
|
||||
(*) Indicate the destruction of an object:
|
||||
|
||||
void fscache_object_destroyed(struct fscache_cache *cache);
|
||||
|
||||
This must be called to inform FS-Cache that an object that belonged to a
|
||||
cache has been destroyed and deallocated. This will allow continuation
|
||||
of the cache withdrawal process when it is stopped pending destruction of
|
||||
all the objects.
|
||||
|
||||
|
||||
(*) Indicate negative lookup on an object:
|
||||
|
||||
void fscache_object_lookup_negative(struct fscache_object *object);
|
||||
|
||||
This is called to indicate to FS-Cache that a lookup process for an object
|
||||
found a negative result.
|
||||
|
||||
This changes the state of an object to permit reads pending on lookup
|
||||
completion to go off and start fetching data from the netfs server as it's
|
||||
known at this point that there can't be any data in the cache.
|
||||
|
||||
This may be called multiple times on an object. Only the first call is
|
||||
significant - all subsequent calls are ignored.
|
||||
|
||||
|
||||
(*) Indicate an object has been obtained:
|
||||
|
||||
void fscache_obtained_object(struct fscache_object *object);
|
||||
|
||||
This is called to indicate to FS-Cache that a lookup process for an object
|
||||
produced a positive result, or that an object was created. This should
|
||||
only be called once for any particular object.
|
||||
|
||||
This changes the state of an object to indicate:
|
||||
|
||||
(1) if no call to fscache_object_lookup_negative() has been made on
|
||||
this object, that there may be data available, and that reads can
|
||||
now go and look for it; and
|
||||
|
||||
(2) that writes may now proceed against this object.
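The interaction between the negative-lookup and obtained-object notifications
can be modelled in userspace as below. The flags in struct demo_lookup_state
are hypothetical and merely name the effects described above; they do not
correspond to real FS-Cache object states.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_lookup_state {
	bool negative_seen;		/* a negative lookup was reported */
	bool obtained;			/* the object was found or created */
	bool reads_go_to_server;	/* pending reads may fetch from netfs */
	bool reads_check_cache;		/* reads may look for data in the cache */
	bool writes_allowed;		/* writes may proceed against the object */
};

/* Only the first call has any effect; later calls are ignored. */
void demo_lookup_negative(struct demo_lookup_state *s)
{
	if (s->negative_seen)
		return;
	s->negative_seen = true;
	s->reads_go_to_server = true;
}

/* Must be called at most once per object. */
void demo_obtained_object(struct demo_lookup_state *s)
{
	assert(!s->obtained);
	s->obtained = true;
	if (!s->negative_seen)
		s->reads_check_cache = true;
	s->writes_allowed = true;
}
```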
|
||||
|
||||
|
||||
(*) Indicate that object lookup failed:
|
||||
|
||||
void fscache_object_lookup_error(struct fscache_object *object);
|
||||
|
||||
This marks an object as having encountered a fatal error (usually EIO)
|
||||
and causes it to move into a state whereby it will be withdrawn as soon
|
||||
as possible.
|
||||
|
||||
|
||||
(*) Get and release references on a retrieval record:
|
||||
|
||||
void fscache_get_retrieval(struct fscache_retrieval *op);
|
||||
void fscache_put_retrieval(struct fscache_retrieval *op);
|
||||
|
||||
These two functions are used to retain a retrieval record whilst doing
|
||||
asynchronous data retrieval and block allocation.
|
||||
|
||||
|
||||
(*) Enqueue a retrieval record for processing:
|
||||
|
||||
void fscache_enqueue_retrieval(struct fscache_retrieval *op);
|
||||
|
||||
This enqueues a retrieval record for processing by the FS-Cache thread
|
||||
pool. One of the threads in the pool will invoke the retrieval record's
|
||||
op->op.processor callback function. This function may be called from
|
||||
within the callback function.
|
||||
|
||||
|
||||
(*) List of object state names:
|
||||
|
||||
const char *fscache_object_states[];
|
||||
|
||||
For debugging purposes, this may be used to turn the state that an object
|
||||
is in into a text string for display purposes.
|
501
Documentation/filesystems/caching/cachefiles.txt
Normal file
@@ -0,0 +1,501 @@
|
||||
===============================================
|
||||
CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
|
||||
===============================================
|
||||
|
||||
Contents:
|
||||
|
||||
(*) Overview.
|
||||
|
||||
(*) Requirements.
|
||||
|
||||
(*) Configuration.
|
||||
|
||||
(*) Starting the cache.
|
||||
|
||||
(*) Things to avoid.
|
||||
|
||||
(*) Cache culling.
|
||||
|
||||
(*) Cache structure.
|
||||
|
||||
(*) Security model and SELinux.
|
||||
|
||||
(*) A note on security.
|
||||
|
||||
(*) Statistical information.
|
||||
|
||||
(*) Debugging.
|
||||
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
========
|
||||
|
||||
CacheFiles is a caching backend that uses a directory on an already mounted
local filesystem (such as Ext3) as its cache.
|
||||
|
||||
CacheFiles uses a userspace daemon to do some of the cache management - such as
|
||||
reaping stale nodes and culling. This is called cachefilesd and lives in
|
||||
/sbin.
|
||||
|
||||
The filesystem and data integrity of the cache are only as good as those of the
|
||||
filesystem providing the backing services. Note that CacheFiles does not
|
||||
attempt to journal anything since the journalling interfaces of the various
|
||||
filesystems are very specific in nature.
|
||||
|
||||
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
|
||||
to communicate with the daemon. Only one thing may have this open at once,
|
||||
and whilst it is open, a cache is at least partially in existence. The daemon
|
||||
opens this and sends commands down it to control the cache.
|
||||
|
||||
CacheFiles is currently limited to a single cache.
|
||||
|
||||
CacheFiles attempts to maintain at least a certain percentage of free space on
|
||||
the filesystem, shrinking the cache by culling the objects it contains to make
|
||||
space if necessary - see the "Cache Culling" section. This means it can be
|
||||
placed on the same medium as a live set of data, and will expand to make use of
|
||||
spare space and automatically contract when the set of data requires more
|
||||
space.
|
||||
|
||||
|
||||
============
|
||||
REQUIREMENTS
|
||||
============
|
||||
|
||||
The use of CacheFiles and its daemon requires the following features to be
|
||||
available in the system and in the cache filesystem:
|
||||
|
||||
- dnotify.
|
||||
|
||||
- extended attributes (xattrs).
|
||||
|
||||
- openat() and friends.
|
||||
|
||||
- bmap() support on files in the filesystem (FIBMAP ioctl).
|
||||
|
||||
- The use of bmap() to detect a partial page at the end of the file.
|
||||
|
||||
It is strongly recommended that the "dir_index" option is enabled on Ext3
|
||||
filesystems being used as a cache.
|
||||
|
||||
|
||||
=============
|
||||
CONFIGURATION
|
||||
=============
|
||||
|
||||
The cache is configured by a script in /etc/cachefilesd.conf. These commands
|
||||
set up the cache ready for use. The following script commands are available:
|
||||
|
||||
(*) brun <N>%
|
||||
(*) bcull <N>%
|
||||
(*) bstop <N>%
|
||||
(*) frun <N>%
|
||||
(*) fcull <N>%
|
||||
(*) fstop <N>%
|
||||
|
||||
Configure the culling limits. Optional. See the section on culling.
|
||||
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
|
||||
|
||||
The commands beginning with a 'b' are file space (block) limits, those
|
||||
beginning with an 'f' are file count limits.
|
||||
|
||||
(*) dir <path>
|
||||
|
||||
Specify the directory containing the root of the cache. Mandatory.
|
||||
|
||||
(*) tag <name>
|
||||
|
||||
Specify a tag to FS-Cache to use in distinguishing multiple caches.
|
||||
Optional. The default is "CacheFiles".
|
||||
|
||||
(*) debug <mask>
|
||||
|
||||
Specify a numeric bitmask to control debugging in the kernel module.
|
||||
Optional. The default is zero (all off). The following values can be
|
||||
OR'd into the mask to collect various information:
|
||||
|
||||
1 Turn on trace of function entry (_enter() macros)
|
||||
2 Turn on trace of function exit (_leave() macros)
|
||||
4 Turn on trace of internal debug points (_debug())
|
||||
|
||||
This mask can also be set through sysfs, eg:
|
||||
|
||||
echo 5 >/sys/module/cachefiles/parameters/debug
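The value 5 used in the example is 1 | 4, that is, function entry tracing plus
internal debug points. The following userspace sketch shows how such a mask
can be composed; the DEMO_ constant names and demo_debug_mask() are
hypothetical, and only the bit values are taken from the table above.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as listed for the debug mask; the names are illustrative. */
enum {
	DEMO_DEBUG_ENTER = 1,	/* trace function entry */
	DEMO_DEBUG_LEAVE = 2,	/* trace function exit */
	DEMO_DEBUG_POINT = 4,	/* trace internal debug points */
};

/* OR together the bits for the kinds of tracing that are wanted. */
unsigned int demo_debug_mask(bool enter, bool leave, bool point)
{
	unsigned int mask = 0;

	if (enter)
		mask |= DEMO_DEBUG_ENTER;
	if (leave)
		mask |= DEMO_DEBUG_LEAVE;
	if (point)
		mask |= DEMO_DEBUG_POINT;

	assert((mask & ~7u) == 0);	/* only the three documented bits */
	return mask;
}
```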
|
||||
|
||||
|
||||
==================
|
||||
STARTING THE CACHE
|
||||
==================
|
||||
|
||||
The cache is started by running the daemon. The daemon opens the cache device,
|
||||
configures the cache and tells it to begin caching. At that point the cache
|
||||
binds to fscache and the cache becomes live.
|
||||
|
||||
The daemon is run as follows:
|
||||
|
||||
/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
|
||||
|
||||
The flags are:
|
||||
|
||||
(*) -d
|
||||
|
||||
Increase the debugging level. This can be specified multiple times and
|
||||
is cumulative with itself.
|
||||
|
||||
(*) -s
|
||||
|
||||
Send messages to stderr instead of syslog.
|
||||
|
||||
(*) -n
|
||||
|
||||
Don't daemonise and go into background.
|
||||
|
||||
(*) -f <configfile>
|
||||
|
||||
Use an alternative configuration file rather than the default one.
|
||||
|
||||
|
||||
===============
|
||||
THINGS TO AVOID
|
||||
===============
|
||||
|
||||
Do not mount other things within the cache as this will cause problems. The
|
||||
kernel module contains its own very cut-down path walking facility that ignores
|
||||
mountpoints, but the daemon can't avoid them.
|
||||
|
||||
Do not create, rename or unlink files and directories in the cache whilst the
|
||||
cache is active, as this may cause the state to become uncertain.
|
||||
|
||||
Renaming files in the cache might make objects appear to be other objects (the
|
||||
filename is part of the lookup key).
|
||||
|
||||
Do not change or remove the extended attributes attached to cache files by the
|
||||
cache as this will cause the cache state management to get confused.
|
||||
|
||||
Do not create files or directories in the cache, lest the cache get confused or
|
||||
serve incorrect data.
|
||||
|
||||
Do not chmod files in the cache. The module creates things with minimal
|
||||
permissions to prevent random users being able to access them directly.
|
||||
|
||||
|
||||
=============
|
||||
CACHE CULLING
|
||||
=============
|
||||
|
||||
The cache may need culling occasionally to make space. This involves
|
||||
discarding objects from the cache that have been used less recently than
|
||||
anything else. Culling is based on the access time of data objects. Empty
|
||||
directories are culled if not in use.
|
||||
|
||||
Cache culling is done on the basis of the percentage of blocks and the
|
||||
percentage of files available in the underlying filesystem. There are six
|
||||
"limits":
|
||||
|
||||
(*) brun
|
||||
(*) frun
|
||||
|
||||
If the amount of free space and the number of available files in the cache
|
||||
rise above both these limits, then culling is turned off.
|
||||
|
||||
(*) bcull
|
||||
(*) fcull
|
||||
|
||||
If the amount of available space or the number of available files in the
|
||||
cache falls below either of these limits, then culling is started.
|
||||
|
||||
(*) bstop
|
||||
(*) fstop
|
||||
|
||||
If the amount of available space or the number of available files in the
|
||||
cache falls below either of these limits, then no further allocation of
|
||||
disk space or files is permitted until culling has raised things above
|
||||
these limits again.
|
||||
|
||||
These must be configured thusly:
|
||||
|
||||
0 <= bstop < bcull < brun < 100
|
||||
0 <= fstop < fcull < frun < 100
|
||||
|
||||
Note that these are percentages of available space and available files, and do
|
||||
_not_ appear as 100 minus the percentage displayed by the "df" program.
|
||||
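The interaction of the six limits can be sketched as a small state machine.
This is a toy model written from the description above, not code taken from
the module or the daemon; the class and method names are invented for the
example, and the inputs are the percentages of blocks and files still
available in the backing filesystem.

```python
def check_limits(stop, cull, run):
    """Return True if 0 <= stop < cull < run < 100, as required above."""
    return 0 <= stop < cull < run < 100


class CullState:
    """Toy model of the culling thresholds (names are illustrative only)."""

    def __init__(self, brun=7, bcull=5, bstop=1, frun=7, fcull=5, fstop=1):
        if not (check_limits(bstop, bcull, brun)
                and check_limits(fstop, fcull, frun)):
            raise ValueError("need 0 <= stop < cull < run < 100")
        self.b = (bstop, bcull, brun)
        self.f = (fstop, fcull, frun)
        self.culling = False

    def update(self, blocks_avail_pct, files_avail_pct):
        """Feed in current availability; return (culling, allocation_allowed)."""
        bstop, bcull, brun = self.b
        fstop, fcull, frun = self.f
        if blocks_avail_pct < bcull or files_avail_pct < fcull:
            self.culling = True       # below either cull limit: start culling
        elif blocks_avail_pct > brun and files_avail_pct > frun:
            self.culling = False      # above both run limits: stop culling
        # Between the cull and run limits the previous state is kept.
        allowed = not (blocks_avail_pct < bstop or files_avail_pct < fstop)
        return self.culling, allowed


state = CullState()
print(state.update(50, 50))    # plenty of room
print(state.update(4, 50))     # blocks below bcull
print(state.update(6, 50))     # between bcull and brun
print(state.update(0.5, 50))   # blocks below bstop
print(state.update(8, 8))      # above both run limits
```

The gap between the cull and run limits gives hysteresis: once culling has
started it continues until both resources are comfortably above the run
limits, rather than toggling around a single threshold.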
|
||||
The userspace daemon scans the cache to build up a table of cullable objects.
|
||||
These are then culled in least recently used order. A new scan of the cache is
|
||||
started as soon as space is made in the table. Objects will be skipped if
|
||||
their atimes have changed or if the kernel module says it is still using them.
|
||||
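A minimal sketch of that selection step, assuming the scan has already
produced (name, atime) pairs: objects are taken oldest first and skipped if
their atime has moved on since the scan or if they are reported as in use.
The two callables stand in for a stat() of the object and a query to the
kernel module; both are placeholders for this example rather than real
interfaces.

```python
import heapq


def pick_cull_order(objects, in_use, current_atime):
    """Yield names least recently used first, skipping busy or touched ones.

    objects:       iterable of (name, atime) pairs gathered by a scan
    in_use:        callable(name) -> bool, stands in for asking the module
    current_atime: callable(name) -> atime as it is now
    """
    table = [(atime, name) for name, atime in objects]
    heapq.heapify(table)                 # smallest atime comes out first
    while table:
        atime, name = heapq.heappop(table)
        if current_atime(name) != atime:
            continue                     # accessed since the scan
        if in_use(name):
            continue                     # still in use, leave it alone
        yield name


scanned = [("a", 100), ("b", 50), ("c", 75), ("d", 60)]
now = {"a": 100, "b": 50, "c": 90, "d": 60}   # "c" was read after the scan
busy = {"b"}                                   # "b" is still in use
print(list(pick_cull_order(scanned, busy.__contains__, now.__getitem__)))
```

A heap keeps the table ordered by access time without a full sort on every
removal, which suits a table that is refilled by repeated scans.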
|
||||
|
||||
===============
|
||||
CACHE STRUCTURE
|
||||
===============
|
||||
|
||||
The CacheFiles module will create two directories in the directory it was
|
||||
given:
|
||||
|
||||
(*) cache/
|
||||
|
||||
(*) graveyard/
|
||||
|
||||
The active cache objects all reside in the first directory. The CacheFiles
|
||||
kernel module moves any retired or culled objects that it can't simply unlink
|
||||
to the graveyard from which the daemon will actually delete them.
|
||||
|
||||
The daemon uses dnotify to monitor the graveyard directory, and will delete
|
||||
anything that appears therein.
|
||||
|
||||
|
||||
The module represents index objects as directories with the filename "I..." or
|
||||
"J...". Note that the "cache/" directory is itself a special index.
|
||||
|
||||
Data objects are represented as files if they have no children, or directories
|
||||
if they do. Their filenames all begin "D..." or "E...". If represented as a
|
||||
directory, data objects will have a file in the directory called "data" that
|
||||
actually holds the data.
|
||||
|
||||
Special objects are similar to data objects, except their filenames begin
|
||||
"S..." or "T...".
|
||||
|
||||
|
||||
If an object has children, then it will be represented as a directory.
|
||||
Immediately in the representative directory are a collection of directories
|
||||
named for hash values of the child object keys with an '@' prepended. Into
|
||||
this directory, if possible, will be placed the representations of the child
|
||||
objects:
|
||||
|
||||
INDEX     INDEX      INDEX                             DATA FILES
|
||||
========= ========== ================================= ================
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
|
||||
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
|
||||
|
||||
|
||||
If the key is so long that it exceeds NAME_MAX with the decorations added on to
|
||||
it, then it will be cut into pieces, the first few of which will be used to
|
||||
make a nest of directories, and the last one of which will be the objects
|
||||
inside the last directory. The names of the intermediate directories will have
|
||||
'+' prepended:
|
||||
|
||||
J1223/@23/+xy...z/+kl...m/Epqr
|
||||
|
||||
|
||||
Note that keys are raw data, and not only may they exceed NAME_MAX in size,
|
||||
they may also contain things like '/' and NUL characters, and so they may not
|
||||
be suitable for turning directly into a filename.
|
||||
|
||||
To handle this, CacheFiles will use a suitably printable filename directly and
|
||||
"base-64" encode ones that aren't directly suitable. The two versions of
|
||||
object filenames indicate the encoding:
|
||||
|
||||
OBJECT TYPE     PRINTABLE       ENCODED
=============== =============== ===============
Index           "I..."          "J..."
Data            "D..."          "E..."
Special         "S..."          "T..."
|
||||
|
||||
Intermediate directories are always "@" or "+" as appropriate.
|
||||
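The naming rules can be summarised by a small classifier that looks only at
the first character of each path component. This is an illustrative sketch
based on the table above; the function name and the returned descriptions are
invented for the example, and it does not attempt to decode the keys
themselves.

```python
PREFIXES = {
    "I": ("index", "printable"),
    "J": ("index", "encoded"),
    "D": ("data", "printable"),
    "E": ("data", "encoded"),
    "S": ("special", "printable"),
    "T": ("special", "encoded"),
}


def classify_component(name):
    """Describe one path component found below cache/."""
    if not name:
        raise ValueError("empty component")
    if name == "data":
        # File holding the contents of a data object stored as a directory.
        return ("data contents", None)
    first = name[0]
    if first == "@":
        return ("hash directory", None)
    if first == "+":
        return ("key continuation directory", None)
    if first in PREFIXES:
        return PREFIXES[first]
    return ("unrecognised", None)


path = "cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry"
for part in path.split("/")[1:]:
    print(part, classify_component(part))
```

Anything that comes back as "unrecognised" is the kind of entry that, as noted
later in this document, CacheFiles will erase from the cache.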
|
||||
|
||||
Each object in the cache has an extended attribute label that holds the object
|
||||
type ID (required to distinguish special objects) and the auxiliary data from
|
||||
the netfs. The latter is used to detect stale objects in the cache and update
|
||||
or retire them.
|
||||
|
||||
|
||||
Note that CacheFiles will erase from the cache any file it doesn't recognise or
|
||||
any file of an incorrect type (such as a FIFO file or a device file).
|
||||
|
||||
|
||||
==========================
|
||||
SECURITY MODEL AND SELINUX
|
||||
==========================
|
||||
|
||||
CacheFiles is implemented to deal properly with the LSM security features of
|
||||
the Linux kernel and the SELinux facility.
|
||||
|
||||
One of the problems that CacheFiles faces is that it is generally acting on
|
||||
behalf of a process, and running in that process's context, and that includes a
|
||||
security context that is not appropriate for accessing the cache - either
|
||||
because the files in the cache are inaccessible to that process, or because if
|
||||
the process creates a file in the cache, that file may be inaccessible to other
|
||||
processes.
|
||||
|
||||
The way CacheFiles works is to temporarily change the security context (fsuid,
|
||||
fsgid and actor security label) that the process acts as - without changing the
|
||||
security context of the process when it is the target of an operation performed by
|
||||
some other process (so signalling and suchlike still work correctly).
|
||||
|
||||
|
||||
When the CacheFiles module is asked to bind to its cache, it:
|
||||
|
||||
(1) Finds the security label attached to the root cache directory and uses
|
||||
that as the security label with which it will create files. By default,
|
||||
this is:
|
||||
|
||||
cachefiles_var_t
|
||||
|
||||
(2) Finds the security label of the process which issued the bind request
|
||||
(presumed to be the cachefilesd daemon), which by default will be:
|
||||
|
||||
cachefilesd_t
|
||||
|
||||
and asks LSM to supply a security ID as which it should act given the
|
||||
daemon's label. By default, this will be:
|
||||
|
||||
cachefiles_kernel_t
|
||||
|
||||
SELinux transitions the daemon's security ID to the module's security ID
|
||||
based on a rule of this form in the policy.
|
||||
|
||||
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
|
||||
|
||||
For instance:
|
||||
|
||||
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
|
||||
|
||||
|
||||
The module's security ID gives it permission to create, move and remove files
|
||||
and directories in the cache, to find and access directories and files in the
|
||||
cache, to set and access extended attributes on cache objects, and to read and
|
||||
write files in the cache.
|
||||
|
||||
The daemon's security ID gives it only a very restricted set of permissions: it
|
||||
may scan directories, stat files and erase files and directories. It may
|
||||
not read or write files in the cache, and so it is precluded from accessing the
|
||||
data cached therein; nor is it permitted to create new files in the cache.
|
||||
|
||||
|
||||
There are policy source files available in:
|
||||
|
||||
http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
|
||||
|
||||
and later versions. In that tarball, see the files:
|
||||
|
||||
cachefilesd.te
|
||||
cachefilesd.fc
|
||||
cachefilesd.if
|
||||
|
||||
They are built and installed directly by the RPM.
|
||||
|
||||
If a non-RPM based system is being used, then copy the above files to their own
|
||||
directory and run:
|
||||
|
||||
make -f /usr/share/selinux/devel/Makefile
|
||||
semodule -i cachefilesd.pp
|
||||
|
||||
You will need checkpolicy and selinux-policy-devel installed prior to the
|
||||
build.
|
||||
|
||||
|
||||
By default, the cache is located in /var/fscache, but if it is desirable that
|
||||
it should be elsewhere, then either the above policy files must be altered, or
|
||||
an auxiliary policy must be installed to label the alternate location of the
|
||||
cache.
|
||||
|
||||
For instructions on how to add an auxiliary policy to enable the cache to be
|
||||
located elsewhere when SELinux is in enforcing mode, please see:
|
||||
|
||||
/usr/share/doc/cachefilesd-*/move-cache.txt
|
||||
|
||||
when the cachefilesd rpm is installed; alternatively, the document can be found
|
||||
in the sources.
|
||||
|
||||
|
||||
==================
|
||||
A NOTE ON SECURITY
|
||||
==================
|
||||
|
||||
CacheFiles makes use of the split security in the task_struct. It allocates
|
||||
its own task_security structure, and redirects current->act_as to point to it
|
||||
when it acts on behalf of another process, in that process's context.
|
||||
|
||||
The reason it does this is that it calls vfs_mkdir() and suchlike rather than
|
||||
bypassing security and calling inode ops directly. Therefore the VFS and LSM
|
||||
may deny CacheFiles access to the cache data because under some
|
||||
circumstances the caching code is running in the security context of whatever
|
||||
process issued the original syscall on the netfs.
|
||||
|
||||
Furthermore, should CacheFiles create a file or directory, the security
|
||||
parameters with which that object is created (UID, GID, security label) would be
derived from the process that issued the system call, thus potentially
|
||||
preventing other processes from accessing the cache - including CacheFiles's
|
||||
cache management daemon (cachefilesd).
|
||||
|
||||
What is required is to temporarily override the security of the process that
|
||||
issued the system call. We can't, however, just do an in-place change of the
|
||||
security data as that affects the process as an object, not just as a subject.
|
||||
This means it may lose signals or ptrace events for example, and affects what
|
||||
the process looks like in /proc.
|
||||
|
||||
So CacheFiles makes use of a logical split in the security between the
|
||||
objective security (task->sec) and the subjective security (task->act_as). The
|
||||
objective security holds the intrinsic security properties of a process and is
|
||||
never overridden. This is what appears in /proc, and is what is used when a
|
||||
process is the target of an operation by some other process (SIGKILL for
|
||||
example).
|
||||
|
||||
The subjective security holds the active security properties of a process, and
|
||||
may be overridden. This is not seen externally, and is used when a process
|
||||
acts upon another object, for example SIGKILLing another process or opening a
|
||||
file.
|
||||
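The split can be pictured with a toy model in which a task carries two
labels and only the subjective one is swapped for the duration of a cache
operation. This is a userspace Python sketch for illustration only: the Task
class, override_subjective() and the "user_t" label are invented here and do
not correspond to kernel structures or functions.

```python
from contextlib import contextmanager


class Task:
    """Toy stand-in for a task with split security (not the kernel struct)."""

    def __init__(self, label):
        self.sec = label       # objective: used when others act on this task
        self.act_as = label    # subjective: used when this task acts on others


@contextmanager
def override_subjective(task, label):
    """Temporarily act under another label; the objective label is untouched."""
    saved = task.act_as
    task.act_as = label
    try:
        yield task
    finally:
        task.act_as = saved    # always restored, even if the operation fails


proc = Task("user_t")          # hypothetical label of the calling process
with override_subjective(proc, "cachefiles_kernel_t"):
    print(proc.sec, proc.act_as)   # objective unchanged, subjective overridden
print(proc.sec, proc.act_as)       # both are the process's own label again
```

Because proc.sec never changes, anything that targets the process, such as
signal delivery or what is shown in /proc, behaves the same whether or not an
override is in force.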
|
||||
LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
|
||||
for CacheFiles to run in a context of a specific security label, or to create
|
||||
files and directories with another security label.
|
||||
|
||||
|
||||
=======================
|
||||
STATISTICAL INFORMATION
|
||||
=======================
|
||||
|
||||
If CacheFiles is compiled with the following option enabled:
|
||||
|
||||
CONFIG_CACHEFILES_HISTOGRAM=y
|
||||
|
||||
then it will gather certain statistics and display them through a proc file.
|
||||
|
||||
(*) /proc/fs/cachefiles/histogram
|
||||
|
||||
cat /proc/fs/cachefiles/histogram
JIFS  SECS  LOOKUPS   MKDIRS    CREATES
===== ===== ========= ========= =========
|
||||
|
||||
This shows the breakdown of the number of times each amount of time
|
||||
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
|
||||
columns are as follows:
|
||||
|
||||
COLUMN          TIME MEASUREMENT
=======         =======================================================
LOOKUPS         Length of time to perform a lookup on the backing fs
MKDIRS          Length of time to perform a mkdir on the backing fs
CREATES         Length of time to perform a create on the backing fs
|
||||
|
||||
Each row shows the number of events that took a particular range of times.
|
||||
Each step is 1 jiffy in size. The JIFS column indicates the particular
|
||||
jiffy range covered, and the SECS field the equivalent number of seconds.
|
||||
|
||||
|
||||
=========
|
||||
DEBUGGING
|
||||
=========
|
||||
|
||||
If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
|
||||
debugging enabled by adjusting the value in:
|
||||
|
||||
/sys/module/cachefiles/parameters/debug
|
||||
|
||||
This is a bitmask of debugging streams to enable:
|
||||
|
||||
BIT     VALUE   STREAM                          POINT
======= ======= =============================== =======================
0       1       General                         Function entry trace
1       2                                       Function exit trace
2       4                                       General
|
||||
|
||||
The appropriate set of values should be OR'd together and the result written to
|
||||
the control file. For example:
|
||||
|
||||
echo $((1|2)) >/sys/module/cachefiles/parameters/debug

will turn on function entry and exit tracing.
|
333
Documentation/filesystems/caching/fscache.txt
Normal file
@ -0,0 +1,333 @@
|
||||
==========================
|
||||
General Filesystem Caching
|
||||
==========================
|
||||
|
||||
========
|
||||
OVERVIEW
|
||||
========
|
||||
|
||||
This facility is a general purpose cache for network filesystems, though it
|
||||
could be used for caching other things such as ISO9660 filesystems too.
|
||||
|
||||
FS-Cache mediates between cache backends (such as CacheFS) and network
|
||||
filesystems:
|
||||
|
||||
+---------+
|         |                        +--------------+
|   NFS   |--+                     |              |
|         |  |                 +-->|   CacheFS    |
+---------+  |   +----------+  |   |  /dev/hda5   |
             |   |          |  |   +--------------+
+---------+  +-->|          |  |
|         |      |          |--+
|   AFS   |----->| FS-Cache |
|         |      |          |--+
+---------+  +-->|          |  |
             |   |          |  |   +--------------+
+---------+  |   +----------+  |   |              |
|         |  |                 +-->|  CacheFiles  |
|  ISOFS  |--+                     |  /var/cache  |
|         |                        +--------------+
+---------+
|
||||
|
||||
Or to look at it another way, FS-Cache is a module that provides a caching
|
||||
facility to a network filesystem such that the cache is transparent to the
|
||||
user:
|
||||
|
||||
+---------+
|         |
| Server  |
|         |
+---------+
     |                  NETWORK
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     |
     |           +----------+
     V           |          |
+---------+      |          |
|         |      |          |
|   NFS   |----->| FS-Cache |
|         |      |          |--+
+---------+      |          |  |   +--------------+   +--------------+
     |           |          |  |   |              |   |              |
     V           +----------+  +-->|  CacheFiles  |-->|  Ext3        |
+---------+                        |  /var/cache  |   |  /dev/sda6   |
|         |                        +--------------+   +--------------+
|   VFS   |                                ^                     ^
|         |                                |                     |
+---------+                                +--------------+      |
     |                  KERNEL SPACE                      |      |
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
     |                  USER SPACE                        |      |
     V                                                    |      |
+---------+                                           +--------------+
|         |                                           |              |
| Process |                                           | cachefilesd  |
|         |                                           |              |
+---------+                                           +--------------+
|
||||
|
||||
|
||||
FS-Cache does not follow the idea of completely loading every netfs file
|
||||
opened in its entirety into a cache before permitting it to be accessed and
|
||||
then serving the pages out of that cache rather than the netfs inode because:
|
||||
|
||||
(1) It must be practical to operate without a cache.
|
||||
|
||||
(2) The size of any accessible file must not be limited to the size of the
|
||||
cache.
|
||||
|
||||
(3) The combined size of all opened files (this includes mapped libraries)
|
||||
must not be limited to the size of the cache.
|
||||
|
||||
(4) The user should not be forced to download an entire file just to do a
|
||||
one-off access of a small portion of it (such as might be done with the
|
||||
"file" program).
|
||||
|
||||
It instead serves the cache out in PAGE_SIZE chunks as and when requested by
|
||||
the netfs('s) using it.
|
||||
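To make the page-granular behaviour concrete, the sketch below works out which
page indices a byte range of a file touches; only those pages would need to be
looked up in the cache. The function name is invented for the example, and a
PAGE_SIZE of 4096 is an assumption, since the real value depends on the
architecture.

```python
def pages_for_range(offset, length, page_size=4096):
    """Return the page indices touched by bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))


print(pages_for_range(0, 100))       # a small read at the start: one page
print(pages_for_range(4000, 200))    # a read straddling a page boundary
print(pages_for_range(8192, 4096))   # exactly the third page
```

A one-off read of the first few bytes of a large file therefore involves a
single page, which is the case item (4) above is concerned with.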
|
||||
|
||||
FS-Cache provides the following facilities:
|
||||
|
||||
(1) More than one cache can be used at once. Caches can be selected
|
||||
explicitly by use of tags.
|
||||
|
||||
(2) Caches can be added / removed at any time.
|
||||
|
||||
(3) The netfs is provided with an interface that allows either party to
|
||||
withdraw caching facilities from a file (required for (2)).
|
||||
|
||||
(4) The interface to the netfs returns as few errors as possible, preferring
|
||||
rather to let the netfs remain oblivious.
|
||||
|
||||
(5) Cookies are used to represent indices, files and other objects to the
|
||||
netfs. The simplest cookie is just a NULL pointer - indicating nothing
|
||||
cached there.
|
||||
|
||||
(6) The netfs is allowed to propose - dynamically - any index hierarchy it
|
||||
desires, though it must be aware that the index search function is
|
||||
recursive, stack space is limited, and indices can only be children of
|
||||
indices.
|
||||
|
||||
(7) Data I/O is done direct to and from the netfs's pages. The netfs
|
||||
indicates that page A is at index B of the data-file represented by cookie
|
||||
C, and that it should be read or written. The cache backend may or may
|
||||
not start I/O on that page, but if it does, a netfs callback will be
|
||||
invoked to indicate completion. The I/O may be either synchronous or
|
||||
asynchronous.
|
||||
|
||||
(8) Cookies can be "retired" upon release. At this point FS-Cache will mark
|
||||
them as obsolete and the index hierarchy rooted at that point will get
|
||||
recycled.
|
||||
|
||||
(9) The netfs provides a "match" function for index searches. In addition to
|
||||
saying whether a match was made or not, this can also specify that an
|
||||
entry should be updated or deleted.
|
||||
|
||||
(10) As much as possible is done asynchronously.
|
||||
|
||||
|
||||
FS-Cache maintains a virtual indexing tree in which all indices, files, objects
|
||||
and pages are kept. Bits of this tree may actually reside in one or more
|
||||
caches.
|
||||
|
||||
                                           FSDEF
                                             |
                        +------------------------------------+
                        |                                    |
                       NFS                                  AFS
                        |                                    |
           +--------------------------+                +-----------+
           |                          |                |           |
        homedir                    mirror           afs.org   redhat.com
           |                          |                            |
     +------------+           +---------------+              +----------+
     |            |           |               |              |          |
   00001        00002       00007           00125         vol00001   vol00002
     |            |           |               |                 |
 +---+---+     +-----+    +---+         +------+------+   +-----+----+
 |   |   |     |     |    |   |         |      |      |   |     |    |
PG0 PG1 PG2   PG0  XATTR PG0 PG1     DIRENT DIRENT DIRENT R/W  R/O  Bak
                     |                                    |
                    PG0                                   +-------+
                                                          |       |
                                                        00001   00003
                                                          |
                                                      +---+---+
                                                      |   |   |
                                                     PG0 PG1 PG2
|
||||
|
||||
In the example above, you can see two netfs's being backed: NFS and AFS. These
|
||||
have different index hierarchies:
|
||||
|
||||
(*) The NFS primary index contains per-server indices. Each server index is
|
||||
indexed by NFS file handles to get data file objects. Each data file
|
||||
object can have an array of pages, but may also have further child
|
||||
objects, such as extended attributes and directory entries. Extended
|
||||
attribute objects themselves have page-array contents.
|
||||
|
||||
(*) The AFS primary index contains per-cell indices. Each cell index contains
|
||||
per-logical-volume indices. Each volume index contains up to three
|
||||
indices for the read-write, read-only and backup mirrors of those volumes.
|
||||
Each of these contains vnode data file objects, each of which contains an
|
||||
array of pages.
|
||||
|
||||
The very top index is the FS-Cache master index in which individual netfs's
|
||||
have entries.
|
||||
|
||||
Any index object may reside in more than one cache, provided it only has index
|
||||
children. Any index with non-index object children will be assumed to only
|
||||
reside in one cache.
|
||||
|
||||
|
||||
The netfs API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/netfs-api.txt
|
||||
|
||||
The cache backend API to FS-Cache can be found in:
|
||||
|
||||
Documentation/filesystems/caching/backend-api.txt
|
||||
|
||||
A description of the internal representations and object state machine can be
|
||||
found in:
|
||||
|
||||
Documentation/filesystems/caching/object.txt
|
||||
|
||||
|
||||
=======================
|
||||
STATISTICAL INFORMATION
|
||||
=======================
|
||||
|
||||
If FS-Cache is compiled with the following options enabled:
|
||||
|
||||
CONFIG_FSCACHE_STATS=y
|
||||
CONFIG_FSCACHE_HISTOGRAM=y
|
||||
|
||||
then it will gather certain statistics and display them through a number of
|
||||
proc files.
|
||||
|
||||
(*) /proc/fs/fscache/stats
|
||||
|
||||
This shows counts of a number of events that can happen in FS-Cache:
|
||||
|
||||
CLASS   EVENT   MEANING
======= ======= =======================================================
Cookies idx=N   Number of index cookies allocated
        dat=N   Number of data storage cookies allocated
        spc=N   Number of special cookies allocated
Objects alc=N   Number of objects allocated
        nal=N   Number of object allocation failures
        avl=N   Number of objects that reached the available state
        ded=N   Number of objects that reached the dead state
ChkAux  non=N   Number of objects that didn't have a coherency check
        ok=N    Number of objects that passed a coherency check
        upd=N   Number of objects that needed a coherency data update
        obs=N   Number of objects that were declared obsolete
Pages   mrk=N   Number of pages marked as being cached
        unc=N   Number of uncache page requests seen
Acquire n=N     Number of acquire cookie requests seen
        nul=N   Number of acq reqs given a NULL parent
        noc=N   Number of acq reqs rejected due to no cache available
        ok=N    Number of acq reqs succeeded
        nbf=N   Number of acq reqs rejected due to error
        oom=N   Number of acq reqs failed on ENOMEM
Lookups n=N     Number of lookup calls made on cache backends
        neg=N   Number of negative lookups made
        pos=N   Number of positive lookups made
        crt=N   Number of objects created by lookup
Updates n=N     Number of update cookie requests seen
        nul=N   Number of upd reqs given a NULL parent
        run=N   Number of upd reqs granted CPU time
Relinqs n=N     Number of relinquish cookie requests seen
        nul=N   Number of rlq reqs given a NULL parent
        wcr=N   Number of rlq reqs waited on completion of creation
AttrChg n=N     Number of attribute changed requests seen
        ok=N    Number of attr changed requests queued
        nbf=N   Number of attr changed rejected -ENOBUFS
        oom=N   Number of attr changed failed -ENOMEM
        run=N   Number of attr changed ops given CPU time
Allocs  n=N     Number of allocation requests seen
        ok=N    Number of successful alloc reqs
        wt=N    Number of alloc reqs that waited on lookup completion
        nbf=N   Number of alloc reqs rejected -ENOBUFS
        ops=N   Number of alloc reqs submitted
        owt=N   Number of alloc reqs waited for CPU time
Retrvls n=N     Number of retrieval (read) requests seen
        ok=N    Number of successful retr reqs
        wt=N    Number of retr reqs that waited on lookup completion
        nod=N   Number of retr reqs returned -ENODATA
        nbf=N   Number of retr reqs rejected -ENOBUFS
        int=N   Number of retr reqs aborted -ERESTARTSYS
        oom=N   Number of retr reqs failed -ENOMEM
        ops=N   Number of retr reqs submitted
        owt=N   Number of retr reqs waited for CPU time
Stores  n=N     Number of storage (write) requests seen
        ok=N    Number of successful store reqs
        agn=N   Number of store reqs on a page already pending storage
        nbf=N   Number of store reqs rejected -ENOBUFS
        oom=N   Number of store reqs failed -ENOMEM
        ops=N   Number of store reqs submitted
        run=N   Number of store reqs granted CPU time
Ops     pend=N  Number of times async ops added to pending queues
        run=N   Number of times async ops given CPU time
        enq=N   Number of times async ops queued for processing
        dfr=N   Number of async ops queued for deferred release
        rel=N   Number of async ops released
        gc=N    Number of deferred-release async ops garbage collected
|
||||
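The counters lend themselves to simple scripted analysis. The sketch below
parses text in which each line has the form "Class: key=N key=N ..." and
derives a retrieval success ratio. The exact line layout of the proc file is
not specified in this document, so that layout is an assumption, as are the
function name and the sample numbers, which are made up for illustration.

```python
def parse_stats(text):
    """Parse 'Class: key=N key=N' lines into {class: {key: int}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue                      # e.g. a title line
        cls, _, rest = line.partition(":")
        counters = {}
        for field in rest.split():
            key, sep, value = field.partition("=")
            if sep and value.isdigit():
                counters[key] = int(value)
        if counters:
            # A class may be spread over several lines; merge them.
            stats.setdefault(cls.strip(), {}).update(counters)
    return stats


sample = """\
Cookies: idx=2 dat=5 spc=0
Retrvls: n=40 ok=30 wt=1 nod=8 nbf=2 int=0 oom=0
Retrvls: ops=38 owt=3
"""
stats = parse_stats(sample)
retr = stats["Retrvls"]
print(stats["Cookies"])
print("retrievals satisfied from cache:", retr["ok"] / retr["n"])
```

On a live system the same function could be fed the contents of
/proc/fs/fscache/stats, provided the file follows the layout assumed here.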
|
||||
|
||||
(*) /proc/fs/fscache/histogram
|
||||
|
||||
cat /proc/fs/fscache/histogram
JIFS  SECS  OBJ INST  OP RUNS   OBJ RUNS  RETRV DLY RETRIEVLS
===== ===== ========= ========= ========= ========= =========
|
||||
|
||||
This shows the breakdown of the number of times each amount of time
|
||||
between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
|
||||
columns are as follows:
|
||||
|
||||
COLUMN          TIME MEASUREMENT
=======         =======================================================
OBJ INST        Length of time to instantiate an object
OP RUNS         Length of time a call to process an operation took
OBJ RUNS        Length of time a call to process an object event took
RETRV DLY       Time between requesting a read and lookup completing
RETRIEVLS       Time between beginning and end of a retrieval
|
||||
|
||||
Each row shows the number of events that took a particular range of times.
|
||||
Each step is 1 jiffy in size. The JIFS column indicates the particular
|
||||
jiffy range covered, and the SECS field the equivalent number of seconds.
|
||||
|
||||
|
||||
=========
|
||||
DEBUGGING
|
||||
=========
|
||||
|
||||
If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
|
||||
debugging enabled by adjusting the value in:
|
||||
|
||||
/sys/module/fscache/parameters/debug
|
||||
|
||||
This is a bitmask of debugging streams to enable:
|
||||
|
||||
BIT     VALUE   STREAM                          POINT
======= ======= =============================== =======================
0       1       Cache management                Function entry trace
1       2                                       Function exit trace
2       4                                       General
3       8       Cookie management               Function entry trace
4       16                                      Function exit trace
5       32                                      General
6       64      Page handling                   Function entry trace
7       128                                     Function exit trace
8       256                                     General
9       512     Operation management            Function entry trace
10      1024                                    Function exit trace
11      2048                                    General
|
||||
|
||||
The appropriate set of values should be OR'd together and the result written to
|
||||
the control file. For example:
|
||||
|
||||
echo $((1|8|64|512)) >/sys/module/fscache/parameters/debug

will turn on all function entry debugging.
|
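The table follows a regular pattern: each stream owns three consecutive bits,
in the order entry, exit, general. A mask can therefore be computed rather
than looked up. The short names used for streams and points below are this
sketch's own shorthand, not identifiers taken from the kernel.

```python
STREAMS = ("cache", "cookie", "page", "operation")   # table rows, in order
POINTS = ("enter", "leave", "debug")                 # entry, exit, general


def debug_mask(selection):
    """Build the bitmask from (stream, point) pairs."""
    mask = 0
    for stream, point in selection:
        bit = 3 * STREAMS.index(stream) + POINTS.index(point)
        mask |= 1 << bit
    return mask


# Function entry tracing for every stream in the table.
all_entry = debug_mask((s, "enter") for s in STREAMS)
print(all_entry, all_entry == (1 | 8 | 64 | 512))

# General debugging for page handling only (bit 8 in the table).
print(debug_mask([("page", "debug")]))
```

The resulting number is what would be written to
/sys/module/fscache/parameters/debug.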
778
Documentation/filesystems/caching/netfs-api.txt
Normal file
@ -0,0 +1,778 @@
|
||||
===============================
|
||||
FS-CACHE NETWORK FILESYSTEM API
|
||||
===============================
|
||||
|
||||
There's an API by which a network filesystem can make use of the FS-Cache
|
||||
facilities. This is based around a number of principles:
|
||||
|
||||
(1) Caches can store a number of different object types. There are two main
|
||||
object types: indices and files. The first is a special type used by
|
||||
FS-Cache to make finding objects faster and to make retiring of groups of
|
||||
objects easier.
|
||||
|
||||
(2) Every index, file or other object is represented by a cookie. This cookie
|
||||
may or may not have anything associated with it, but the netfs doesn't
|
||||
need to care.
|
||||
|
||||
(3) Barring the top-level index (one entry per cached netfs), the index
|
||||
hierarchy for each netfs is structured according to the whim of the netfs.
|
||||
|
||||
This API is declared in <linux/fscache.h>.
|
||||
|
||||
This document contains the following sections:
|
||||
|
||||
(1) Network filesystem definition
|
||||
(2) Index definition
|
||||
(3) Object definition
|
||||
(4) Network filesystem (un)registration
|
||||
(5) Cache tag lookup
|
||||
(6) Index registration
|
||||
(7) Data file registration
|
||||
(8) Miscellaneous object registration
|
||||
(9) Setting the data file size
|
||||
(10) Page alloc/read/write
|
||||
(11) Page uncaching
|
||||
(12) Index and data file update
|
||||
(13) Miscellaneous cookie operations
|
||||
(14) Cookie unregistration
|
||||
(15) Index and data file invalidation
|
||||
(16) FS-Cache specific page flags.
|
||||
|
||||
|
||||
=============================
|
||||
NETWORK FILESYSTEM DEFINITION
|
||||
=============================
|
||||
|
||||
FS-Cache needs a description of the network filesystem. This is specified
|
||||
using a record of the following structure:
|
||||
|
||||
struct fscache_netfs {
        uint32_t                        version;
        const char                      *name;
        struct fscache_cookie           *primary_index;
        ...
};
|
||||
|
||||
The first two fields should be filled in before registration, and the third
|
||||
will be filled in by the registration function; any other fields should just be
|
||||
ignored and are for internal use only.
|
||||
|
||||
The fields are:
|
||||
|
||||
(1) The name of the netfs (used as the key in the toplevel index).
|
||||
|
||||
(2) The version of the netfs (if the name matches but the version doesn't, the
|
||||
entire in-cache hierarchy for this netfs will be scrapped and begun
|
||||
afresh).
|
||||
|
||||
(3) The cookie representing the primary index will be allocated according to
|
||||
another parameter passed into the registration function.
|
||||
|
||||
For example, kAFS (linux/fs/afs/) uses the following definitions to describe
|
||||
itself:
|
||||
|
||||
struct fscache_netfs afs_cache_netfs = {
|
||||
.version = 0,
|
||||
.name = "afs",
|
||||
};
|
||||
|
||||
|
||||
================
|
||||
INDEX DEFINITION
|
||||
================
|
||||
|
||||
Indices are used for two purposes:
|
||||
|
||||
(1) To aid the finding of a file based on a series of keys (such as AFS's
|
||||
"cell", "volume ID", "vnode ID").
|
||||
|
||||
(2) To make it easier to discard a subset of all the files cached based around
|
||||
a particular key - for instance to mirror the removal of an AFS volume.
|
||||
|
||||
However, since it's unlikely that any two netfs's are going to want to define
|
||||
their index hierarchies in quite the same way, FS-Cache tries to impose as few
|
||||
restraints as possible on how an index is structured and where it is placed in
|
||||
the tree. The netfs can even mix indices and data files at the same level, but
|
||||
it's not recommended.
|
||||
|
||||
Each index entry consists of a key of indeterminate length plus some auxiliary
|
||||
data, also of indeterminate length.
|
||||
|
||||
There are some limits on indices:
|
||||
|
||||
(1) Any index containing non-index objects should be restricted to a single
|
||||
cache. Any such objects created within an index will be created in the
|
||||
first cache only. The cache in which an index is created can be
|
||||
controlled by cache tags (see below).
|
||||
|
||||
(2) The entry data must be atomically journallable, so it is limited to about
|
||||
400 bytes at present. At least 400 bytes will be available.
|
||||
|
||||
(3) The depth of the index tree should be judged with care as the search
|
||||
function is recursive. Too many layers will run the kernel out of stack.
|
||||
|
||||
|
||||
=================
|
||||
OBJECT DEFINITION
|
||||
=================
|
||||
|
||||
To define an object, a structure of the following type should be filled out:
|
||||
|
||||
struct fscache_cookie_def
|
||||
{
|
||||
uint8_t name[16];
|
||||
uint8_t type;
|
||||
|
||||
struct fscache_cache_tag *(*select_cache)(
|
||||
const void *parent_netfs_data,
|
||||
const void *cookie_netfs_data);
|
||||
|
||||
uint16_t (*get_key)(const void *cookie_netfs_data,
|
||||
void *buffer,
|
||||
uint16_t bufmax);
|
||||
|
||||
void (*get_attr)(const void *cookie_netfs_data,
|
||||
uint64_t *size);
|
||||
|
||||
uint16_t (*get_aux)(const void *cookie_netfs_data,
|
||||
void *buffer,
|
||||
uint16_t bufmax);
|
||||
|
||||
enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
|
||||
const void *data,
|
||||
uint16_t datalen);
|
||||
|
||||
void (*get_context)(void *cookie_netfs_data, void *context);
|
||||
|
||||
void (*put_context)(void *cookie_netfs_data, void *context);
|
||||
|
||||
void (*mark_pages_cached)(void *cookie_netfs_data,
|
||||
struct address_space *mapping,
|
||||
struct pagevec *cached_pvec);
|
||||
|
||||
void (*now_uncached)(void *cookie_netfs_data);
|
||||
};
|
||||
|
||||
This has the following fields:
|
||||
|
||||
(1) The type of the object [mandatory].
|
||||
|
||||
This is one of the following values:
|
||||
|
||||
(*) FSCACHE_COOKIE_TYPE_INDEX
|
||||
|
||||
This defines an index, which is a special FS-Cache type.
|
||||
|
||||
(*) FSCACHE_COOKIE_TYPE_DATAFILE
|
||||
|
||||
This defines an ordinary data file.
|
||||
|
||||
(*) Any other value between 2 and 255
|
||||
|
||||
This defines an extraordinary object such as an XATTR.
|
||||
|
||||
(2) The name of the object type (NUL terminated unless all 16 chars are used)
|
||||
[optional].
|
||||
|
||||
(3) A function to select the cache in which to store an index [optional].
|
||||
|
||||
This function is invoked when an index needs to be instantiated in a cache
|
||||
during the instantiation of a non-index object. Only the immediate index
|
||||
parent for the non-index object will be queried. Any indices above that
|
||||
in the hierarchy may be stored in multiple caches. This function does not
|
||||
need to be supplied for any non-index object or any index that will only
|
||||
have index children.
|
||||
|
||||
If this function is not supplied or if it returns NULL then the first
|
||||
cache in the parent's list will be chosen, or failing that, the first
|
||||
cache in the master list.
|
||||
|
||||
(4) A function to retrieve an object's key from the netfs [mandatory].
|
||||
|
||||
This function will be called with the netfs data that was passed to the
|
||||
cookie acquisition function and the maximum length of key data that it may
|
||||
provide. It should write the required key data into the given buffer and
|
||||
return the quantity it wrote.
|
||||
|
||||
(5) A function to retrieve attribute data from the netfs [optional].
|
||||
|
||||
This function will be called with the netfs data that was passed to the
|
||||
cookie acquisition function. It should return the size of the file if
|
||||
this is a data file. The size may be used to govern how much space must
be reserved for this file in the cache.
|
||||
|
||||
If the function is absent, a file size of 0 is assumed.
|
||||
|
||||
(6) A function to retrieve auxiliary data from the netfs [optional].
|
||||
|
||||
This function will be called with the netfs data that was passed to the
|
||||
cookie acquisition function and the maximum length of auxiliary data that
it may provide. It should write the auxiliary data into the given buffer
|
||||
and return the quantity it wrote.
|
||||
|
||||
If this function is absent, the auxiliary data length will be set to 0.
|
||||
|
||||
The length of the auxiliary data buffer may be dependent on the key
|
||||
length. A netfs mustn't rely on being able to provide more than 400 bytes
|
||||
for both.
|
||||
|
||||
(7) A function to check the auxiliary data [optional].
|
||||
|
||||
This function will be called to check that a match found in the cache for
|
||||
this object is valid. For instance with AFS it could check the auxiliary
|
||||
data against the data version number returned by the server to determine
|
||||
whether the index entry in a cache is still valid.
|
||||
|
||||
If this function is absent, it will be assumed that matching objects in a
|
||||
cache are always valid.
|
||||
|
||||
If present, the function should return one of the following values:
|
||||
|
||||
(*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
|
||||
(*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
|
||||
(*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
|
||||
|
||||
This function can also be used to extract data from the auxiliary data in
|
||||
the cache and copy it into the netfs's structures.
|
||||
|
||||
(8) A pair of functions to manage contexts for the completion callback
|
||||
[optional].
|
||||
|
||||
The cache read/write functions are passed a context which is then passed
|
||||
to the I/O completion callback function. To ensure this context remains
|
||||
valid until after the I/O completion is called, two functions may be
|
||||
provided: one to get an extra reference on the context, and one to drop a
|
||||
reference to it.
|
||||
|
||||
If the context is not used or is a type of object that won't go out of
|
||||
scope, then these functions are not required. These functions are not
|
||||
required for indices as indices may not contain data. These functions may
|
||||
be called in interrupt context and so may not sleep.
|
||||
|
||||
(9) A function to mark a page as retaining cache metadata [optional].
|
||||
|
||||
This is called by the cache to indicate that it is retaining in-memory
|
||||
information for this page and that the netfs should uncache the page when
|
||||
it has finished. This does not indicate whether there's data on the disk
|
||||
or not. Note that several pages at once may be presented for marking.
|
||||
|
||||
The PG_fscache bit is set on the pages before this function would be
|
||||
called, so the function need not be provided if this is sufficient.
|
||||
|
||||
This function is not required for indices as they're not permitted data.
|
||||
|
||||
(10) A function to unmark all the pages retaining cache metadata [mandatory].
|
||||
|
||||
This is called by FS-Cache to indicate that a backing store is being
|
||||
unbound from a cookie and that all the marks on the pages should be
|
||||
cleared to prevent confusion. Note that the cache will have torn down all
|
||||
its tracking information so that the pages don't need to be explicitly
|
||||
uncached.
|
||||
|
||||
This function is not required for indices as they're not permitted data.
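
As a rough illustration of the get_key(), get_aux() and check_aux() operations
described above, the following userspace sketch mimics their signatures. The
demo_vnode structure, the demo_* function names and the choice to return 0 when
the data does not fit are assumptions made for illustration only. The enum is a
local stand-in for the kernel's definition, and none of this is taken from kAFS
or any other real netfs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the kernel enum of the same name (values assumed). */
enum fscache_checkaux {
	FSCACHE_CHECKAUX_OKAY,
	FSCACHE_CHECKAUX_NEEDS_UPDATE,
	FSCACHE_CHECKAUX_OBSOLETE,
};

/* Hypothetical netfs object: keyed by an ID, validated by a data version. */
struct demo_vnode {
	uint32_t vnode_id;
	uint64_t data_version;
};

/* get_key: write the key into the buffer and return the quantity written. */
static uint16_t demo_get_key(const void *cookie_netfs_data,
			     void *buffer, uint16_t bufmax)
{
	const struct demo_vnode *vnode = cookie_netfs_data;

	assert(vnode && buffer);
	if (bufmax < sizeof(vnode->vnode_id))
		return 0;	/* assumption: report nothing written */
	memcpy(buffer, &vnode->vnode_id, sizeof(vnode->vnode_id));
	return sizeof(vnode->vnode_id);
}

/* get_aux: store the data version as the auxiliary data. */
static uint16_t demo_get_aux(const void *cookie_netfs_data,
			     void *buffer, uint16_t bufmax)
{
	const struct demo_vnode *vnode = cookie_netfs_data;

	assert(vnode && buffer);
	if (bufmax < sizeof(vnode->data_version))
		return 0;	/* assumption: report nothing written */
	memcpy(buffer, &vnode->data_version, sizeof(vnode->data_version));
	return sizeof(vnode->data_version);
}

/* check_aux: compare the cached data version with the current one. */
static enum fscache_checkaux demo_check_aux(void *cookie_netfs_data,
					    const void *data,
					    uint16_t datalen)
{
	const struct demo_vnode *vnode = cookie_netfs_data;
	uint64_t cached_version;

	if (datalen != sizeof(cached_version))
		return FSCACHE_CHECKAUX_OBSOLETE;
	memcpy(&cached_version, data, sizeof(cached_version));
	return cached_version == vnode->data_version ?
		FSCACHE_CHECKAUX_OKAY : FSCACHE_CHECKAUX_OBSOLETE;
}
```

In this sketch a version mismatch simply marks the entry obsolete; a netfs that
can repair an entry in place might return FSCACHE_CHECKAUX_NEEDS_UPDATE instead.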
|
||||
|
||||
|
||||
===================================
|
||||
NETWORK FILESYSTEM (UN)REGISTRATION
|
||||
===================================
|
||||
|
||||
The first step is to declare the network filesystem to the cache. This also
|
||||
involves specifying the layout of the primary index (for AFS, this would be the
|
||||
"cell" level).
|
||||
|
||||
The registration function is:
|
||||
|
||||
int fscache_register_netfs(struct fscache_netfs *netfs);
|
||||
|
||||
It just takes a pointer to the netfs definition. It returns 0 or an error as
|
||||
appropriate.
|
||||
|
||||
For kAFS, registration is done as follows:
|
||||
|
||||
ret = fscache_register_netfs(&afs_cache_netfs);
|
||||
|
||||
The last step is, of course, unregistration:
|
||||
|
||||
void fscache_unregister_netfs(struct fscache_netfs *netfs);
|
||||
|
||||
|
||||
================
|
||||
CACHE TAG LOOKUP
|
||||
================
|
||||
|
||||
FS-Cache permits the use of more than one cache. To permit particular index
|
||||
subtrees to be bound to particular caches, the second step is to look up cache
|
||||
representation tags. This step is optional; it can be left entirely up to
|
||||
FS-Cache as to which cache should be used. The problem with doing that is that
|
||||
FS-Cache will always pick the first cache that was registered.
|
||||
|
||||
To get the representation for a named tag:
|
||||
|
||||
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
|
||||
|
||||
This takes a text string as the name and returns a representation of a tag. It
|
||||
will never return an error. It may return a dummy tag, however, if it runs out
|
||||
of memory; this will inhibit caching with this tag.
|
||||
|
||||
Any representation so obtained must be released by passing it to this function:
|
||||
|
||||
void fscache_release_cache_tag(struct fscache_cache_tag *tag);
|
||||
|
||||
The tag will be retrieved by FS-Cache when it calls the object definition
|
||||
operation select_cache().
|
||||
|
||||
|
||||
==================
|
||||
INDEX REGISTRATION
|
||||
==================
|
||||
|
||||
The third step is to inform FS-Cache about part of an index hierarchy that can
|
||||
be used to locate files. This is done by requesting a cookie for each index in
|
||||
the path to the file:
|
||||
|
||||
struct fscache_cookie *
|
||||
fscache_acquire_cookie(struct fscache_cookie *parent,
|
||||
const struct fscache_cookie_def *def,
|
||||
void *netfs_data);
|
||||
|
||||
This function creates an index entry in the index represented by parent,
|
||||
filling in the index entry by calling the operations pointed to by def.
|
||||
|
||||
Note that this function never returns an error - all errors are handled
|
||||
internally. It may, however, return NULL to indicate no cookie. It is quite
|
||||
acceptable to pass this token back to this function as the parent to another
|
||||
acquisition (or even to the relinquish cookie, read page and write page
|
||||
functions - see below).
|
||||
|
||||
Note also that no indices are actually created in a cache until a non-index
|
||||
object needs to be created somewhere down the hierarchy. Furthermore, an index
|
||||
may be created in several different caches independently at different times.
|
||||
This is all handled transparently, and the netfs doesn't see any of it.
|
||||
|
||||
For example, with AFS, a cell would be added to the primary index. This index
|
||||
entry would have a dependent inode containing a volume location index for the
|
||||
volume mappings within this cell:
|
||||
|
||||
cell->cache =
|
||||
fscache_acquire_cookie(afs_cache_netfs.primary_index,
|
||||
&afs_cell_cache_index_def,
|
||||
cell);
|
||||
|
||||
Then when a volume location was accessed, it would be entered into the cell's
|
||||
index and an inode would be allocated that acts as a volume type and hash chain
|
||||
combination:
|
||||
|
||||
vlocation->cache =
|
||||
fscache_acquire_cookie(cell->cache,
|
||||
&afs_vlocation_cache_index_def,
|
||||
vlocation);
|
||||
|
||||
And then a particular flavour of volume (R/O for example) could be added to
|
||||
that index, creating another index for vnodes (AFS inode equivalents):
|
||||
|
||||
volume->cache =
|
||||
fscache_acquire_cookie(vlocation->cache,
|
||||
&afs_volume_cache_index_def,
|
||||
volume);
|
||||
|
||||
|
||||
======================
|
||||
DATA FILE REGISTRATION
|
||||
======================
|
||||
|
||||
The fourth step is to request a data file be created in the cache. This is
|
||||
identical to index cookie acquisition. The only difference is that the type in
|
||||
the object definition should be something other than index type.
|
||||
|
||||
vnode->cache =
|
||||
fscache_acquire_cookie(volume->cache,
|
||||
&afs_vnode_cache_object_def,
|
||||
vnode);
|
||||
|
||||
|
||||
=================================
|
||||
MISCELLANEOUS OBJECT REGISTRATION
|
||||
=================================
|
||||
|
||||
An optional step is to request an object of miscellaneous type be created in
|
||||
the cache. This is almost identical to index cookie acquisition. The only
|
||||
difference is that the type in the object definition should be something other
|
||||
than index type. Whilst the parent object could be an index, it's more likely
|
||||
it would be some other type of object such as a data file.
|
||||
|
||||
xattr->cache =
|
||||
fscache_acquire_cookie(vnode->cache,
|
||||
&afs_xattr_cache_object_def,
|
||||
xattr);
|
||||
|
||||
Miscellaneous objects might be used to store extended attributes or directory
|
||||
entries for example.
|
||||
|
||||
|
||||
==========================
|
||||
SETTING THE DATA FILE SIZE
|
||||
==========================
|
||||
|
||||
The fifth step is to set the physical attributes of the file, such as its size.
|
||||
This doesn't automatically reserve any space in the cache, but permits the
|
||||
cache to adjust its metadata for data tracking appropriately:
|
||||
|
||||
int fscache_attr_changed(struct fscache_cookie *cookie);
|
||||
|
||||
The cache will return -ENOBUFS if there is no backing cache or if there is no
|
||||
space to allocate any extra metadata required in the cache. The attributes
|
||||
will be accessed with the get_attr() cookie definition operation.
|
||||
|
||||
Note that attempts to read or write data pages in the cache over this size may
|
||||
be rebuffed with -ENOBUFS.
|
||||
|
||||
This operation schedules an attribute adjustment to happen asynchronously at
|
||||
some point in the future, and as such, it may happen after the function returns
|
||||
to the caller. The attribute adjustment excludes read and write operations.
|
||||
|
||||
|
||||
=====================
|
||||
PAGE READ/ALLOC/WRITE
|
||||
=====================
|
||||
|
||||
And the sixth step is to store and retrieve pages in the cache. There are
|
||||
three functions that are used to do this.
|
||||
|
||||
Note:
|
||||
|
||||
(1) A page should not be re-read or re-allocated without uncaching it first.
|
||||
|
||||
(2) A read or allocated page must be uncached when the netfs page is released
|
||||
from the pagecache.
|
||||
|
||||
(3) A page should only be written to the cache if previously read or allocated.
|
||||
|
||||
This permits the cache to maintain its page tracking in proper order.
|
||||
|
||||
|
||||
PAGE READ
|
||||
---------
|
||||
|
||||
Firstly, the netfs should ask FS-Cache to examine the caches and read the
|
||||
contents cached for a particular page of a particular file if present, or else
|
||||
allocate space to store the contents if not:
|
||||
|
||||
typedef
|
||||
void (*fscache_rw_complete_t)(struct page *page,
|
||||
void *context,
|
||||
int error);
|
||||
|
||||
int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
fscache_rw_complete_t end_io_func,
|
||||
void *context,
|
||||
gfp_t gfp);
|
||||
|
||||
The cookie argument must specify a cookie for an object that isn't an index,
|
||||
the page specified will have the data loaded into it (and is also used to
|
||||
specify the page number), and the gfp argument is used to control how any
|
||||
memory allocations made are satisfied.
|
||||
|
||||
If the cookie indicates the inode is not cached:
|
||||
|
||||
(1) The function will return -ENOBUFS.
|
||||
|
||||
Else if there's a copy of the page resident in the cache:
|
||||
|
||||
(1) The mark_pages_cached() cookie operation will be called on that page.
|
||||
|
||||
(2) The function will submit a request to read the data from the cache's
|
||||
backing device directly into the page specified.
|
||||
|
||||
(3) The function will return 0.
|
||||
|
||||
(4) When the read is complete, end_io_func() will be invoked with:
|
||||
|
||||
(*) The netfs data supplied when the cookie was created.
|
||||
|
||||
(*) The page descriptor.
|
||||
|
||||
(*) The context argument passed to the above function. This will be
|
||||
maintained with the get_context/put_context functions mentioned above.
|
||||
|
||||
(*) An argument that's 0 on success or negative for an error code.
|
||||
|
||||
If an error occurs, it should be assumed that the page contains no usable
|
||||
data.
|
||||
|
||||
end_io_func() will be called in process context if the read results in
|
||||
an error, but it might be called in interrupt context if the read is
|
||||
successful.
|
||||
|
||||
Otherwise, if there's not a copy available in cache, but the cache may be able
|
||||
to store the page:
|
||||
|
||||
(1) The mark_pages_cached() cookie operation will be called on that page.
|
||||
|
||||
(2) A block may be reserved in the cache and attached to the object at the
|
||||
appropriate place.
|
||||
|
||||
(3) The function will return -ENODATA.
|
||||
|
||||
This function may also return -ENOMEM or -EINTR, in which case it won't have
|
||||
read any data from the cache.
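
The return values listed above suggest a small dispatch step in the netfs's
read path. The following userspace sketch classifies them; the demo_* names and
the action labels are hypothetical, and what a netfs actually does in each case
(for example, falling back to the server) is an assumption rather than
something this API mandates.

```c
#include <errno.h>

/* Hypothetical follow-up actions for a netfs read path. */
enum demo_read_action {
	DEMO_READ_DISPATCHED,	/* 0: wait for end_io_func() to be called */
	DEMO_FETCH_AND_STORE,	/* -ENODATA: fetch from server, then write to cache */
	DEMO_FETCH_ONLY,	/* -ENOBUFS: fetch from server, page is not cached */
	DEMO_ERROR,		/* -ENOMEM, -EINTR, ...: nothing was read */
};

/* Map a fscache_read_or_alloc_page() style return value onto an action. */
static enum demo_read_action demo_classify_read(int ret)
{
	switch (ret) {
	case 0:
		return DEMO_READ_DISPATCHED;
	case -ENODATA:
		return DEMO_FETCH_AND_STORE;
	case -ENOBUFS:
		return DEMO_FETCH_ONLY;
	default:
		return DEMO_ERROR;
	}
}
```

Note that in the 0 and -ENODATA cases the page has been marked and so must be
uncached when the netfs releases it.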
|
||||
|
||||
|
||||
PAGE ALLOCATE
|
||||
-------------
|
||||
|
||||
Alternatively, if there's not expected to be any data in the cache for a page
|
||||
because the file has been extended, a block can simply be allocated instead:
|
||||
|
||||
int fscache_alloc_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
gfp_t gfp);
|
||||
|
||||
This is similar to the fscache_read_or_alloc_page() function, except that it
|
||||
never reads from the cache. It will return 0 if a block has been allocated,
|
||||
rather than -ENODATA as the other would. One or the other must be performed
|
||||
before writing to the cache.
|
||||
|
||||
The mark_pages_cached() cookie operation will be called on the page if
|
||||
successful.
|
||||
|
||||
|
||||
PAGE WRITE
|
||||
----------
|
||||
|
||||
Secondly, if the netfs changes the contents of the page (either due to an
|
||||
initial download or if a user performs a write), then the page should be
|
||||
written back to the cache:
|
||||
|
||||
int fscache_write_page(struct fscache_cookie *cookie,
|
||||
struct page *page,
|
||||
gfp_t gfp);
|
||||
|
||||
The cookie argument must specify a data file cookie, the page specified should
|
||||
contain the data to be written (and is also used to specify the page number),
|
||||
and the gfp argument is used to control how any memory allocations made are
|
||||
satisfied.
|
||||
|
||||
The page must have first been read or allocated successfully and must not have
|
||||
been uncached before writing is performed.
|
||||
|
||||
If the cookie indicates the inode is not cached then:
|
||||
|
||||
(1) The function will return -ENOBUFS.
|
||||
|
||||
Else if space can be allocated in the cache to hold this page:
|
||||
|
||||
(1) PG_fscache_write will be set on the page.
|
||||
|
||||
(2) The function will submit a request to write the data to the cache's backing
|
||||
device directly from the page specified.
|
||||
|
||||
(3) The function will return 0.
|
||||
|
||||
(4) When the write is complete PG_fscache_write is cleared on the page and
|
||||
anyone waiting for that bit will be woken up.
|
||||
|
||||
Else if there's no space available in the cache, -ENOBUFS will be returned. It
|
||||
is also possible for the PG_fscache_write bit to be cleared when no write took
|
||||
place if unforeseen circumstances arose (such as a disk error).
|
||||
|
||||
Writing takes place asynchronously.
|
||||
|
||||
|
||||
MULTIPLE PAGE READ
|
||||
------------------
|
||||
|
||||
A facility is provided to read several pages at once, as requested by the
|
||||
readpages() address space operation:
|
||||
|
||||
int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
|
||||
struct address_space *mapping,
|
||||
struct list_head *pages,
|
||||
int *nr_pages,
|
||||
fscache_rw_complete_t end_io_func,
|
||||
void *context,
|
||||
gfp_t gfp);
|
||||
|
||||
This works in a similar way to fscache_read_or_alloc_page(), except:
|
||||
|
||||
(1) Any page it can retrieve data for is removed from pages and nr_pages and
|
||||
dispatched for reading to the disk. Reads of adjacent pages on disk may
|
||||
be merged for greater efficiency.
|
||||
|
||||
(2) The mark_pages_cached() cookie operation will be called on several pages
|
||||
at once if they're being read or allocated.
|
||||
|
||||
(3) If there was a general error, then that error will be returned.
|
||||
|
||||
Else if some pages couldn't be allocated or read, then -ENOBUFS will be
|
||||
returned.
|
||||
|
||||
Else if some pages couldn't be read but were allocated, then -ENODATA will
|
||||
be returned.
|
||||
|
||||
Otherwise, if all pages had reads dispatched, then 0 will be returned, the
|
||||
list will be empty and *nr_pages will be 0.
|
||||
|
||||
(4) end_io_func will be called once for each page being read as the reads
|
||||
complete. It will be called in process context if error != 0, but it may
|
||||
be called in interrupt context if there is no error.
|
||||
|
||||
Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
|
||||
some of the pages being read and some being allocated. Those pages will have
|
||||
been marked appropriately and will need uncaching.
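
The order of precedence among the return values in point (3) can be summarised
in a few lines of C. This is only a userspace sketch of the rule as stated
above; the demo_batch_result structure and its counters are hypothetical and do
not correspond to anything in FS-Cache itself.

```c
#include <errno.h>

/* Hypothetical tally of what happened to the pages in one request. */
struct demo_batch_result {
	int general_error;		/* negative errno, or 0 if none */
	unsigned int nr_read;		/* pages that had reads dispatched */
	unsigned int nr_allocated;	/* pages allocated but not read */
	unsigned int nr_refused;	/* pages neither read nor allocated */
};

/* Apply the precedence: general error, then -ENOBUFS, then -ENODATA, then 0. */
static int demo_batch_return(const struct demo_batch_result *r)
{
	if (r->general_error)
		return r->general_error;
	if (r->nr_refused)
		return -ENOBUFS;
	if (r->nr_allocated)
		return -ENODATA;
	return 0;
}
```

A non-zero return therefore says nothing about nr_read: some pages may still
have been dispatched for reading and will need uncaching later.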
|
||||
|
||||
|
||||
==============
|
||||
PAGE UNCACHING
|
||||
==============
|
||||
|
||||
To uncache a page, this function should be called:
|
||||
|
||||
void fscache_uncache_page(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
|
||||
This function permits the cache to release any in-memory representation it
|
||||
might be holding for this netfs page. This function must be called once for
|
||||
each page on which the read or write page functions above have been called to
|
||||
make sure the cache's in-memory tracking information gets torn down.
|
||||
|
||||
Note that pages can't be explicitly deleted from a data file. The whole
|
||||
data file must be retired (see the relinquish cookie function below).
|
||||
|
||||
Furthermore, note that this does not cancel the asynchronous read or write
|
||||
operation started by the read/alloc and write functions, so the page
|
||||
invalidation and release functions must use:
|
||||
|
||||
bool fscache_check_page_write(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
|
||||
to see if a page is being written to the cache, and:
|
||||
|
||||
void fscache_wait_on_page_write(struct fscache_cookie *cookie,
|
||||
struct page *page);
|
||||
|
||||
to wait for it to finish if it is.
|
||||
|
||||
|
||||
==========================
|
||||
INDEX AND DATA FILE UPDATE
|
||||
==========================
|
||||
|
||||
To request an update of the index data for an index or other object, the
|
||||
following function should be called:
|
||||
|
||||
void fscache_update_cookie(struct fscache_cookie *cookie);
|
||||
|
||||
This function will refer back to the netfs_data pointer stored in the cookie by
|
||||
the acquisition function to obtain the data to write into each revised index
|
||||
entry. The update method in the parent index definition will be called to
|
||||
transfer the data.
|
||||
|
||||
Note that partial updates may happen automatically at other times, such as when
|
||||
data blocks are added to a data file object.
|
||||
|
||||
|
||||
===============================
|
||||
MISCELLANEOUS COOKIE OPERATIONS
|
||||
===============================
|
||||
|
||||
There are a number of operations that can be used to control cookies:
|
||||
|
||||
(*) Cookie pinning:
|
||||
|
||||
int fscache_pin_cookie(struct fscache_cookie *cookie);
|
||||
void fscache_unpin_cookie(struct fscache_cookie *cookie);
|
||||
|
||||
These operations permit data cookies to be pinned into the cache and to
|
||||
have the pinning removed. They are not permitted on index cookies.
|
||||
|
||||
The pinning function will return 0 if successful, -ENOBUFS if the cookie
|
||||
isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
|
||||
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
|
||||
-EIO if there's any other problem.
|
||||
|
||||
(*) Data space reservation:
|
||||
|
||||
int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
|
||||
|
||||
This permits a netfs to request cache space be reserved to store up to the
|
||||
given amount of a file. It is permitted to ask for more than the current
|
||||
size of the file to allow for future file expansion.
|
||||
|
||||
If size is given as zero then the reservation will be cancelled.
|
||||
|
||||
The function will return 0 if successful, -ENOBUFS if the cookie isn't
|
||||
backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
|
||||
-ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
|
||||
-EIO if there's any other problem.
|
||||
|
||||
Note that this doesn't pin an object in a cache; it can still be culled to
|
||||
make space if it's not in use.
|
||||
|
||||
|
||||
=====================
|
||||
COOKIE UNREGISTRATION
|
||||
=====================
|
||||
|
||||
To get rid of a cookie, this function should be called.
|
||||
|
||||
void fscache_relinquish_cookie(struct fscache_cookie *cookie,
|
||||
int retire);
|
||||
|
||||
If retire is non-zero, then the object will be marked for recycling, and all
|
||||
copies of it will be removed from all active caches in which it is present.
|
||||
Not only that but all child objects will also be retired.
|
||||
|
||||
If retire is zero, then the object may be available again when next the
|
||||
acquisition function is called. Retirement here will overrule the pinning on a
|
||||
cookie.
|
||||
|
||||
One very important note - relinquish must NOT be called for a cookie unless all
|
||||
the cookies for "child" indices, objects and pages have been relinquished
|
||||
first.
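
The ordering rule above amounts to a post-order walk of the netfs's own object
tree. The following userspace sketch shows one way to arrange that; the
demo_cookie structure, the fixed-size child array and demo_relinquish_tree()
are all hypothetical, and the assignment marked below stands in for the real
call to fscache_relinquish_cookie().

```c
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical netfs-side record of an acquired cookie and its children. */
struct demo_cookie {
	const char *name;
	struct demo_cookie *children[DEMO_MAX_CHILDREN];
	size_t nr_children;
	int retired;		/* retire flag that was passed down */
	int relinquish_order;	/* 1 = relinquished first, 0 = not yet */
};

static int demo_order;	/* counts relinquishments in this sketch */

/* Relinquish every child before the parent, as the rule above requires. */
static void demo_relinquish_tree(struct demo_cookie *cookie, int retire)
{
	size_t i;

	for (i = 0; i < cookie->nr_children; i++)
		demo_relinquish_tree(cookie->children[i], retire);

	/* Stand-in for fscache_relinquish_cookie(cookie, retire). */
	cookie->retired = retire;
	cookie->relinquish_order = ++demo_order;
}
```

A real netfs would normally get this ordering from its own object lifetimes
(vnodes are torn down before their volume, and so on) rather than from an
explicit recursive walk.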
|
||||
|
||||
|
||||
================================
|
||||
INDEX AND DATA FILE INVALIDATION
|
||||
================================
|
||||
|
||||
There is no direct way to invalidate an index subtree or a data file. To do
|
||||
this, the caller should relinquish and retire the cookie they have, and then
|
||||
acquire a new one.
|
||||
|
||||
|
||||
===========================
|
||||
FS-CACHE SPECIFIC PAGE FLAG
|
||||
===========================
|
||||
|
||||
FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
|
||||
given the alternative name PG_fscache.
|
||||
|
||||
PG_fscache is used to indicate that the page is known by the cache, and that
|
||||
the cache must be informed if the page is going to go away. It's an indication
|
||||
to the netfs that the cache has an interest in this page, where an interest may
|
||||
be a pointer to it, resources allocated or reserved for it, or I/O in progress
|
||||
upon it.
|
||||
|
||||
The netfs can use this information in methods such as releasepage() to
|
||||
determine whether it needs to uncache a page or update it.
|
||||
|
||||
Furthermore, if this bit is set, releasepage() and invalidatepage() operations
|
||||
will be called on a page to get rid of it, even if PG_private is not set. This
|
||||
allows caching to be attempted on a page, and read_cache_pages() to be called
after fscache_read_or_alloc_pages(), as the former will try to release pages it
|
||||
was given under certain circumstances.
|
||||
|
||||
This bit does not overlap with other bits such as PG_private. This means that FS-Cache
|
||||
can be used with a filesystem that uses the block buffering code.
|
||||
|
||||
There are a number of operations defined on this flag:
|
||||
|
||||
int PageFsCache(struct page *page);
void SetPageFsCache(struct page *page);
void ClearPageFsCache(struct page *page);
int TestSetPageFsCache(struct page *page);
int TestClearPageFsCache(struct page *page);
|
||||
|
||||
These functions are bit test, bit set, bit clear, bit test and set and bit
|
||||
test and clear operations on PG_fscache.
|
313
Documentation/filesystems/caching/object.txt
Normal file
@ -0,0 +1,313 @@
|
||||
====================================================
|
||||
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
|
||||
====================================================
|
||||
|
||||
By: David Howells <dhowells@redhat.com>
|
||||
|
||||
Contents:
|
||||
|
||||
(*) Representation
|
||||
|
||||
(*) Object management state machine.
|
||||
|
||||
- Provision of cpu time.
|
||||
- Locking simplification.
|
||||
|
||||
(*) The set of states.
|
||||
|
||||
(*) The set of events.
|
||||
|
||||
|
||||
==============
|
||||
REPRESENTATION
|
||||
==============
|
||||
|
||||
FS-Cache maintains an in-kernel representation of each object that a netfs is
|
||||
currently interested in. Such objects are represented by the fscache_cookie
|
||||
struct and are referred to as cookies.
|
||||
|
||||
FS-Cache also maintains a separate in-kernel representation of the objects that
|
||||
a cache backend is currently actively caching. Such objects are represented by
|
||||
the fscache_object struct. The cache backends allocate these upon request, and
|
||||
are expected to embed them in their own representations. These are referred to
|
||||
as objects.
|
||||
|
||||
There is a 1:N relationship between cookies and objects. A cookie may be
|
||||
represented by multiple objects - an index may exist in more than one cache -
|
||||
or even by no objects (it may not be cached).
|
||||
|
||||
Furthermore, both cookies and objects are hierarchical. The two hierarchies
|
||||
correspond, but the cookies tree is a superset of the union of the object trees
|
||||
of multiple caches:
|
||||
|
||||
    NETFS INDEX TREE               :      CACHE 1     :      CACHE 2
                                   :                  :
                                   :   +-----------+  :
                          +----------->|  IObject  |  :
      +-----------+       |        :   +-----------+  :
      |  ICookie  |-------+        :         |        :
      +-----------+       |        :         |        :   +-----------+
            |             +------------------------------>|  IObject  |
            |                      :         |        :   +-----------+
            |                      :         V        :         |
            |                      :   +-----------+  :         |
            V             +----------->|  IObject  |  :         |
      +-----------+       |        :   +-----------+  :         |
      |  ICookie  |-------+        :         |        :         V
      +-----------+       |        :         |        :   +-----------+
            |             +------------------------------>|  IObject  |
      +-----+-----+                :         |        :   +-----------+
      |           |                :         |        :         |
      V           |                :         V        :         |
+-----------+     |                :   +-----------+  :         |
|  ICookie  |------------------------->|  IObject  |  :         |
+-----------+     |                :   +-----------+  :         |
      |           V                :         |        :         V
      |     +-----------+          :         |        :   +-----------+
      |     |  ICookie  |-------------------------------->|  IObject  |
      |     +-----------+          :         |        :   +-----------+
      V           |                :         V        :         |
+-----------+     |                :   +-----------+  :         |
|  DCookie  |------------------------->|  DObject  |  :         |
+-----------+     |                :   +-----------+  :         |
                  |                :                  :         |
          +-------+-------+        :                  :         |
          |               |        :                  :         |
          V               V        :                  :         V
    +-----------+   +-----------+  :                  :   +-----------+
    |  DCookie  |   |  DCookie  |------------------------>|  DObject  |
    +-----------+   +-----------+  :                  :   +-----------+
                                   :                  :
|
||||
|
||||
In the above illustration, ICookie and IObject represent indices and DCookie
and DObject represent data storage objects. Indices may have representation in
multiple caches, but currently, non-index objects may not. Objects of any type
may also be entirely unrepresented.

As far as the netfs API goes, the netfs is only actually permitted to see
pointers to the cookies. The cookies themselves and any objects attached to
those cookies are hidden from it.
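
The 1:N relationship described above can be pictured with a small userspace
model. This is only a sketch: struct model_cookie, struct model_object and
both helper functions are hypothetical names invented for illustration and are
not the real fscache_cookie or fscache_object definitions. It assumes that an
index cookie may collect one object per cache whilst a data cookie may have at
most one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel structs. */
struct model_object {
	struct model_object *next;	/* next object backing the same cookie */
	int cache_id;			/* which cache holds this object */
};

struct model_cookie {
	int is_index;			/* non-zero for an index, zero for data */
	struct model_object *backing;	/* list of zero or more objects */
};

/* Count the objects currently backing a cookie (0 means "not cached"). */
int model_cookie_nr_objects(const struct model_cookie *cookie)
{
	int n = 0;
	const struct model_object *obj;

	for (obj = cookie->backing; obj; obj = obj->next)
		n++;
	return n;
}

/*
 * Attach an object to a cookie.  Returns 0 on success, or -1 if a data
 * cookie already has an object, since only indices may be represented in
 * several caches at once.
 */
int model_cookie_attach(struct model_cookie *cookie, struct model_object *obj)
{
	if (!cookie->is_index && cookie->backing)
		return -1;
	obj->next = cookie->backing;
	cookie->backing = obj;
	return 0;
}
```

A cookie with an empty list corresponds to the "entirely unrepresented" case
in the text.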


===============================
OBJECT MANAGEMENT STATE MACHINE
===============================

Within FS-Cache, each active object is managed by its own individual state
machine. The state for an object is kept in the fscache_object struct, in
object->state. A cookie may point to a set of objects that are in different
states.

Each state has an action associated with it that is invoked when the machine
wakes up in that state. There are four logical sets of states:

 (1) Preparation: states that wait for the parent objects to become ready. The
     representations are hierarchical, and it is expected that an object must
     be created or accessed with respect to its parent object.

 (2) Initialisation: states that perform lookups in the cache and validate
     what's found and that create on disk any missing metadata.

 (3) Normal running: states that allow netfs operations on objects to proceed
     and that update the state of objects.

 (4) Termination: states that detach objects from their netfs cookies, that
     delete objects from disk, that handle disk and system errors and that free
     up in-memory resources.


In most cases, transitioning between states is in response to signalled events.
When a state has finished processing, it will usually set the mask of events in
which it is interested (object->event_mask) and relinquish the worker thread.
Then when an event is raised (by calling fscache_raise_event()), if the event
is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()).
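
The interaction between the event mask and queueing can be sketched in
userspace as follows. This is a simplified model rather than the kernel code:
struct model_object, the MODEL_EV_* bits and model_raise_event() are
hypothetical, and setting the "queued" field merely stands in for the call to
fscache_enqueue_object().

```c
#include <assert.h>

/* Hypothetical event bits; the real values live in the kernel headers. */
#define MODEL_EV_UPDATE	(1UL << 0)
#define MODEL_EV_ERROR	(1UL << 1)

struct model_object {
	unsigned long event_mask;	/* events the current state reacts to */
	unsigned long events;		/* events raised and not yet consumed */
	int queued;			/* non-zero once queued for a worker */
};

/*
 * Record an event and queue the object only if the current state has asked
 * to hear about it.  Setting "queued" stands in for the enqueue call.
 */
void model_raise_event(struct model_object *obj, unsigned long event)
{
	obj->events |= event;
	if (event & obj->event_mask)
		obj->queued = 1;
}
```

An event that the current state is not interested in stays recorded in
"events" in this model, so that a later state can still notice it once it
widens the mask.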


PROVISION OF CPU TIME
---------------------

The work to be done by the various states is given CPU time by the threads of
the slow work facility (see Documentation/slow-work.txt). This is used in
preference to the workqueue facility because:

 (1) Threads may be completely occupied for very long periods of time by a
     particular work item. These state actions may be doing sequences of
     synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
     getxattr, truncate, unlink, rmdir, rename).

 (2) Threads may do little actual work, but may rather spend a lot of time
     sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
     workqueues don't necessarily have the right numbers of threads.


LOCKING SIMPLIFICATION
----------------------

Only one worker thread may be operating on any particular object's state
machine at once. This simplifies the locking, particularly with respect to
disconnecting the netfs's representation of a cache object (fscache_cookie)
from the cache backend's representation (fscache_object) - which may be
requested from either end.


=================
THE SET OF STATES
=================

The object state machine has a set of states that it can be in. There are
preparation states in which the object sets itself up and waits for its parent
object to transit to a state that allows access to its children:

 (1) State FSCACHE_OBJECT_INIT.

     Initialise the object and wait for the parent object to become active. In
     the cache, it is expected that it will not be possible to look an object
     up from the parent object, until that parent object itself has been looked
     up.

There are initialisation states in which the object sets itself up and accesses
disk for the object metadata:

 (2) State FSCACHE_OBJECT_LOOKING_UP.

     Look up the object on disk, using the parent as a starting point.
     FS-Cache expects the cache backend to probe the cache to see whether this
     object is represented there, and if it is, to see if it's valid (coherency
     management).

     The cache should call fscache_object_lookup_negative() to indicate lookup
     failure for whatever reason, and should call fscache_obtained_object() to
     indicate success.

     At the completion of lookup, FS-Cache will let the netfs go ahead with
     read operations, no matter whether the file is yet cached. If not yet
     cached, read operations will be immediately rejected with ENODATA until
     the first known page is uncached - as up to that point there can be no
     data to be read out of the cache for that file that isn't currently also
     held in the pagecache.

 (3) State FSCACHE_OBJECT_CREATING.

     Create an object on disk, using the parent as a starting point. This
     happens if the lookup failed to find the object, or if the object's
     coherency data indicated what's on disk is out of date. In this state,
     FS-Cache expects the cache to create the object.

     The cache should call fscache_obtained_object() if creation completes
     successfully, fscache_object_lookup_negative() otherwise.

     At the completion of creation, FS-Cache will start processing write
     operations the netfs has queued for an object. If creation failed, the
     write ops will be transparently discarded, and nothing recorded in the
     cache.

There are some normal running states in which the object spends its time
servicing netfs requests:

 (4) State FSCACHE_OBJECT_AVAILABLE.

     A transient state in which pending operations are started, child objects
     are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
     lookup data is freed.

 (5) State FSCACHE_OBJECT_ACTIVE.

     The normal running state. In this state, requests the netfs makes will be
     passed on to the cache.

 (6) State FSCACHE_OBJECT_UPDATING.

     The state machine comes here to update the object in the cache from the
     netfs's records. This involves updating the auxiliary data that is used
     to maintain coherency.

And there are terminal states in which an object cleans itself up, deallocates
memory and potentially deletes stuff from disk:

 (7) State FSCACHE_OBJECT_LC_DYING.

     The object comes here if it is dying because of a lookup or creation
     error. This would be due to a disk error or system error of some sort.
     Temporary data is cleaned up, and the parent is released.

 (8) State FSCACHE_OBJECT_DYING.

     The object comes here if it is dying due to an error, because its parent
     cookie has been relinquished by the netfs or because the cache is being
     withdrawn.

     Any child objects waiting on this one are given CPU time so that they too
     can destroy themselves. This object waits for all its children to go away
     before advancing to the next state.

 (9) State FSCACHE_OBJECT_ABORT_INIT.

     The object comes to this state if it was waiting on its parent in
     FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
     so that the parent may proceed from the FSCACHE_OBJECT_DYING state.

(10) State FSCACHE_OBJECT_RELEASING.
(11) State FSCACHE_OBJECT_RECYCLING.

     The object comes to one of these two states when dying once it is rid of
     all its children, if it is dying because the netfs relinquished its
     cookie. In the first state, the cached data is expected to persist, and
     in the second it will be deleted.

(12) State FSCACHE_OBJECT_WITHDRAWING.

     The object transits to this state if the cache decides it wants to
     withdraw the object from service, perhaps to make space, but also due to
     error or just because the whole cache is being withdrawn.

(13) State FSCACHE_OBJECT_DEAD.

     The object transits to this state when the in-memory object record is
     ready to be deleted. The object processor shouldn't ever see an object in
     this state.
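
As a compact summary, the thirteen states can be mapped onto the four logical
sets named earlier. The sketch below is illustrative only: the state names
come from the list above, but the enum values, enum model_state_set and
model_classify() are assumptions made for this example and do not mirror the
kernel header.

```c
#include <assert.h>

/* State names are taken from the text; the numeric values are arbitrary. */
enum model_object_state {
	FSCACHE_OBJECT_INIT,
	FSCACHE_OBJECT_LOOKING_UP,
	FSCACHE_OBJECT_CREATING,
	FSCACHE_OBJECT_AVAILABLE,
	FSCACHE_OBJECT_ACTIVE,
	FSCACHE_OBJECT_UPDATING,
	FSCACHE_OBJECT_LC_DYING,
	FSCACHE_OBJECT_DYING,
	FSCACHE_OBJECT_ABORT_INIT,
	FSCACHE_OBJECT_RELEASING,
	FSCACHE_OBJECT_RECYCLING,
	FSCACHE_OBJECT_WITHDRAWING,
	FSCACHE_OBJECT_DEAD
};

/* The four logical sets of states (hypothetical names). */
enum model_state_set {
	MODEL_PREPARATION,
	MODEL_INITIALISATION,
	MODEL_RUNNING,
	MODEL_TERMINATION
};

/* Say which logical set a state belongs to, following the lists above. */
enum model_state_set model_classify(enum model_object_state state)
{
	switch (state) {
	case FSCACHE_OBJECT_INIT:
		return MODEL_PREPARATION;
	case FSCACHE_OBJECT_LOOKING_UP:
	case FSCACHE_OBJECT_CREATING:
		return MODEL_INITIALISATION;
	case FSCACHE_OBJECT_AVAILABLE:
	case FSCACHE_OBJECT_ACTIVE:
	case FSCACHE_OBJECT_UPDATING:
		return MODEL_RUNNING;
	default:
		/* LC_DYING, DYING, ABORT_INIT, RELEASING, RECYCLING,
		 * WITHDRAWING and DEAD all tear the object down. */
		return MODEL_TERMINATION;
	}
}
```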


THE SET OF EVENTS
-----------------

There are a number of events that can be raised to an object state machine:

 (*) FSCACHE_OBJECT_EV_UPDATE

     The netfs requested that an object be updated. The state machine will ask
     the cache backend to update the object, and the cache backend will ask the
     netfs for details of the change through its cookie definition ops.

 (*) FSCACHE_OBJECT_EV_CLEARED

     This is signalled in two circumstances:

     (a) when an object's last child object is dropped and

     (b) when the last operation outstanding on an object is completed.

     This is used to proceed from the dying state.

 (*) FSCACHE_OBJECT_EV_ERROR

     This is signalled when an I/O error occurs during the processing of some
     object.

 (*) FSCACHE_OBJECT_EV_RELEASE
 (*) FSCACHE_OBJECT_EV_RETIRE

     These are signalled when the netfs relinquishes a cookie it was using.
     The event selected depends on whether the netfs asks for the backing
     object to be retired (deleted) or retained.

 (*) FSCACHE_OBJECT_EV_WITHDRAW

     This is signalled when the cache backend wants to withdraw an object.
     This means that the object will have to be detached from the netfs's
     cookie.

Because the withdrawing and releasing/retiring events are all handled by the
object state machine, it doesn't matter if there's a collision with both ends
trying to sever the connection at the same time. The state machine can just
pick which one it wants to honour, and that affects the other.
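
One way to picture how a single event is picked when several collide is a
fixed priority order over the pending, unmasked events. The sketch below
assumes, purely for illustration, that the highest-numbered event wins; the
text does not state which order is used, and enum model_event and
model_pick_event() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical event numbers; only their relative order matters here. */
enum model_event {
	MODEL_EV_ERROR,
	MODEL_EV_RELEASE,
	MODEL_EV_RETIRE,
	MODEL_EV_WITHDRAW,
	MODEL_NR_EVENTS
};

/*
 * Pick the single event to honour from those that are both pending and
 * unmasked.  Returns the chosen event number, or -1 if there is none.
 * "Highest number wins" is this sketch's assumption.
 */
int model_pick_event(unsigned long pending, unsigned long mask)
{
	unsigned long live = pending & mask;
	int ev;

	for (ev = MODEL_NR_EVENTS - 1; ev >= 0; ev--)
		if (live & (1UL << ev))
			return ev;
	return -1;
}
```

Because only one worker thread runs an object's state machine at a time, a
choice of this kind needs no extra locking between the netfs end and the cache
end.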
213
Documentation/filesystems/caching/operations.txt
Normal file
@ -0,0 +1,213 @@
================================
ASYNCHRONOUS OPERATIONS HANDLING
================================

By: David Howells <dhowells@redhat.com>

Contents:

 (*) Overview.

 (*) Operation record initialisation.

 (*) Parameters.

 (*) Procedure.

 (*) Asynchronous callback.


========
OVERVIEW
========

FS-Cache has an asynchronous operations handling facility that it uses for its
data storage and retrieval routines. Its operations are represented by
fscache_operation structs, though these are usually embedded into some other
structure.

This facility is available to and expected to be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.

To make use of this facility, <linux/fscache-cache.h> should be #included.


===============================
OPERATION RECORD INITIALISATION
===============================

An operation is recorded in an fscache_operation struct:

        struct fscache_operation {
                union {
                        struct work_struct fast_work;
                        struct slow_work slow_work;
                };
                unsigned long           flags;
                fscache_operation_processor_t processor;
                ...
        };

Someone wanting to issue an operation should allocate something with this
struct embedded in it. They should initialise it by calling:

        void fscache_operation_init(struct fscache_operation *op,
                                    fscache_operation_release_t release);

with the operation to be initialised and the release function to use.

The op->flags member should be set to indicate the CPU time provision and
the exclusivity (see the Parameters section).

The op->fast_work, op->slow_work and op->processor members should be
initialised as appropriate for the CPU time provision (see the Parameters
section).

FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards.
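
The embed, initialise and release pattern can be modelled in userspace as
below. This is a sketch under stated assumptions: struct model_operation,
struct model_read_op and the model_*() functions are hypothetical stand-ins,
and the assumption that initialisation leaves the record holding exactly one
reference is made for the purposes of the example.

```c
#include <assert.h>
#include <stddef.h>

struct model_operation;
typedef void (*model_release_t)(struct model_operation *op);

/* Hypothetical userspace model of an operation record. */
struct model_operation {
	unsigned long flags;		/* CPU time provision and exclusivity */
	int usage;			/* reference count */
	model_release_t release;	/* called when the last ref is put */
};

/* A more specific operation with the generic record embedded in it. */
struct model_read_op {
	struct model_operation op;
	int nr_pages;
};

int model_releases;	/* counts release calls, for demonstration only */

void model_read_op_release(struct model_operation *op)
{
	(void)op;
	model_releases++;
}

/* Start with one reference, held by whoever initialised the record. */
void model_operation_init(struct model_operation *op, model_release_t release)
{
	op->flags = 0;
	op->usage = 1;
	op->release = release;
}

/* Drop a reference and run the release function on the last put. */
void model_put_operation(struct model_operation *op)
{
	if (--op->usage == 0)
		op->release(op);
}
```

In this model, a second holder (such as an operation manager) would bump
"usage", and the release function would only run once both holders had put
their references.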


==========
PARAMETERS
==========

There are a number of parameters that can be set in the operation record's
flags member (op->flags). There are three options for the provision of CPU
time in these operations:

 (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
     may decide it wants to handle an operation itself without deferring it to
     another thread.

     This is, for example, used in read operations for calling readpages() on
     the backing filesystem in CacheFiles. Although readpages() does an
     asynchronous data fetch, the determination of whether pages exist is done
     synchronously - and the netfs does not proceed until this has been
     determined.

     If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
     before submitting the operation, and the operating thread must wait for it
     to be cleared before proceeding:

                wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
                            fscache_wait_bit, TASK_UNINTERRUPTIBLE);


 (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
     will be given to keventd to process. Such an operation is not permitted
     to sleep on I/O.

     This is, for example, used by CacheFiles to copy data from a backing fs
     page to a netfs page after the backing fs has read the page in.

     If this option is used, op->fast_work and op->processor must be
     initialised before submitting the operation:

                INIT_WORK(&op->fast_work, do_some_work);


 (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
     will be given to the slow work facility to process. Such an operation is
     permitted to sleep on I/O.

     This is, for example, used by FS-Cache to handle background writes of
     pages that have just been fetched from a remote server.

     If this option is used, op->slow_work and op->processor must be
     initialised before submitting the operation:

                fscache_operation_init_slow(op, processor)


Furthermore, operations may be one of two types:

 (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
     conjunction with any other operation on the object being operated upon.

     An example of this is the attribute change operation, in which the file
     being written to may need truncation.

 (2) Shareable. Operations of this type may be running simultaneously. It's
     up to the operation implementation to prevent interference with other
     operations running at the same time.
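
The difference between the two types can be illustrated with a small admission
check. This is a userspace sketch, not the operation manager's actual logic:
struct model_object, model_try_start() and model_finish() are hypothetical,
and deferral is reduced to a return value of 0.

```c
#include <assert.h>

/* Hypothetical per-object bookkeeping for this sketch. */
struct model_object {
	int n_running;		/* operations currently in progress */
	int exclusive_running;	/* non-zero if one of them is exclusive */
};

/*
 * Try to start an operation.  Returns 1 if it may run now, 0 if it must be
 * deferred because of conflicting exclusivity.
 */
int model_try_start(struct model_object *obj, int exclusive)
{
	if (obj->exclusive_running)
		return 0;	/* nothing runs alongside an exclusive op */
	if (exclusive && obj->n_running > 0)
		return 0;	/* an exclusive op waits for the others */
	obj->n_running++;
	obj->exclusive_running = exclusive;
	return 1;
}

/* Mark one running operation as finished. */
void model_finish(struct model_object *obj)
{
	obj->n_running--;
	if (obj->n_running == 0)
		obj->exclusive_running = 0;
}
```

Two shareable operations may overlap in this model, whereas an exclusive one
only starts on an idle object and blocks everything else until it finishes.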


=========
PROCEDURE
=========

Operations are used through the following procedure:

 (1) The submitting thread must allocate the operation and initialise it
     itself. Normally this would be part of a more specific structure with the
     generic op embedded within.

 (2) The submitting thread must then submit the operation for processing using
     one of the following two functions:

        int fscache_submit_op(struct fscache_object *object,
                              struct fscache_operation *op);

        int fscache_submit_exclusive_op(struct fscache_object *object,
                                        struct fscache_operation *op);

     The first function should be used to submit non-exclusive ops and the
     second to submit exclusive ones. The caller must still set the
     FSCACHE_OP_EXCLUSIVE flag.

     If successful, both functions will assign the operation to the specified
     object and return 0. -ENOBUFS will be returned if the object specified is
     permanently unavailable.

     The operation manager will defer operations on an object that is still
     undergoing lookup or creation. The operation will also be deferred if an
     operation of conflicting exclusivity is in progress on the object.

     If the operation is asynchronous, the manager will retain a reference to
     it, so the caller should put their reference to it by passing it to:

        void fscache_put_operation(struct fscache_operation *op);

 (3) If the submitting thread wants to do the work itself, and has marked the
     operation with FSCACHE_OP_MYTHREAD, then it should monitor
     FSCACHE_OP_WAITING as described above and check the state of the object if
     necessary (the object might have died whilst the thread was waiting).

     When it has finished doing its processing, it should call
     fscache_put_operation() on it.

 (4) The operation holds an effective lock upon the object, preventing other
     exclusive ops conflicting until it is released. The operation can be
     enqueued for further immediate asynchronous processing by adjusting the
     CPU time provisioning option if necessary, e.g.:

        op->flags &= ~FSCACHE_OP_TYPE;
        op->flags |= FSCACHE_OP_FAST;

     and calling:

        void fscache_enqueue_operation(struct fscache_operation *op)

     This can be used to allow other things to have use of the worker thread
     pools.
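
Changing the CPU time provisioning option in step (4) is a clear-then-set
update of a multi-bit field inside the flags word. The sketch below shows that
update in isolation; the MODEL_OP_* values and model_set_op_type() are
hypothetical and are not the kernel's definitions. The point is that the old
type is masked out and the new type value itself is ORed in, so that unrelated
bits survive.

```c
#include <assert.h>

/* Hypothetical flag layout: a small type field plus independent bits. */
#define MODEL_OP_TYPE		0x000fUL	/* mask covering the type field */
#define MODEL_OP_SLOW		0x0000UL
#define MODEL_OP_FAST		0x0001UL
#define MODEL_OP_MYTHREAD	0x0002UL
#define MODEL_OP_EXCLUSIVE	0x0080UL	/* unrelated bit outside the field */

/* Replace the type field and leave every other bit untouched. */
unsigned long model_set_op_type(unsigned long flags, unsigned long type)
{
	flags &= ~MODEL_OP_TYPE;	/* clear the old type */
	flags |= type;			/* OR in the new one, not its complement */
	return flags;
}
```

ORing in the complement of the type value instead would set almost every bit
in the word, including bits that have nothing to do with the type.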


=====================
ASYNCHRONOUS CALLBACK
=====================

When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. The processor should then
get at the container struct by using container_of():

        static void fscache_write_op(struct fscache_operation *_op)
        {
                struct fscache_storage *op =
                        container_of(_op, struct fscache_storage, op);
        ...
        }

The caller holds a reference on the operation, and will invoke
fscache_put_operation() when the processor function returns. The processor
function is at liberty to call fscache_enqueue_operation() or to take extra
references.
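
The callback pattern can be tried out in userspace with an offsetof()-based
equivalent of container_of(). Everything named model_* below is hypothetical
and exists only for this sketch; the reference drop in model_worker_run()
stands in for the put that the caller performs once the processor returns.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_operation {
	int usage;					/* reference count */
	void (*processor)(struct model_operation *op);	/* work function */
};

/* Hypothetical container; the op is deliberately not the first member. */
struct model_storage {
	int pages_written;
	struct model_operation op;
};

/* Processor: recover the container from the embedded op and do the work. */
void model_write_op(struct model_operation *_op)
{
	struct model_storage *op =
		model_container_of(_op, struct model_storage, op);

	op->pages_written++;
}

/* What a worker does: run the processor, then drop its own reference. */
void model_worker_run(struct model_operation *op)
{
	op->processor(op);
	op->usage--;
}
```

Placing the embedded op after another member shows that the pointer
arithmetic, and not a simple cast, is what recovers the container.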