[PATCH] madvise MADV_DONTFORK/MADV_DOFORK

Currently, copy-on-write may change the physical address of a page even if the
user requested that the page is pinned in memory (either by mlock or by
get_user_pages).  This happens if the process forks meanwhile, and the parent
writes to that page.  As a result, the page is orphaned: in case of
get_user_pages, the application will never see any data hardware DMA's into
this page after the COW.  In case of mlock'd memory, the parent is not getting
the realtime/security benefits of mlock.

In particular, this affects the Infiniband modules which do DMA from and into
user pages all the time.

This patch adds madvise options to control whether memory range is inherited
across fork.  Useful e.g.  for when hardware is doing DMA from/into these
pages.  Could also be useful to an application wanting to speed up its forks
by cutting large areas out of consideration.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This commit is contained in:
Michael S. Tsirkin
2006-02-14 13:53:08 -08:00
committed by Linus Torvalds
parent 8861da31e3
commit f822566165
21 changed files with 57 additions and 4 deletions

View File

@ -22,16 +22,23 @@ static long madvise_behavior(struct vm_area_struct * vma,
struct mm_struct * mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
int new_flags = vma->vm_flags & ~VM_READHINTMASK;
int new_flags = vma->vm_flags;
switch (behavior) {
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
case MADV_SEQUENTIAL:
new_flags |= VM_SEQ_READ;
new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
break;
case MADV_RANDOM:
new_flags |= VM_RAND_READ;
new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
break;
default:
case MADV_DONTFORK:
new_flags |= VM_DONTCOPY;
break;
case MADV_DOFORK:
new_flags &= ~VM_DONTCOPY;
break;
}
@ -177,6 +184,12 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
long error;
switch (behavior) {
case MADV_DOFORK:
if (vma->vm_flags & VM_IO) {
error = -EINVAL;
break;
}
case MADV_DONTFORK:
case MADV_NORMAL:
case MADV_SEQUENTIAL:
case MADV_RANDOM: