linux-kernel-test/fs
Dmitry Monakhov 28a535f9a0 ext4: completed_io locking cleanup
Current unwritten extent conversion state-machine is very fuzzy.
- For unknown reason it performs conversion under i_mutex. What for?
  My diagnosis:
  We already protect extent tree with i_data_sem, truncate and punch_hole
  should wait for DIO, so the only data we have to protect is end_io->flags
  modification, but only flush_completed_IO and end_io_work modified this
  flags and we can serialize them via i_completed_io_lock.

  Currently all these games with mutex_trylock result in the following deadlock
   truncate:                          kworker:
    ext4_setattr                       ext4_end_io_work
    mutex_lock(i_mutex)
    inode_dio_wait(inode)  ->BLOCK
                             DEADLOCK<- mutex_trylock()
                                        inode_dio_done()
  #TEST_CASE1_BEGIN
  MNT=/mnt_scrach
  unlink $MNT/file
  fallocate -l $((1024*1024*1024)) $MNT/file
  aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
  sleep 2
  truncate -s 0 $MNT/file
  #TEST_CASE1_END

Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286

This patch makes state machine simple and clean:

(1) xxx_end_io schedule final extent conversion simply by calling
    ext4_add_complete_io(), which append it to ei->i_completed_io_list
    NOTE1: because of (2A) work should be queued only if
    ->i_completed_io_list was empty, otherwise the work is scheduled already.

(2) ext4_flush_completed_IO is responsible for handling all pending
    end_io from ei->i_completed_io_list
    Flushing sequence consists of following stages:
    A) LOCKED: Atomically drain completed_io_list to local_list
    B) Perform extents conversion
    C) LOCKED: move converted io's to to_free list for final deletion
       	     This logic depends on context which we was called from.
    D) Final end_io context destruction
    NOTE1: i_mutex is no longer required because end_io->flags modification
    is protected by ei->ext4_complete_io_lock

Full list of changes:
- Move all completion end_io related routines to page-io.c in order to improve
  logic locality
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
- remove EXT4_IO_END_FSYNC
- Improve SMP scalability by removing useless i_mutex which does not
  protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Rename ext4_end_io_nolock to end4_end_io and make it static
- Check flush completion status to ext4_ext_punch_hole(). Because it is
  not good idea to punch blocks from corrupted inode.

Changes since V3 (in request to Jan's comments):
  Fall back to active flush_completed_IO() approach in order to prevent
  performance issues with nolocked DIO reads.
Changes since V2:
  Fix use-after-free caused by race truncate vs end_io_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-29 00:14:55 -04:00
..
9p 9p: Push file_update_time() into v9fs_vm_page_mkwrite() 2012-07-31 01:02:46 +04:00
adfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
affs affs: use memweight() 2012-07-30 17:25:16 -07:00
afs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
autofs4 switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
befs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
bfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
cachefiles fs: cachefiles: add support for large files in filesystem caching 2012-07-30 17:25:21 -07:00
ceph ceph: simplify+fix atomic_open 2012-08-02 09:11:19 -07:00
cifs CIFS: Add SMB2 support for rmdir 2012-07-27 15:17:50 -05:00
coda don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
configfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
cramfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
debugfs Driver core merge for 3.6-rc1 2012-07-26 11:25:33 -07:00
devpts VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
dlm dlm: fix missing dir remove 2012-07-16 14:24:43 -05:00
ecryptfs - Fixes a bug when the lower filesystem mount options include 'acl', but the 2012-08-02 10:56:34 -07:00
efs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
exofs Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-07-23 12:27:27 -07:00
exportfs switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
ext2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
ext3 ext3: use memweight() 2012-07-30 17:25:16 -07:00
ext4 ext4: completed_io locking cleanup 2012-09-29 00:14:55 -04:00
fat Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
freevxfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
fscache
fuse fuse: Convert to new freezing mechanism 2012-07-31 09:45:50 +04:00
gfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
hfs hfs: get rid of hfs_sync_super 2012-07-22 23:58:09 +04:00
hfsplus hfsplus: use -ENOMEM when kzalloc() fails 2012-07-30 17:25:19 -07:00
hostfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
hpfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
hppfs switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
hugetlbfs hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages 2012-07-31 18:42:40 -07:00
isofs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-07-24 17:40:44 -07:00
jbd jbd: Check return value of blkdev_issue_flush() 2012-07-09 23:38:36 +02:00
jbd2 jbd2: fix assertion failure in commit code due to lacking transaction credits 2012-09-26 23:11:13 -04:00
jffs2 don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
jfs don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
lockd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
logfs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
minix minixfs: fix block limit check 2012-07-30 17:25:19 -07:00
ncpfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
nfs Merge branch 'akpm' (Andrew's patch-bomb) 2012-07-31 19:25:39 -07:00
nfs_common
nfsd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
nilfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
nls nls: fix (and rename) mac NLS table files and config options 2012-06-01 19:51:22 -07:00
notify switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
ntfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
ocfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
omfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
openpromfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
proc proc: do not allow negative offsets on /proc/<pid>/environ 2012-07-30 17:25:20 -07:00
pstore pstore/ram: Make tracing log versioned 2012-07-17 16:48:09 -07:00
qnx4 qnx4fs: use memweight() 2012-07-30 17:25:16 -07:00
qnx6 stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-07-24 17:40:44 -07:00
ramfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
reiserfs don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
romfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
squashfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
sysfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
sysv fs/sysv: stop using write_super and s_dirt 2012-07-22 23:58:12 +04:00
ubifs * Added another debugfs knob for forcing UBIFS R/O mode without flushing caches 2012-07-23 15:50:52 -07:00
udf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2012-07-24 17:40:44 -07:00
ufs fs/ufs: get rid of write_super 2012-07-22 23:58:16 +04:00
xfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
aio.c aio: now fput() is OK from interrupt context; get rid of manual delayed __fput() 2012-07-22 23:57:59 +04:00
anon_inodes.c
attr.c notify_change(): check that i_mutex is held 2012-07-14 16:35:42 +04:00
bad_inode.c don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
binfmt_aout.c VM: add "vm_mmap()" helper function 2012-04-20 17:29:13 -07:00
binfmt_elf_fdpic.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-05-23 17:42:39 -07:00
binfmt_elf.c binfmt_elf: switch elf_map() to vm_mmap/vm_munmap 2012-05-30 21:04:55 -04:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: use vm_munmap, we are missing ->mmap_sem there 2012-05-30 21:04:56 -04:00
binfmt_misc.c vfs: Rename end_writeback() to clear_inode() 2012-05-06 13:43:41 +08:00
binfmt_script.c
binfmt_som.c VM: add "vm_mmap()" helper function 2012-04-20 17:29:13 -07:00
bio-integrity.c
bio.c Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block 2012-05-30 08:52:42 -07:00
block_dev.c vfs: Create function for iterating over block devices 2012-07-22 23:58:45 +04:00
buffer.c fs: Protect write paths by sb_start_write - sb_end_write 2012-07-31 09:45:47 +04:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2012-06-01 11:53:44 -07:00
dcache.c __d_unalias() should refuse to move mountpoints 2012-07-14 16:35:15 +04:00
dcookies.c
direct-io.c fs/direct-io.c: adjust suspicious bit operation 2012-07-14 16:32:46 +04:00
drop_caches.c
eventfd.c eventfd: change int to __u64 in eventfd_signal() 2012-05-31 17:49:32 -07:00
eventpoll.c PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND 2012-07-17 21:37:27 +02:00
exec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
fcntl.c c/r: fcntl: add F_GETOWNER_UIDS option 2012-07-30 17:25:21 -07:00
fhandle.c
fifo.c fifo: Do not restart open() if it already found a partner 2012-07-16 08:33:14 -07:00
file_table.c fs: Add freezing handling to mnt_want_write() / mnt_drop_write() 2012-07-31 09:40:38 +04:00
file.c
filesystems.c
fs_struct.c get rid of ->mnt_longterm 2012-07-14 16:32:47 +04:00
fs-writeback.c ext4: fix potential deadlock in ext4_nonda_switch() 2012-09-19 22:42:36 -04:00
generic_acl.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
internal.h fs: Add freezing handling to mnt_want_write() / mnt_drop_write() 2012-07-31 09:40:38 +04:00
ioctl.c
ioprio.c Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block 2012-05-30 08:52:42 -07:00
Kconfig
Kconfig.binfmt C6X: add support to build with BINFMT_ELF_FDPIC 2012-05-15 09:17:34 -04:00
libfs.c VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
locks.c locks: remove unused lm_release_private 2012-08-01 09:01:46 -07:00
Makefile
mbcache.c
mount.h get rid of magic in proc_namespace.c 2012-07-14 16:32:48 +04:00
mpage.c
namei.c fs: Push mnt_want_write() outside of i_mutex 2012-07-31 01:02:49 +04:00
namespace.c fs: Add freezing handling to mnt_want_write() / mnt_drop_write() 2012-07-31 09:40:38 +04:00
no-block.c
open.c fs: Protect write paths by sb_start_write - sb_end_write 2012-07-31 09:45:47 +04:00
pipe.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
pnode.c VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors 2012-07-14 16:37:27 +04:00
pnode.h
posix_acl.c
proc_namespace.c get rid of magic in proc_namespace.c 2012-07-14 16:32:48 +04:00
read_write.c vfs: allow custom EOF in generic_file_llseek code 2012-07-23 00:00:15 +04:00
read_write.h
readdir.c switch readdir/getdents to fget_light/fput_light 2012-05-29 23:28:29 -04:00
select.c posix_types.h: Cleanup stale __NFDBITS and related definitions 2012-07-26 13:36:43 -07:00
seq_file.c seq_file: Add seq_vprintf function and export it 2012-06-11 13:16:35 +01:00
signalfd.c switch signalfd4() to fget_light/fput_light 2012-05-29 23:28:30 -04:00
splice.c fs: Protect write paths by sb_start_write - sb_end_write 2012-07-31 09:45:47 +04:00
stack.c
stat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-05-23 17:42:39 -07:00
statfs.c switch statfs to fget_light/fput_light 2012-05-29 23:28:31 -04:00
super.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2012-08-01 10:26:23 -07:00
sync.c vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes 2012-07-22 23:59:01 +04:00
timerfd.c
utimes.c switch utimes() to fget_light/fput_light 2012-05-29 23:28:32 -04:00
xattr_acl.c
xattr.c fs/xattr.c:getxattr(): improve handling of allocation failures 2012-07-30 17:25:11 -07:00