linux-kernel-test/fs
Jerome Marchand 09e099d4ba block: fix accounting bug on cross partition merges
/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount to struct hd_struct to keep the partition in
memory as long as users exist. We use kref_test_and_get() to ensure
we don't add a reference to a partition which is going away.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-05 16:57:38 +01:00
..
9p convert v9fs 2010-10-29 04:16:38 -04:00
adfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
affs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
afs convert afs 2010-10-29 04:17:13 -04:00
autofs4 convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
befs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
bfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
btrfs block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
cachefiles llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
ceph convert ceph 2010-10-29 04:17:18 -04:00
cifs cifs: fix a memleak in cifs_setattr_nounix() 2010-11-09 15:17:53 +00:00
coda convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
configfs convert get_sb_single() users 2010-10-29 04:16:28 -04:00
cramfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
debugfs convert get_sb_single() users 2010-10-29 04:16:28 -04:00
devpts convert get_sb_single() users 2010-10-29 04:16:28 -04:00
dlm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm 2010-10-22 17:33:16 -07:00
ecryptfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 2010-10-29 14:15:12 -07:00
efs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
exofs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
exportfs exportfs: use dget_parent 2010-10-25 21:26:13 -04:00
ext2 new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
ext3 block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
ext4 block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
fat new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
freevxfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
fscache
fuse convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
gfs2 Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core 2010-11-27 19:49:18 +01:00
hfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
hfsplus new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
hostfs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
hpfs Merge branches 'irq-core-for-linus' and 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-10-31 20:40:24 -04:00
hppfs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
hugetlbfs hugetlbfs: lessen the impact of a deprecation warning 2010-11-12 07:55:32 -08:00
isofs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-10-27 20:13:18 -07:00
jbd2 jbd2: Convert jbd2_slab_create_sem to mutex 2010-10-30 12:12:50 +02:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2010-10-30 08:31:35 -07:00
jfs block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
lockd lockd: fix nlmsvc_notify_blocked locking 2010-10-27 21:39:50 +02:00
logfs block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
minix new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
ncpfs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
nfs locks: let the caller free file_lock on ->setlease failure 2010-10-31 06:35:15 -07:00
nfs_common
nfsd Fix compile warnings due to missing removal of a 'ret' variable 2010-12-20 09:15:19 +01:00
nilfs2 block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
nls
notify make fanotify_read() restartable across signals 2010-10-30 14:07:35 -04:00
ntfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
ocfs2 Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core 2010-11-27 19:49:18 +01:00
omfs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
openpromfs sparc: fix openpromfs compile 2010-11-08 14:29:39 -08:00
partitions block: fix accounting bug on cross partition merges 2011-01-05 16:57:38 +01:00
proc switch procfs to ->mount() 2010-10-29 04:17:01 -04:00
qnx4 new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
quota quota: Fix possible oops in __dquot_initialize() 2010-10-28 01:30:06 +02:00
ramfs convert get_sb_nodev() users 2010-10-29 04:16:31 -04:00
reiserfs block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
romfs convert get_sb_mtd() users to ->mount() 2010-10-29 04:16:26 -04:00
squashfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus 2010-10-29 08:48:58 -07:00
sysfs convert sysfs 2010-10-29 04:17:08 -04:00
sysv new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
ubifs convert ubifs 2010-10-29 04:16:36 -04:00
udf new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
ufs new helper: mount_bdev() 2010-10-29 04:16:13 -04:00
xfs Merge branch 'cleanup-bd_claim' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into for-2.6.38/core 2010-11-27 19:49:18 +01:00
aio.c new helper: ihold() 2010-10-25 21:26:11 -04:00
anon_inodes.c convert get_sb_pseudo() users 2010-10-29 04:16:33 -04:00
attr.c
bad_inode.c
binfmt_aout.c Don't dump task struct in a.out core-dumps 2010-10-14 10:57:40 -07:00
binfmt_elf_fdpic.c
binfmt_elf.c ARM: 6342/1: fix ASLR of PIE executables 2010-10-08 10:02:53 +01:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c convert get_sb_single() users 2010-10-29 04:16:28 -04:00
binfmt_script.c
binfmt_som.c
bio-integrity.c bio-integrity: mark kintegrityd_wq highpri and CPU intensive 2011-01-03 15:01:48 +01:00
bio.c bio: take care not overflow page count when mapping/copying user data 2010-11-10 14:40:43 +01:00
block_dev.c block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
buffer.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-10-26 17:58:44 -07:00
char_dev.c fs/block: type signature of major_to_index(int) to major_to_index(unsigned) 2010-12-17 09:00:18 +01:00
compat_binfmt_elf.c
compat_ioctl.c Merge 'staging-next' to Linus's tree 2010-10-28 09:44:56 -07:00
compat.c fs/compat.c: fix build on MIPS/s390 2010-10-30 08:19:35 -07:00
dcache.c fs: use RCU read side protection in d_validate 2010-10-25 21:26:13 -04:00
dcookies.c
direct-io.c fs/direct-io.c: fix truncation error in dio_complete() return 2010-10-26 16:52:13 -07:00
drop_caches.c
eventfd.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
eventpoll.c epoll: make epoll_wait() use the hrtimer range feature 2010-10-27 18:03:18 -07:00
exec.c exec: don't turn PF_KTHREAD off when a target command was not found 2010-10-27 18:03:13 -07:00
fcntl.c fasync: Fix placement of FASYNC flag comment 2010-10-27 18:17:02 -07:00
fifo.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
file_table.c fs: allow for more than 2^31 files 2010-10-26 16:52:15 -07:00
file.c
filesystems.c
fs_struct.c
fs-writeback.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2010-10-30 09:05:48 -07:00
generic_acl.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-10-26 17:58:44 -07:00
internal.h braino in internal.h 2010-10-29 05:49:13 -04:00
ioctl.c fs: Add FITRIM ioctl 2010-10-27 21:30:11 -04:00
ioprio.c ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2) 2010-11-10 14:40:53 +01:00
Kconfig Merge 'staging-next' to Linus's tree 2010-10-28 09:44:56 -07:00
Kconfig.binfmt coredump: default CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y 2010-10-27 18:03:12 -07:00
libfs.c convert get_sb_pseudo() users 2010-10-29 04:16:33 -04:00
locks.c locks: remove dead lease error-handling code 2010-11-10 14:31:29 -05:00
Makefile Merge 'staging-next' to Linus's tree 2010-10-28 09:44:56 -07:00
mbcache.c
mpage.c
namei.c fix open/umount race 2010-10-29 04:14:56 -04:00
namespace.c vfs: fix infinite loop caused by clone_mnt race 2010-10-25 21:24:16 -04:00
nfsctl.c
no-block.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
open.c fix open/umount race 2010-10-29 04:14:56 -04:00
pipe.c convert get_sb_pseudo() users 2010-10-29 04:16:33 -04:00
pnode.c
pnode.h
posix_acl.c
read_write.c readv/writev: do the same MAX_RW_COUNT truncation that read/write does 2010-10-29 10:36:49 -07:00
read_write.h
readdir.c
select.c epoll: make epoll_wait() use the hrtimer range feature 2010-10-27 18:03:18 -07:00
seq_file.c fs: take dcache_lock inside __d_path 2010-10-25 21:26:12 -04:00
signalfd.c Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2010-10-26 10:13:10 -07:00
splice.c fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors 2010-12-17 08:56:44 +01:00
stack.c
stat.c
statfs.c
super.c block: clean up blkdev_get() wrappers and their users 2010-11-13 11:55:18 +01:00
sync.c
timerfd.c llseek: automatically add .llseek fop 2010-10-15 15:53:27 +02:00
utimes.c
xattr_acl.c
xattr.c