Commit Graph

44551 Commits

Author SHA1 Message Date
David Sterba
58d7bbf81f btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
7ab19625a9 btrfs: add write protection to SET_FEATURES ioctl
Perform the want_write check if we get far enough to do any writes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
48b3b9d401 btrfs: fix lock dep warning move scratch super outside of chunk_mutex
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.

To reproduce:
  (fresh boot)
  mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
  mount /dev/sdc /btrfs
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  (get devmgt from https://github.com/asj/devmgt.git)
  devmgt detach /dev/sde
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  sync
  btrfs replace start -Brf 3 /dev/sdf /btrfs <--
  devmgt attach host7

======================================================
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---------------------------------------------------

btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0

but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]

which lock already depends on the new lock.

Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
Possible unsafe locking scenario:
CPU0				CPU1
----				----
lock(&fs_info->chunk_mutex);
				lock(&fs_devs->device_list_mutex);
				lock(&fs_info->chunk_mutex);
lock(sb_writers);

*** DEADLOCK ***

-> #0 (sb_writers){.+.+.+}:
[<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
[<ffffffff810e707e>] lock_acquire+0xbe/0x210
[<ffffffff810df49a>] percpu_down_read+0x4a/0xa0
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
[<ffffffff81265534>] mnt_want_write+0x24/0x50
[<ffffffff812508a2>] path_openat+0x952/0x1190
[<ffffffff81252451>] do_filp_open+0x91/0x100
[<ffffffff8123f5cc>] file_open_name+0xfc/0x140
[<ffffffff8123f643>] filp_open+0x33/0x60
[<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
[<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Ashish Samant
2473114981 btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
page_index should be checked with the same to trigger BUG_ON().

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Josef Bacik
e042d1ec44 Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Liu Bo
3d8da67817 Btrfs: fix divide error upon chunk's stripe_len
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
 __btrfs_map_block().

This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
ee17fc8005 btrfs: sysfs: protect reading label by lock
If the label setting ioctl races with sysfs label handler, we could get
mixed result in the output, part old part new. We should either get the
old or new label. The chances to hit this race are low.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
66ac9fe7ba btrfs: add check to sysfs handler of label
Add a sanity check for the fs_info as we will dereference it, similar to
what the 'store features' handler does.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
ee6111386a btrfs: add read-only check to sysfs handler of features
We don't want to trigger the change on a read-only filesystem, similar
to what the label handler does.

Signed-off-by: David Sterba <dsterba@suse.cz>
2016-05-06 15:22:49 +02:00
David Sterba
e6c11f9a46 btrfs: reuse existing variable in scrub_stripe, reduce stack usage
The key variable occupies 17 bytes, the key_start is used once, we can
simply reuse existing 'key' for that purpose. As the key is not a simple
type, compiler doest not do it on itself.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
49a3c4d9b6 btrfs: use dynamic allocation for root item in create_subvol
The size of root item is more than 400 bytes, which is quite a lot of
stack space. As we do IO from inside the subvolume ioctls, we should
keep the stack usage low in case the filesystem is on top of other
layers (NFS, device mapper, iscsi, etc).

Reviewed-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
153519559a btrfs: clone: use vmalloc only as fallback for nodesize bufer
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
2f91306a37 btrfs: send: use vmalloc only as fallback for clone_sources_tmp
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
c03d01f340 btrfs: send: use vmalloc only as fallback for clone_roots
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
e55d1153db btrfs: send: use temporary variable to store allocation size
We're going to use the argument multiple times later.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
eb5b75fe2e btrfs: send: use vmalloc only as fallback for read_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
6ff48ce06b btrfs: send: use vmalloc only as fallback for send_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
779bf3fefa btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch  + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.

Reported issue:

[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234]  Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234]        CPU0                    CPU1
[ 5375.719235]        ----                    ----
[ 5375.719236]   lock(&fs_devs->device_list_mutex);
[ 5375.719238]                                lock(namespace_sem);
[ 5375.719239]                                lock(&fs_devs->device_list_mutex);
[ 5375.719241]   lock(sb_writers);
[ 5375.719241]
[ 5375.719241]  *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Dan Carpenter
f5ecec3ce2 btrfs: send: silence an integer overflow warning
The "sizeof(*arg->clone_sources) * arg->clone_sources_count" expression
can overflow.  It causes several static checker warnings.  It's all
under CAP_SYS_ADMIN so it's not that serious but lets silence the
warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt
41b34accb2 btrfs: avoid overflowing f_bfree
Since mixed block groups accounting isn't byte-accurate and f_bree is an
unsigned integer, it could overflow. Avoid this.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Suggested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt
ae02d1bd07 btrfs: fix mixed block count of available space
Metadata for mixed block is already accounted in total data and should not
be counted as part of the free metadata space.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=114281
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Austin S. Hemmelgarn
88be159c90 btrfs: allow balancing to dup with multi-device
Currently, we don't allow the user to try and rebalance to a dup profile
on a multi-device filesystem.  In most cases, this is a perfectly sensible
restriction as raid1 uses the same amount of space and provides better
protection.

However, when reshaping a multi-device filesystem down to a single device
filesystem, this requires the user to convert metadata and system chunks
to single profile before deleting devices, and then convert again to dup,
which leaves a period of time where metadata integrity is reduced.

This patch removes the single-device-only restriction from converting to
dup profile to remove this potential data integrity reduction.

Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba
2355ac8495 btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:58:00 +02:00
David Sterba
58409edd2d btrfs: kill unused writepage_io_hook callback
It seems to be long time unused, since 2008 and
6885f308b5 ("Btrfs: Misc 2.6.25 updates").

Propagating the removal touches some code but has no functional effect.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:57:57 +02:00
Mathias Krause
8148a73c99 proc: prevent accessing /proc/<PID>/environ until it's ready
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.

Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero.  It is, apparently, intentionally set last in create_*_tables().

This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.

The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.

Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
James Morris
0250abcd72 Merge tag 'keys-next-20160505' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into next 2016-05-06 09:29:00 +10:00
Linus Torvalds
c5e0666c5a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns fix from Eric Biederman:
 "This contains just a single fix for a nasty oops"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  propogate_mnt: Handle the first propogated copy being a slave
2016-05-05 08:41:57 -07:00
Eric W. Biederman
5ec0811d30 propogate_mnt: Handle the first propogated copy being a slave
When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>]  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38  EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS:  00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
>  ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
>  ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
>  0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
>  [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
>  [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
>  [<ffffffff811f1ec3>] graft_tree+0x63/0x70
>  [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
>  [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
>  [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
>  [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
>  [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
>  [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
>  RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

Cc: stable@vger.kernel.org
fixes: f2ebb3a921 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-05-05 09:54:45 -05:00
Al Viro
6a480a7842 ecryptfs: fix handling of directory opening
First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
Moreover, assigning ->f_pos and assuming that everything will work is
blatantly broken - try that with e.g. tmpfs as underlying layer and watch
the fireworks.  There may be a non-trivial amount of state associated with
current IO position, well beyond the numeric offset.  Using the single
struct file associated with underlying inode is really not a good idea;
we ought to open one for each ecryptfs directory struct file.

Additionally, file_operations both for directories and non-directories are
full of pointless methods; non-directories should *not* have ->iterate(),
directories should not have ->flush(), ->fasync() and ->splice_read().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-04 14:04:13 -04:00
Anand Jain
88acff64c6 btrfs: cleanup assigning next active device with a check
Creates helper fucntion as needed by the device delete
and replace operations. Also now it checks if the next
device being assigned is an active device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 10:41:08 +02:00
Anand Jain
8ed01abe7d btrfs: s_bdev is not null after missing replace
Yauhen reported in the ML that s_bdev is null at mount, and
s_bdev gets updated to some device when missing device is
replaced, as because bdev is null for missing device, things
gets matched up. Fix this by checking if s_bdev is set. I
didn't want to completely remove updating s_bdev because
the future multi device support at vfs layer may need it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 09:52:44 +02:00
David S. Miller
cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Linus Torvalds
610603a520 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "Fix a regression and update the MAINTAINERS entry for fuse"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: update mailing list in MAINTAINERS
  fuse: Fix return value from fuse_get_user_pages()
2016-05-03 14:23:58 -07:00
Chao Yu
3f8ab27085 f2fs: factor out fsync inode entry operations
Factor out fsync inode entry operations into {add,del}_fsync_inode.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:07 -07:00
Chao Yu
c81ced05b2 f2fs: fix to clear page private flag
Commit 28bc106b23 ("f2fs: support revoking atomic written pages")
forgot to clear page private flag correctly, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:06 -07:00
Chao Yu
23dc974eed f2fs: fix to clear private data in page
Private data in page should be removed during ->releasepage or
->invalidatepage, otherwise garbage data would be remained in that page.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:05 -07:00
Al Viro
9ac3d3e846 nfs: switch to ->iterate_shared()
aside of the usual care about seeding dcache from readdir, we need
to be careful about the pagecache evictions here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:53 -04:00
Al Viro
9cf843e3f4 lookup_open(): lock the parent shared unless O_CREAT is given
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:17 -04:00
Al Viro
6fbd07146d lookup_open(): put the dentry fed to ->lookup() or ->atomic_open() into in-lookup hash
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:16 -04:00
Al Viro
12fa5e2404 lookup_open(): expand the call of real_lookup()
... and lose the duplicate IS_DEADDIR() - we'd already checked that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:16 -04:00
Al Viro
384f26e28f atomic_open(): reorder and clean up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:15 -04:00
Al Viro
1643b43fbd lookup_open(): lift the "fallback to !O_CREAT" logics from atomic_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:15 -04:00
Al Viro
b3d58eaffb atomic_open(): be paranoid about may_open() return value
It should never return positives; however, with Linux S&M crowd
involved, no bogosity is impossible.  Results would be unpleasant...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:14 -04:00
Al Viro
0fb1ea0933 atomic_open(): delay open_to_namei_flags() until the method call
nobody else needs that transformation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:14 -04:00
Al Viro
fe9ec8291f do_last(): take fput() on error after opening to out:
make it conditional on *opened & FILE_OPENED; in addition to getting
rid of exit_fput: thing, it simplifies atomic_open() cleanup on
may_open() failure.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:13 -04:00
Al Viro
47f9dbd387 do_last(): get rid of duplicate ELOOP check
may_open() will catch it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:13 -04:00
Al Viro
55db2fd936 atomic_open(): massage the create_error logics a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:12 -04:00
Al Viro
9d0728e16e atomic_open(): consolidate "overridden ENOENT" in open-yourself cases
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:12 -04:00
Al Viro
5249e411b4 atomic_open(): don't bother with EEXIST check - it's done in do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:11 -04:00
Al Viro
df889b3631 Merge branch 'for-linus' into work.lookups 2016-05-02 19:49:46 -04:00
Al Viro
ce8644fcad lookup_open(): expand the call of vfs_create()
Lift IS_DEADDIR handling up into the part common with atomic_open(),
remove it from the latter.  Collapse permission checks into the
call of may_o_create(), getting it closer to atomic_open() case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:33 -04:00
Al Viro
6ac087099e path_openat(): take O_PATH handling out of do_last()
do_last() and lookup_open() simpler that way and so does O_PATH
itself.  As it bloody well should: we find what the pathname
resolves to, same way as in stat() et.al. and associate it with
FMODE_PATH struct file.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:33 -04:00
Al Viro
3b0a3c1ac1 simple local filesystems: switch to ->iterate_shared()
no changes needed (XFS isn't simple, but it has the same parallelism
in the interesting parts exercised from CXFS).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:32 -04:00
Al Viro
4e82901cd6 dcache_{readdir,dir_lseek}() users: switch to ->iterate_shared
no need to lock directory in dcache_dir_lseek(), while we are
at it - per-struct file exclusion is enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:32 -04:00
Al Viro
3125d2650c cifs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:31 -04:00
Al Viro
d9b3dbdcfd fuse: switch to ->iterate_shared()
Switch dcache pre-seeding on readdir to d_alloc_parallel();
nothing else is needed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:31 -04:00
Al Viro
f50752eaa0 switch all procfs directories ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Al Viro
76aab3ab61 proc_sys_fill_cache(): switch to d_alloc_parallel()
make it usable with directory locked shared

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Al Viro
3781764b5c proc_fill_cache(): switch to d_alloc_parallel()
... making it usable with directory locked shared

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:29 -04:00
Al Viro
6192269444 introduce a parallel variant of ->iterate()
New method: ->iterate_shared().  Same arguments as in ->iterate(),
called with the directory locked only shared.  Once all filesystems
switch, the old one will be gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:29 -04:00
Al Viro
63b6df1413 give readdir(2)/getdents(2)/etc. uniform exclusion with lseek()
same as read() on regular files has, and for the same reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Al Viro
9902af79c0 parallel lookups: actual switch to rwsem
ta-da!

The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.

lockdep side also might need more work

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Al Viro
d9171b9345 parallel lookups machinery, part 4 (and last)
If we *do* run into an in-lookup match, we need to wait for it to
cease being in-lookup.  Fortunately, we do have unused space in
in-lookup dentries - d_lru is never looked at until it stops being
in-lookup.

So we can stash a pointer to wait_queue_head from stack frame of
the caller of ->lookup().  Some precautions are needed while
waiting, but it's not that hard - we do hold a reference to dentry
we are waiting for, so it can't go away.  If it's found to be
in-lookup the wait_queue_head is still alive and will remain so
at least while ->d_lock is held.  Moreover, the condition we
are waiting for becomes true at the same point where everything
on that wq gets woken up, so we can just add ourselves to the
queue once.

d_alloc_parallel() gets a pointer to wait_queue_head_t from its
caller; lookup_slow() adjusted, d_add_ci() taught to use
d_alloc_parallel() if the dentry passed to it happens to be
in-lookup one (i.e. if it's been called from the parallel lookup).

That's pretty much it - all that remains is to switch ->i_mutex
to rwsem and have lookup_slow() take it shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:27 -04:00
Al Viro
94bdd655ca parallel lookups machinery, part 3
We will need to be able to check if there is an in-lookup
dentry with matching parent/name.  Right now it's impossible,
but as soon as start locking directories shared such beasts
will appear.

Add a secondary hash for locating those.  Hash chains go through
the same space where d_alias will be once it's not in-lookup anymore.
Search is done under the same bitlock we use for modifications -
with the primary hash we can rely on d_rehash() into the wrong
chain being the worst that could happen, but here the pointers are
buggered once it's removed from the chain.  On the other hand,
the chains are not going to be long and normally we'll end up
adding to the chain anyway.  That allows us to avoid bothering with
->d_lock when doing the comparisons - everything is stable until
removed from chain.

New helper: d_alloc_parallel().  Right now it allocates, verifies
that no hashed and in-lookup matches exist and adds to in-lookup
hash.

Returns ERR_PTR() for error, hashed match (in the unlikely case it's
been found) or new dentry.  In-lookup matches trigger BUG() for
now; that will change in the next commit when we introduce waiting
for ongoing lookup to finish.  Note that in-lookup matches won't be
possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().

Again, these commits are separated only for making it easier to
review.  All this machinery will start doing something useful only
when we go for shared locking; it's just that the combination is
too large for my taste.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:27 -04:00
Al Viro
84e710da2a parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not.  And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
	* we verify that there's no hashed match
	* existing in-lookup match gets hashed by another process
	* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup.  Then the above would turn into
	* read the counter
	* do dcache lookup
	* if no matches found, check for in-lookup matches
	* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and updating logics; the readers will be
added in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:26 -04:00
Al Viro
85c7f81041 beginning of transition to parallel lookups - marking in-lookup dentries
marked as such when (would be) parallel lookup is about to pass them
to actual ->lookup(); unmarked when
	* __d_add() is about to make it hashed, positive or not.
	* __d_move() (from d_splice_alias(), directly or via
__d_unalias()) puts a preexisting dentry in its place
	* in caller of ->lookup() if it has escaped all of the
above.  Bug (WARN_ON, actually) if it reaches the final dput()
or d_instantiate() while still marked such.

As the result, we are guaranteed that for as long as the flag is
set, dentry will
	* remain negative unhashed with positive refcount
	* never have its ->d_alias looked at
	* never have its ->d_lru looked at
	* never have its ->d_parent and ->d_name changed

Right now we have at most one such for any given parent directory.
With parallel lookups that restriction will weaken to
	* only exist when parent is locked shared
	* at most one with given (parent,name) pair (comparison of
names is according to ->d_compare())
	* only exist when there's no hashed dentry with the same
(parent,name)

Transition will take the next several commits; unfortunately, we'll
only be able to switch to rwsem at the end of this series.  The
reason for not making it a single patch is to simplify review.

New primitives: d_in_lookup() (a predicate checking if dentry is in
the in-lookup state) and d_lookup_done() (tells the system that
we are done with lookup and if it's still marked as in-lookup, it
should cease to be such).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:51 -04:00
Al Viro
0568d705b0 __d_add(): don't drop/regain ->d_lock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:26 -04:00
Al Viro
1936386ea9 lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:26 -04:00
Al Viro
d2caaa0a77 nfs: missing wakeup in nfs_unblock_sillyrename()
will be needed as soon as lookups are not serialized by ->i_mutex

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:25 -04:00
Al Viro
be5b82dbfe make ext2_get_page() and friends work without external serialization
Right now ext2_get_page() (and its analogues in a bunch of other filesystems)
relies upon the directory being locked - the way it sets and tests Checked and
Error bits would be racy without that.  Switch to a slightly different scheme,
_not_ setting Checked in case of failure.  That way the logics becomes
	if Checked => OK
	else if Error => fail
	else if !validate => fail
	else => OK
with validation setting Checked or Error on success and failure resp. and
returning which one had happened.  Equivalent to the current logics, but unlike
the current logics not sensitive to the order of set_bit, test_bit getting
reordered by CPU, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:25 -04:00
Al Viro
b9e1d435fd ovl_lookup_real(): use lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:24 -04:00
Al Viro
383d4e8ab0 reconnect_one(): use lookup_one_len_unlocked()
... and explain the non-obvious logics in case when lookup yields
a different dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:24 -04:00
Al Viro
1ae1f3f647 reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
... and have it use inode_lock()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:23 -04:00
Al Viro
5ecfcb265f orangefs: don't open-code inode_lock/inode_unlock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:23 -04:00
Al Viro
7b9743eb89 ocfs2: don't open-code inode_lock/inode_unlock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:22 -04:00
Al Viro
48f35b7b73 configfs_detach_prep(): make sure that wait_mutex won't go away
grab a reference to dentry we'd got the sucker from, and return
that dentry via *wait, rather than just returning the address of
->i_mutex.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:22 -04:00
Al Viro
779b839133 kernfs: use lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:21 -04:00
Al Viro
b96809173e security_d_instantiate(): move to the point prior to attaching dentry to inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:02 -04:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Linus Torvalds
33656a1f2e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
 "A fix of a regression in UDF that got introduced in 4.6-rc1 by one of
  the charset encoding fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix conversion of 'dstring' fields to UTF8
2016-05-02 09:59:57 -07:00
Serge Hallyn
e99ed4de17 kernfs_path_from_node_locked: don't overwrite nlen
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to.  We use them
as such at the end.  But in the loop copying the actual entries, we
overwrite @nlen.  Use a temporary variable for that instead.

Without this, the return length, when the buffer is large enough, is
wrong.  (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)

Interestingly, no callers of this function are affected by this as of
yet.  However the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
2016-05-02 12:36:00 -04:00
Anand Jain
ad8403df05 btrfs: pass the right error code to the btrfs_std_error
Also drop the newline from the message.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-02 17:50:13 +02:00
Bob Peterson
8381e60227 GFS2: Remove allocation parms from gfs2_rbm_find
Struct gfs2_alloc_parms ap is never referenced in function
gfs2_rbm_find, so this patch removes it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-05-02 09:49:22 -05:00
Abhi Das
80f4781d2c gfs2: use inode_lock/unlock instead of accessing i_mutex directly
i_mutex has been replaced by i_rwsem and directly accessing the
non-existent i_mutex breaks the kernel build.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-05-02 07:07:01 -05:00
Christoph Hellwig
24368aad47 nfsd: use RWF_SYNC
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
e864f39569 fs: add RWF_DSYNC aand RWF_SYNC
This is the per-I/O equivalent of O_DSYNC and O_SYNC, and very useful for
all kinds of file servers and storage targets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
6aa657c852 ceph: use generic_write_sync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
716b9bc0cb direct-io: remove the offset argument to dio_complete
It has to be identical to ki_pos of the iocb, so use that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
13712713ca xfs: eliminate the pos variable in xfs_file_dio_aio_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
1af5bb491f filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Mimi Zohar
05d1a717ec ima: add support for creating files using the mknodat syscall
Commit 3034a14 "ima: pass 'opened' flag to identify newly created files"
stopped identifying empty files as new files.  However new empty files
can be created using the mknodat syscall.  On systems with IMA-appraisal
enabled, these empty files are not labeled with security.ima extended
attributes properly, preventing them from subsequently being opened in
order to write the file data contents.  This patch defines a new hook
named ima_post_path_mknod() to mark these empty files, created using
mknodat, as new in order to allow the file data contents to be written.

In addition, files with security.ima xattrs containing a file signature
are considered "immutable" and can not be modified.  The file contents
need to be written, before signing the file.  This patch relaxes this
requirement for new files, allowing the file signature to be written
before the file contents.

Changelog:
- defer identifying files with signatures stored as security.ima
  (based on Dmitry Rozhkov's comments)
- removing tests (eg. dentry, dentry->d_inode, inode->i_size == 0)
  (based on Al's review)

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Al Viro <<viro@zeniv.linux.org.uk>
Tested-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>
2016-05-01 09:23:52 -04:00
Dmitry Kasatkin
39d637af5a vfs: forbid write access when reading a file into memory
This patch is based on top of the "vfs: support for a common kernel file
loader" patch set.  In general when the kernel is reading a file into
memory it does not want anything else writing to it.

The kernel currently only forbids write access to a file being executed.
This patch extends this locking to files being read by the kernel.

Changelog:
- moved function to kernel_read_file() - Mimi
- updated patch description - Mimi

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@huawei.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
2016-05-01 09:23:51 -04:00
Al Viro
10c64cea04 atomic_open(): fix the handling of create_error
* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT.  Which is what
the current code ends up returning.

* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here.  At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen.  No need
to do that inside atomic_open().

The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error.  No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.

Cc: stable@vger.kernel.org # v3.6+
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-30 16:40:52 -04:00
Chris Wilson
e4234a1fc3 kernfs: Move faulting copy_user operations outside of the mutex
A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[   82.811702] ======================================================
[   82.811705] [ INFO: possible circular locking dependency detected ]
[   82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[   82.811711] -------------------------------------------------------
[   82.811714] kms_setmode/5859 is trying to acquire lock:
[   82.811717]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
[   82.811731]
but task is already holding lock:
[   82.811734]  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[   82.811745]
which lock already depends on the new lock.

[   82.811749]
the existing dependency chain (in reverse order) is:
[   82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[   82.811761]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811766]        [<ffffffff8118bc65>] __might_fault+0x75/0xa0
[   82.811771]        [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
[   82.811787]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[   82.811792]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[   82.811797]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[   82.811801]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811807]
-> #2 (s_active#6){++++.+}:
[   82.811814]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811819]        [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
[   82.811823]        [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
[   82.811828]        [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
[   82.811832]        [<ffffffff815318d4>] device_del+0x124/0x250
[   82.811837]        [<ffffffff81531a19>] device_unregister+0x19/0x60
[   82.811841]        [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
[   82.811846]        [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
[   82.811851]        [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
[   82.811856]        [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
[   82.811860]        [<ffffffff810786de>] cpu_notify+0x1e/0x40
[   82.811865]        [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
[   82.811869]        [<ffffffff81078ac3>] _cpu_down+0x233/0x340
[   82.811874]        [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
[   82.811878]        [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
[   82.811883]        [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
[   82.811888]        [<ffffffff810d1b77>] state_store+0x77/0xe0
[   82.811892]        [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
[   82.811897]        [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
[   82.811902]        [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
[   82.811906]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[   82.811910]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[   82.811914]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[   82.811918]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[   82.811929]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811933]        [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
[   82.811940]        [<ffffffff810784c1>] get_online_cpus+0x61/0x80
[   82.811944]        [<ffffffff811170eb>] stop_machine+0x1b/0xe0
[   82.811949]        [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[   82.812009]        [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
[   82.812045]        [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
[   82.812081]        [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
[   82.812117]        [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
[   82.812154]        [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
[   82.812192]        [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
[   82.812232]        [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
[   82.812278]        [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
[   82.812318]        [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
[   82.812323]        [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
[   82.812328]        [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
[   82.812360]        [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
[   82.812366]        [<ffffffff81535f89>] driver_probe_device+0x229/0x450
[   82.812371]        [<ffffffff81536233>] __driver_attach+0x83/0x90
[   82.812375]        [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
[   82.812380]        [<ffffffff81535879>] driver_attach+0x19/0x20
[   82.812384]        [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
[   82.812388]        [<ffffffff81536e9b>] driver_register+0x5b/0xe0
[   82.812393]        [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
[   82.812398]        [<ffffffff81514866>] drm_pci_init+0xd6/0x100
[   82.812402]        [<ffffffffa027c094>] 0xffffffffa027c094
[   82.812406]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[   82.812412]        [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
[   82.812417]        [<ffffffff81106160>] load_module+0x1c20/0x2480
[   82.812422]        [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
[   82.812428]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[   82.812439]        [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[   82.812443]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.812456]        [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[   82.812460]        [<ffffffff81196a14>] mmap_region+0x334/0x580
[   82.812466]        [<ffffffff81196fc4>] do_mmap+0x364/0x410
[   82.812470]        [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[   82.812474]        [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[   82.812479]        [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[   82.812484]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812489]
other info that might help us debug this:

[   82.812493] Chain exists of:
  &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[   82.812502]  Possible unsafe locking scenario:

[   82.812506]        CPU0                    CPU1
[   82.812508]        ----                    ----
[   82.812510]   lock(&mm->mmap_sem);
[   82.812514]                                lock(s_active#6);
[   82.812519]                                lock(&mm->mmap_sem);
[   82.812522]   lock(&dev->struct_mutex);
[   82.812526]
 *** DEADLOCK ***

[   82.812531] 1 lock held by kms_setmode/5859:
[   82.812533]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[   82.812541]
stack backtrace:
[   82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[   82.812550] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[   82.812553]  0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[   82.812560]  ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[   82.812566]  ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[   82.812573] Call Trace:
[   82.812578]  [<ffffffff813f8505>] dump_stack+0x67/0x92
[   82.812582]  [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
[   82.812586]  [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[   82.812590]  [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.812594]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[   82.812599]  [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[   82.812603]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[   82.812608]  [<ffffffff81196a14>] mmap_region+0x334/0x580
[   82.812612]  [<ffffffff81196fc4>] do_mmap+0x364/0x410
[   82.812616]  [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[   82.812629]  [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[   82.812633]  [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[   82.812637]  [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation from out under the kernfs_get_active()
tracking by assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba
Author: NeilBrown <neilb@suse.de>
Date:   Mon Oct 13 16:41:28 2014 +1100

    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-30 10:05:05 -07:00
Linus Torvalds
1d003af2ef Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  Documentation/sysctl/vm.txt: update numa_zonelist_order description
  lib/stackdepot.c: allow the stack trace hash to be zero
  rapidio: fix potential NULL pointer dereference
  mm/memory-failure: fix race with compound page split/merge
  ocfs2/dlm: return zero if deref_done message is successfully handled
  Ananth has moved
  kcov: don't profile branches in kcov
  kcov: don't trace the code coverage code
  mm: wake kcompactd before kswapd's short sleep
  .mailmap: add Frank Rowand
  mm/hwpoison: fix wrong num_poisoned_pages accounting
  mm: call swap_slot_free_notify() with page lock held
  mm: vmscan: reclaim highmem zone if buffer_heads is over limit
  numa: fix /proc/<pid>/numa_maps for THP
  mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
  mailmap: fix Krzysztof Kozlowski's misspelled name
  thp: keep huge zero page pinned until tlb flush
  mm: exclude HugeTLB pages from THP page_mapped() logic
  kexec: export OFFSET(page.compound_head) to find out compound tail page
  kexec: update VMCOREINFO for compound_order/dtor
2016-04-29 11:21:22 -07:00
xuejiufei
b73413647e ocfs2/dlm: return zero if deref_done message is successfully handled
dlm_deref_lockres_done_handler() should return zero if the message is
successfully handled.

Fixes: 60d663cb52 ("ocfs2/dlm: add DEREF_DONE message").
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Gerald Schaefer
28093f9f34 numa: fix /proc/<pid>/numa_maps for THP
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>	[4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Linus Torvalds
6fa9bffbcc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "There is a lifecycle fix in the auth code, a fix for a narrow race
  condition on map, and a helpful message in the log when there is a
  feature mismatch (which happens frequently now that the default
  server-side options have changed)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: report unsupported features to syslog
  rbd: fix rbd map vs notify races
  libceph: make authorizer destruction independent of ceph_auth_client
2016-04-28 18:59:24 -07:00
Jeff Mahoney
db6711600e btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly
to userspace.  In order to decode them, full type information is required.

Create a new header, btrfs_tree to contain these since most users won't
need them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
33ca913349 btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args
struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
04cd01dffb btrfs: uapi/linux/btrfs.h migration, move balance flags
The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags
and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
18db9ac644 btrfs: uapi/linux/btrfs.h migration, move feature flags
The compat/compat_ro/incompat feature flags are used by the feature set/get
ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
83288b60bf btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which
fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney
d4ae133b2d btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Anand Jain
b5255456c5 btrfs: refactor btrfs_dev_replace_start for reuse
A refactor patch, and avoids user input verification in the
btrfs_dev_replace_start(), and so this function can be reused.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
fc23c246d7 btrfs: use fs_info directly
Local variable fs_info, contains root->fs_info, use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
735654ea91 btrfs: rename flags for vol args v2
Rename BTRFS_DEVICE_BY_ID so it's more descriptive that we specify the
device by id, it'll be part of the public API. The mask of supported
flags is also renamed, only for internal use.

The error code for unknown flags is EOPNOTSUPP, fixed.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
5c5c0df05d btrfs: rename btrfs_find_device_by_user_input
For clarity how we are going to find the device, let's call it a device
specifier, devspec for short. Also rename the arguments that are a
leftover from previous function purpose.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
418775a22b btrfs: use existing device constraints table btrfs_raid_array
We should avoid duplicating the device constraints, let's use the
btrfs_raid_array in btrfs_check_raid_min_devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
621292bae6 btrfs: introduce raid-type to error-code table, for minimum device constraint
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
3cc31a0d5b btrfs: pass number of devices to btrfs_check_raid_min_devices
Before this patch, btrfs_check_raid_min_devices would do an off-by-one
check of the constraints and not the miminmum check, as its name
suggests. This is not a problem if the only caller is device remove, but
would be confusing for others.

Add an argument with the exact number and let the caller(s) decide if
this needs any adjustments, like when device replace is running.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
f47ab2588e btrfs: rename __check_raid_min_devices
Underscores are for special functions, use the full prefix for better
stacktrace recognition.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
02feae3c55 btrfs: optimize check for stale device
Optimize check for stale device to only be checked when there is device
added or changed. If there is no update to the device, there is no need
to call btrfs_free_stale_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6b526ed70c btrfs: introduce device delete by devid
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
btrfs_ioctl_vol_args_v2 to carry devid as an user argument.

The patch won't delete the old ioctl interface and so kernel remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del  3  /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
[ adjust messages, s/disk/device/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
42b6742715 btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()
With the previous patches now the btrfs_scratch_superblocks() is ready to
be used in btrfs_rm_device() so use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
b3d1b1532f btrfs: enhance btrfs_find_device_by_user_input() to check device path
The operation of device replace and device delete follows same steps upto
some depth with in btrfs kernel, however they don't share codes. This
enhancement will help replace and delete to share codes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24fc572fe4 btrfs: make use of btrfs_find_device_by_user_input()
btrfs_rm_device() has a section of the code which can be replaced
btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24e0474b59 btrfs: create helper btrfs_find_device_by_user_input()
The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c, so that
delete device can use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
bd45ffbcb1 btrfs: clean up and optimize __check_raid_min_device()
__check_raid_min_device() which was pealed from btrfs_rm_device()
maintianed its original code to show the block move. This patch cleans up
__check_raid_min_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
f1fa7f2642 btrfs: create helper function __check_raid_min_devices()
move a section of btrfs_rm_device() code to check for min number of the
devices into the function __check_raid_min_devices()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6cf86a006b btrfs: create a helper function to read the disk super
A part of code from btrfs_scan_one_device() is moved to a new function
btrfs_read_disk_super(), so that former function looks cleaner. (In this
process it also moves the code which ensures null terminating label). So
this creates easy opportunity to merge various duplicate codes on read
disk super. Earlier attempt to merge duplicate codes highlighted that
there were some issues for which there are duplicate codes (to read disk
super), however it was not clear what was the issue. So until we figure
that out, its better to keep them in a separate functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:04 +02:00
Jan Kara
2a28900be2 udf: Export superblock magic to userspace
Currently UDF superblock magic doesn't appear in any userspace header
files and thus userspace apps have hard time checking for this fs. Let's
export the magic to userspace as with any other filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-28 10:43:31 +02:00
Liu Bo
cf25ce518e Btrfs: do not create empty block group if we have allocated data
Now we force to create empty block group to keep data profile alive,
however, in the below example, we eventually get an empty block group
while we're trying to get more space for other types (metadata/system),

- Before,
block group "A": size=2G, used=1.2G
block group "B": size=2G, used=512M

- After "btrfs balance start -dusage=50 mount_point",
block group "A": size=2G, used=(1.2+0.5)G
block group "C": size=2G, used=0

Since there is no data in block group C, it won't be deleted
automatically and we have to get the unused 2G until the next mount.

Balance itself just moves data and doesn't remove data, so it's safe
to not create such a empty block group if we already have data
 allocated in other block groups.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Chandan Rajendra
a2af23b7d7 Btrfs: __btrfs_buffered_write: Pass valid file offset when releasing delalloc space
The delalloc reserved space is calculated in terms of number of bytes
used by an integral number of blocks. This is done by rounding down the
value of 'pos' to the nearest multiple of sectorsize.

The file offset value held by 'pos' variable may not be aligned to
sectorsize and hence when passing it as an argument to
btrfs_delalloc_release_space(), we may end up releasing larger delalloc
space than we originally had reserved.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo
894b36e35a Btrfs: cleanup error handling in extent_write_cached_pages
Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo
a91326679f Btrfs: make mapping->writeback_index point to the last written page
If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.

In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.

For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.

To avoid it, we just need to continue writeback from the last written page.

This also updates btrfs to behave like what write_cache_pages() does, ie, bail
 out immediately if there is an error in writepage().

<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Luke Dashjr
4c63c2454e btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.

Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:40:27 +02:00
Luis de Bethencourt
180e4d4700 btrfs: fix typos in comments
Correct a typo in the chunk_mutex name to make it grepable.

Since it is better to fix several typos at once, fixing the 2 more in the
same file.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Geert Uytterhoeven
6719afdcf8 Btrfs: Refactor btrfs_lock_cluster() to kill compiler warning
fs/btrfs/extent-tree.c: In function ‘btrfs_lock_cluster’:
fs/btrfs/extent-tree.c:6399: warning: ‘used_bg’ may be used uninitialized in this function

  - Replace "again: ... goto again;" by standard C "while (1) { ... }",
  - Move block not processed during the first iteration of the loop to the
    end of the loop, which allows to kill the "locked" variable,

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-and-Tested-by: Miao Xie <miaox@cn.fujitsu.com>
[ the compilation warning has been fixed by other patch, now we want to
  clean up the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
0713d90c75 btrfs: remove save_error_info()
Actually save_error_info() sets the FS state to error and nothing else.
Further the word save doesn't induce caffeine when compared to the word
set in what actually it does.

So to make it better understandable move save_error_info() code to its
only consumer itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Satoru Takeuchi
13f48dc909 btrfs: Simplify conditions about compress while mapping btrfs flags to inode flags
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
c5f4ccb2f7 btrfs: move error handling code together in ctree.h
Looks like we added the incompatible defines in between the error
handling defines in the file ctree.h. Now group them back.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
2351d743f6 btrfs: remove unused function btrfs_assert()
Apparently looks like ASSERT does the same intended job,
as intended btrfs_assert().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain
34d9700702 btrfs: rename btrfs_std_error to btrfs_handle_fs_error
btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Yunlong Song
fe216c7a0f f2fs: fix to return 0 if err == -ENOENT in f2fs_readdir
Commit 57b62d29ad ("f2fs: fix to report
error in f2fs_readdir") causes f2fs_readdir to return -ENOENT when
get_lock_data_page returns -ENOENT. However, the original logic is to
continue when get_lock_data_page returns -ENOENT, but it forgets to
reset err to 0.

This will cause getdents64 incorretly return -ENOENT when lastdirent is
NULL in getdents64. This will lead to a wrong return value for syscall
caller.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:43 -07:00
Chao Yu
da011cc0da f2fs: move node pages only in victim section during GC
For foreground GC, we cache node blocks in victim section and set them
dirty, then we call sync_node_pages to flush these node pages, but
meanwhile, those node pages which does not locate in victim section
will be flushed together, so more bandwidth and continuous free space
would be occupied.

So for this condition, it's better to leave those unrelated node page
in cache for further write hit, and let CP or VM to flush them afterward.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:42 -07:00
Chao Yu
a4a13f582c f2fs: be aware of invalid filename length
The filename length in dirent of may become zero-sized after random junk
data injection, once encounter such dirent, find_target_dentry or
f2fs_add_inline_entries will run into an infinite loop. So let f2fs being
aware of that to avoid deadloop.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:42 -07:00
David S. Miller
c0cc53162a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.

In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:43:10 -04:00
Linus Torvalds
8ead9dd547 devpts: more pty driver interface cleanups
This is more prep-work for the upcoming pty changes.  Still just code
cleanup with no actual semantic changes.

This removes a bunch pointless complexity by just having the slave pty
side remember the dentry associated with the devpts slave rather than
the inode.  That allows us to remove all the "look up the dentry" code
for when we want to remove it again.

Together with moving the tty pointer from "inode->i_private" to
"dentry->d_fsdata" and getting rid of pointless inode locking, this
removes about 30 lines of code.  Not only is the end result smaller,
it's simpler and easier to understand.

The old code, for example, depended on the d_find_alias() to not just
find the dentry, but also to check that it is still hashed, which in
turn validated the tty pointer in the inode.

That is a _very_ roundabout way to say "invalidate the cached tty
pointer when the dentry is removed".

The new code just does

	dentry->d_fsdata = NULL;

in devpts_pty_kill() instead, invalidating the tty pointer rather more
directly and obviously.  Don't do something complex and subtle when the
obvious straightforward approach will do.

The rest of the patch (ie apart from code deletion and the above tty
pointer clearing) is just switching the calling convention to pass the
dentry or file pointer around instead of the inode.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-26 15:47:32 -07:00
Jaegeuk Kim
6bfc49197e f2fs: issue cache flush on direct IO
Under direct IO path with O_(D)SYNC, it needs to set proper APPEND or UPDATE
flags, so taht f2fs_sync_file can make its data safe.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:25:05 -07:00
Jaegeuk Kim
608514deba f2fs: set fsync mark only for the last dnode
In order to give atomic writes, we should consider power failure during
sync_node_pages in fsync.
So, this patch marks fsync flag only in the last dnode block.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:59 -07:00
Jaegeuk Kim
c267ec1526 f2fs: report unwritten status in fsync_node_pages
The fsync_node_pages should return pass or failure so that user could know
fsync is completed or not.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:54 -07:00
Jaegeuk Kim
5268137564 f2fs: split sync_node_pages with fsync_node_pages
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:48 -07:00
Jaegeuk Kim
e6e5f5610d f2fs: avoid writing 0'th page in volatile writes
The first page of volatile writes usually contains a sort of header information
which will be used for recovery.
(e.g., journal header of sqlite)

If this is written without other journal data, user needs to handle the stale
journal information.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:39 -07:00
Jaegeuk Kim
eca76e783c f2fs: avoid needless lock for node pages when fsyncing a file
When fsync is called, sync_node_pages finds a proper direct node pages to flush.
But, it locks unrelated direct node pages together unnecessarily.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:30 -07:00
Nicolas Dichtel
3c6f3714d6 fs/quota: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Alden Tondettar
a47241cdee udf: Prevent stack overflow on corrupted filesystem mount
Presently, a corrupted or malicious UDF filesystem containing a very large
number (or cycle) of Logical Volume Integrity Descriptor extent
indirections may trigger a stack overflow and kernel panic in
udf_load_logicalvolint() on mount.

Replace the unnecessary recursion in udf_load_logicalvolint() with
simple iteration. Set an arbitrary limit of 1000 indirections (which would
have almost certainly overflowed the stack without this fix), and treat
such cases as if there were no LVID.

Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-26 08:25:07 +02:00
Ilya Dryomov
6c1ea260f8 libceph: make authorizer destruction independent of ceph_auth_client
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
    [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
    [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
    [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
    [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
    [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
    [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-04-25 20:54:13 +02:00
Andrew Gabbasov
c26f6c6157 udf: Fix conversion of 'dstring' fields to UTF8
Commit 9293fcfbc1
("udf: Remove struct ustr as non-needed intermediate storage"),
while getting rid of 'struct ustr', does not take any special care
of 'dstring' fields and effectively use fixed field length instead
of actual string length, encoded in the last byte of the field.

Also, commit 484a10f493
("udf: Merge linux specific translation into CS0 conversion function")
introduced checking of the length of the string being converted,
requiring proper alignment to number of bytes constituing each
character.

The UDF volume identifier is represented as a 32-bytes 'dstring',
and needs to be converted from CS0 to UTF8, while mounting UDF
filesystem. The changes in mentioned commits can in some cases
lead to incorrect handling of volume identifier:
- if the actual string in 'dstring' is of maximal length and
does not have zero bytes separating it from dstring encoded
length in last byte, that last byte may be included in conversion,
thus making incorrect resulting string;
- if the identifier is encoded with 2-bytes characters (compression
code is 16), the length of 31 bytes (32 bytes of field length minus
1 byte of compression code), taken as the string length, is reported
as an incorrect (unaligned) length, and the conversion fails, which
in its turn leads to volume mounting failure.

This patch introduces handling of 'dstring' encoded length field
in udf_CS0toUTF8 function, that is used in all and only cases
when 'dstring' fields are converted. Currently these cases are
processing of Volume Identifier and Volume Set Identifier fields.
The function is also renamed to udf_dstrCS0toUTF8 to distinctly
indicate that it handles 'dstring' input.

Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-25 15:18:50 +02:00
Ashish Samant
2c932d4c91 fuse: Fix return value from fuse_get_user_pages()
fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io
read will not return 0 to indicate that read has completed.

Fixes: 742f992708 ("fuse: return patrial success from fuse_direct_io()")
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-04-25 13:01:04 +02:00
David S. Miller
1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Yan, Zheng
b971e94e8f ceph: kill __ceph_removexattr()
when removing a xattr, generic_removexattr() calls __ceph_setxattr()
with NULL value and XATTR_REPLACE flag. __ceph_removexattr() is not
used any more.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher
2cdeb1e472 ceph: Switch to generic xattr handlers
Add a catch-all xattr handler at the end of ceph_xattr_handlers.  Check
for valid attribute names there, and remove those checks from
__ceph_{get,set,remove}xattr instead.  No "system.*" xattrs need to be
handled by the catch-all handler anymore.

The set xattr handler is called with a NULL value to indicate that the
attribute should be removed; __ceph_setxattr already handles that case
correctly (ceph_set_acl could already calling __ceph_setxattr with a NULL
value).

Move the check for snapshots from ceph_{set,remove}xattr into
__ceph_{set,remove}xattr.  With that, ceph_{get,set,remove}xattr can be
replaced with the generic iops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher
a26feccaba ceph: Get rid of d_find_alias in ceph_set_acl
Create a variant of ceph_setattr that takes an inode instead of a
dentry.  Change __ceph_setxattr (and also __ceph_removexattr) to take an
inode instead of a dentry.  Use those in ceph_set_acl so that we no
longer need a dentry there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher
a9ae008f40 cifs: Switch to generic xattr handlers
Use xattr handlers for resolving attribute names.  The amount of setup
code required on cifs is nontrivial, so use the same get and set
functions for all handlers, with switch statements for the different
types of attributes in them.

The set_EA handler can handle NULL values, so we don't need a separate
removexattr function anymore.  Remove the cifs_dbg statements related to
xattr name resolution; they don't add much.  Don't build xattr.o when
CONFIG_CIFS_XATTR is not defined.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher
534bb0c7bd cifs: Fix removexattr for os2.* xattrs
If cifs_removexattr finds a "user." or "os2." xattr name prefix, it
skips 5 bytes, one byte too many for "os2.".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher
45987e006c cifs: Check for equality with ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT
The two values ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT are meant to be
enumerations, not bits in a bit mask.  Use '==' instead of '&' to check
for these values.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher
d9a1548921 cifs: Fix xattr name checks
Use strcmp(str, name) instead of strncmp(str, name, strlen(name)) for
checking if str and name are the same (as opposed to name being a prefix
of str) in the gexattr and setxattr inode operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Herbert Xu
e81f3340bb eCryptfs: Do not allocate hash tfm in NORECLAIM context
You cannot allocate crypto tfm objects in NORECLAIM or NOFS contexts.
The ecryptfs code currently does exactly that for the MD5 tfm.

This patch fixes it by preallocating the MD5 tfm in a safe context.

The MD5 tfm is also reentrant so this patch removes the superfluous
cs_hash_tfm_mutex.

Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-20 17:50:01 +08:00
Daniel DeFreez
9c7fe83530 GFS2: Add calls to gfs2_holder_uninit in two error handlers
This patch fixes two locations that do not call gfs2_holder_uninit
if gfs2_glock_nq returns an error.

Signed-off-by: Daniel DeFreez <dcdefreez@ucdavis.edu>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-19 19:59:11 -04:00
Linus Torvalds
9a0e3eea25 Merge branch 'ptmx-cleanup'
Merge the ptmx internal interface cleanup branch.

This doesn't change semantics, but it should be a sane basis for
eventually getting the multi-instance devpts code into some sane shape
where we can get rid of the kernel config option.  Which we can
hopefully get done next merge window..

* ptmx-cleanup:
  devpts: clean up interface to pty drivers
2016-04-19 16:36:18 -07:00
Linus Torvalds
67245ff332 devpts: clean up interface to pty drivers
This gets rid of the horrible notion of having that

    struct inode *ptmx_inode

be the linchpin of the interface between the pty code and devpts.

By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
and we will have a much saner way forward.  In particular, this will
allow us to associate with any particular devpts instance at open-time,
and not be artificially tied to one particular ptmx inode.

The patch itself is actually fairly straightforward, and apart from some
locking and return path cleanups it's pretty mechanical:

 - the interfaces that devpts exposes all take "struct pts_fs_info *"
   instead of "struct inode *ptmx_inode" now.

   NOTE! The "struct pts_fs_info" thing is a completely opaque structure
   as far as the pty driver is concerned: it's still declared entirely
   internally to devpts. So the pty code can't actually access it in any
   way, just pass it as a "cookie" to the devpts code.

 - the "look up the pts fs info" is now a single clear operation, that
   also does the reference count increment on the pts superblock.

   So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
   ref" operation (devpts_get_ref(inode)), along with a "put ref" op
   (devpts_put_ref()).

 - the pty master "tty->driver_data" field now contains the pts_fs_info,
   not the ptmx inode.

 - because we don't care about the ptmx inode any more as some kind of
   base index, the ref counting can now drop the inode games - it just
   gets the ref on the superblock.

 - the pts_fs_info now has a back-pointer to the super_block. That's so
   that we can easily look up the information we actually need. Although
   quite often, the pts fs info was actually all we wanted, and not having
   to look it up based on some magical inode makes things more
   straightforward.

In particular, now that "devpts_get_ref(inode)" operation should really
be the *only* place we need to look up what devpts instance we're
associated with, and we do it exactly once, at ptmx_open() time.

The other side of this is that one ptmx node could now be associated
with multiple different devpts instances - you could have a single
/dev/ptmx node, and then have multiple mount namespaces with their own
instances of devpts mounted on /dev/pts/.  And that's all perfectly sane
in a model where we just look up the pts instance at open time.

This will eventually allow us to get rid of our odd single-vs-multiple
pts instance model, but this patch in itself changes no semantics, only
an internal binding model.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-18 13:43:02 -07:00
Greg Kroah-Hartman
5614e77258 Merge 4.6-rc4 into driver-core-next
We want those fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-19 04:28:28 +09:00
Masanari Iida
c19ca6cb4c treewide: Fix typos in printk
This patch fix spelling typos found in printk
within various part of the kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:23:24 +02:00
Jiri Kosina
9938b04472 Merge branch 'master' into for-next
Sync with Linus' tree so that patches against newer codebase can be applied.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:18:55 +02:00
Masanari Iida
bd7ced9881 Doc: treewide : Fix typos in DocBook/filesystem.xml
This patch fix spelling typos found in DocBook/filesystem.xml.
It is because the file was generated from comments in code,
I have to fix the comments in codes, instead of xml file.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:13:05 +02:00
Linus Torvalds
e1e22b27ec "driver core" fixes for 4.6-rc4
Here are 3 small fixes 4.6-rc4.  Two fix up some lz4 issues with big
 endian systems, and the remaining one resolves a minor debugfs issue
 that was reported.
 
 All have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlcS3ucACgkQMUfUDdst+ynw1QCbBGgbY7Xt08whNFcAP81z1Q5X
 fmsAn1fyrhfsxe+JnybzswsTOjFw99Xd
 =YX6h
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull misc fixes from Greg KH:
 "Here are three small fixes for 4.6-rc4.

  Two fix up some lz4 issues with big endian systems, and the remaining
  one resolves a minor debugfs issue that was reported.

  All have been in linux-next with no reported issues"

* tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  lib: lz4: cleanup unaligned access efficiency detection
  lib: lz4: fixed zram with lz4 on big endian machines
  debugfs: Make automount point inodes permanently empty
2016-04-16 20:53:50 -07:00
Jaegeuk Kim
c27753d675 f2fs: flush dirty pages before starting atomic writes
If somebody wrote some data before atomic writes, we should flush them in order
to handle atomic data in a right period.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim
63c52d7878 f2fs: don't invalidate atomic page if successful
If we committed atomic write successfully, we don't need to invalidate pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim
58457f1c35 f2fs: give -E2BIG for no space in xattr
This patch returns -E2BIG if there is no space to add an xattr entry.
This should fix generic/026 in xfstests as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim
4da7bf5a43 f2fs: remove redundant condition check
This patch resolves the redundant condition check reported by David.

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
26dc3d4424 f2fs: unset atomic/volatile flag in f2fs_release_file
The atomic/volatile operation should be done in pair of start and commit
ioctl.
For example, if a killed process remains open-ended atomic operation, we should
drop its flag as well as its atomic data. Otherwise, if sqlite initiates another
operation which doesn't require atomic writes, it will lose every data, since
f2fs still treats with them as atomic writes; nobody will trigger its commit.

Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
de5307e46d f2fs: fix dropping inmemory pages in a wrong time
When one reader closes its file while the other writer is doing atomic writes,
f2fs_release_file drops atomic data resulting in an empty commit.
This patch fixes this wrong commit problem by checking openess of the file.

 Process0                       Process1
 				open file
 start atomic write
 write data
 read data
				close file
				f2fs_release_file()
				clear atomic data
 commit atomic write

Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
ff37355886 f2fs: add BUG_ON to avoid unnecessary flow
This patch adds BUG_ON instead of retrying loop.
In the case of node pages, we already got this inode page, but unlocked it.
By the fact that we don't truncate any node pages in operations, the page's
mapping should be unchangeable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
4a6de50d54 f2fs: use PGP_LOCK to check its truncation
Previously, after trylock_page is succeeded, it doesn't check its mapping.
In order to fix that, we can just give PGP_LOCK to pagecache_get_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Chao Yu
675f10bde6 f2fs: fix to convert inline directory correctly
With below serials, we will lose parts of dirents:

1) mount f2fs with inline_dentry option
2) echo 1 > /sys/fs/f2fs/sdX/dir_level
3) mkdir dir
4) touch 180 files named [1-180] in dir
5) touch 181 in dir
6) echo 3 > /proc/sys/vm/drop_caches
7) ll dir

ls: cannot access 2: No such file or directory
ls: cannot access 4: No such file or directory
ls: cannot access 5: No such file or directory
ls: cannot access 6: No such file or directory
ls: cannot access 8: No such file or directory
ls: cannot access 9: No such file or directory
...
total 360
drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
-rw-r--r-- 1 root root    0 Feb 19 15:12 1
-rw-r--r-- 1 root root    0 Feb 19 15:12 10
-rw-r--r-- 1 root root    0 Feb 19 15:12 100
-????????? ? ?    ?       ?            ? 101
-????????? ? ?    ?       ?            ? 102
-????????? ? ?    ?       ?            ? 103
...

The reason is: when doing the inline dir conversion, we didn't consider
that directory has hierarchical hash structure which can be configured
through sysfs interface 'dir_level'.

By default, dir_level of directory inode is 0, it means we have one bucket
in hash table located in first level, all dirents will be hashed in this
bucket, so it has no problem for us to do the duplication simply between
inline dentry page and converted normal dentry page.

However, if we configured dir_level with the value N (greater than 0), it
will expand the bucket number of first level hash table by 2^N - 1, it
hashs dirents into different buckets according their hash value, if we
still move all dirents to first bucket, it makes incorrent locating for
inline dirents, the result is, although we can iterate all dirents through
->readdir, we can't stat some of them in ->lookup which based on hash
table searching.

This patch fixes this issue by rehashing dirents into correct position
when converting inline directory.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
8c11a53fc2 f2fs: show current mount status
This patch remains the current mount status to f2fs status info.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
faa0e55bba f2fs: treat as a normal umount when remounting ro
When user remounts f2fs as read-only, we can mark the checkpoint as umount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
6781eabba1 f2fs: give -EINVAL for norecovery and rw mount
Once detecting something to recover, f2fs should stop mounting, given norecovery
and rw mount options.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
df728b0f69 f2fs: recover superblock at RW remounts
This patch adds a sbi flag, SBI_NEED_SB_WRITE, which indicates it needs to
recover superblock when (re)mounting as RW. This is set only when f2fs is
mounted as RO.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim
f2353d7bd4 f2fs: give RO message when recovering superblock
When one of superblocks is missing, f2fs recovers it with the valid one.
But, even if f2fs is mounted as RO, we'd better notify that too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Linus Torvalds
dfe70581c1 for-linus-4.6-rc4
These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4 crypto,
 which have not yet been fully propagated as follows.
  - use of dget_parent and file_dentry to avoid crashes
  - disallow RCU-mode lookup in d_invalidate
  - disallow -ENOMEM in the core data encryption path
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXDbkhAAoJEEAUqH6CSFDS6yMP/1L+1PcmTJKloe0wCft57fL0
 hXRXI57DeNwCxc9fKmaBnAn9uv96qc7fWq1oMqT0oYPTf8qGUaIc0bALxx/IFCwh
 Ul1QOzlGJmomPq4kGleOEPmDGBGUwDYgH55rEadhmevbOMX5IGQOSDGu9qxZ+LMA
 orYTYTHX5tjeR1UAxzjMkwa0WRyYJGRsqqycgIl8f/kIg06GIpxOx1IIuFQOBv1N
 BJlkKbbiX/snWKXXVCNlFzVAZOWjpgzfvCGB0Fge4HZlMZL876jKIj+4dVmQ5xiI
 K6i5NXk1U13hhQEFzK4McicHBwlMnMxeMbdR8wm1vQ/4UfQFZoQAeClAxz5/ZyPr
 lHi+VamyTbMfRwijp49hNQ8SKjMKPd70yFd2d2csQAwVra2t4fmXqieh6AEppkJb
 EDeFAK8Y+PGis6XVnAJ1wkXW7OSwzfJUVAmj3oFsZ7Onny+mm++f+i7a7I3RKUib
 SH23Imv1OQaoGcAFVSCYe/JP9dcfKUwzSFFYOiQneqdNTNN8rzoPYYhQ4hNCKv6/
 zHSEI2EcpGipmSTh4dROPJvNf42QG8Sw4lEuyGsXy251t7wxBrc7e43BzXtjyKsl
 /2d5hgAQbR9HeU3Oz41Dq/LEjReEAcdJDKMETP9/qrzs7qDJgtI0XoUkVSn3NyOR
 i3JSEix2JljR97DzUIW/
 =ML1+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs/fscrypto fixes from Jaegeuk Kim:
 "In addition to f2fs/fscrypto fixes, I've added one patch which
  prevents RCU mode lookup in d_revalidate, as Al mentioned.

  These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4
  crypto, which have not yet been fully propagated as follows.

   - use of dget_parent and file_dentry to avoid crashes
   - disallow RCU-mode lookup in d_invalidate
   - disallow -ENOMEM in the core data encryption path"

* tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  ext4/fscrypto: avoid RCU lookup in d_revalidate
  fscrypto: don't let data integrity writebacks fail with ENOMEM
  f2fs: use dget_parent and file_dentry in f2fs_file_open
  fscrypto: use dget_parent() in fscrypt_d_revalidate()
2016-04-14 18:22:42 -07:00
Linus Torvalds
34dbbcdbf6 Make file credentials available to the seqfile interfaces
A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.

The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.

So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones.  Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.

It turns out that seq_file _does_ save away the user_ns information of
the file, though.  Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-14 12:56:09 -07:00
Bob Peterson
e97321fa09 GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid
Function gfs2_inode_lookup was dereferencing the inode, and after,
it checks for the value being NULL. We need to check that first.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-14 09:52:50 -04:00
Hannes Frederic Sowa
fafc4e1ea1 sock: tigthen lockdep checks for sock_owned_by_user
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.

Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:37:20 -04:00
Jaegeuk Kim
03a8bb0e53 ext4/fscrypto: avoid RCU lookup in d_revalidate
As Al pointed, d_revalidate should return RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d6662 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable <stable@vger.kernel.org>
2016-04-12 20:01:35 -07:00
Seth Forshee
87243deb88 debugfs: Make automount point inodes permanently empty
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.

In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we're already do this for the
tracefs mountpoint in sysfs.

Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.

Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 15:01:53 -07:00
Nicolai Stange
c4a74f63df debugfs: unproxify files created through debugfs_create_u32_array()
The struct file_operations u32_array_fops associated with files created
through debugfs_create_u32_array() has been lifetime aware already:
everything needed for subsequent operation is copied to a ->f_private
buffer at file opening time in u32_array_open(). Now, ->open() is always
protected against file removal issues by the debugfs core.

There is no need for the debugfs core to wrap the u32_array_fops
with a file lifetime managing proxy.

Make debugfs_create_u32_array() create its files in non-proxying operation
mode by means of debugfs_create_file_unsafe().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
83b711cbf4 debugfs: unproxify files created through debugfs_create_blob()
Currently, the struct file_operations fops_blob associated with files
created through the debugfs_create_blob() helpers are not file
lifetime aware.

Thus, a lifetime managing proxy is created around fops_blob each time such
a file is opened which is an unnecessary waste of resources.

Implement file lifetime management for the fops_bool file_operations.
Namely, make read_file_blob() safe gainst file removals by means of
debugfs_use_file_start() and debugfs_use_file_finish().

Make debugfs_create_blob() create its files in non-proxying operation mode
by means of debugfs_create_file_unsafe().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
4d45f7974c debugfs: unproxify files created through debugfs_create_bool()
Currently, the struct file_operations fops_bool associated with files
created through the debugfs_create_bool() helpers are not file
lifetime aware.

Thus, a lifetime managing proxy is created around fops_bool each time such
a file is opened which is an unnecessary waste of resources.

Implement file lifetime management for the fops_bool file_operations.
Namely, make debugfs_read_file_bool() and debugfs_write_file_bool() safe
against file removals by means of debugfs_use_file_start() and
debugfs_use_file_finish().

Make debugfs_create_bool() create its files in non-proxying operation mode
through debugfs_create_mode_unsafe().

Finally, purge debugfs_create_mode() as debugfs_create_bool() had been its
last user.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
4909f16810 debugfs: unproxify integer attribute files
Currently, the struct file_operations associated with the integer attribute
style files created through the debugfs_create_*() helpers are not file
lifetime aware as they are defined by means of DEFINE_SIMPLE_ATTRIBUTE().

Thus, a lifetime managing proxy is created around the original fops each
time such a file is opened which is an unnecessary waste of resources.

Migrate all usages of DEFINE_SIMPLE_ATTRIBUTE() within debugfs itself
to DEFINE_DEBUGFS_ATTRIBUTE() in order to implement file lifetime managing
within the struct file_operations thus defined.

Introduce the debugfs_create_mode_unsafe() helper, analogous to
debugfs_create_mode(), but distinct in that it creates the files in
non-proxying operation mode through debugfs_create_file_unsafe().

Feed all struct file_operations migrated to DEFINE_DEBUGFS_ATTRIBUTE()
into debugfs_create_mode_unsafe() instead of former debugfs_create_mode().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
c646880814 debugfs: add support for self-protecting attribute file fops
In order to protect them against file removal issues, debugfs_create_file()
creates a lifetime managing proxy around each struct file_operations
handed in.

In cases where this struct file_operations is able to manage file lifetime
by itself already, the proxy created by debugfs is a waste of resources.

The most common class of struct file_operations given to debugfs are those
defined by means of the DEFINE_SIMPLE_ATTRIBUTE() macro.

Introduce a DEFINE_DEBUGFS_ATTRIBUTE() macro to allow any
struct file_operations of this class to be easily made file lifetime aware
and thus, to be operated unproxied.

Specifically, introduce debugfs_attr_read() and debugfs_attr_write()
which wrap simple_attr_read() and simple_attr_write() under the protection
of a debugfs_use_file_start()/debugfs_use_file_finish() pair.

Make DEFINE_DEBUGFS_ATTRIBUTE() set the defined struct file_operations'
->read() and ->write() members to these wrappers.

Export debugfs_create_file_unsafe() in order to allow debugfs users to
create their files in non-proxying operation mode.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
49d200deaa debugfs: prevent access to removed files' private data
Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
still be attempted to access associated private file data through
previously opened struct file objects. If that data has been freed by
the caller of debugfs_remove*() in the meanwhile, the reading/writing
process would either encounter a fault or, if the memory address in
question has been reassigned again, unrelated data structures could get
overwritten.

However, since debugfs files are seldomly removed, usually from module
exit handlers only, the impact is very low.

Currently, there are ~1000 call sites of debugfs_create_file() spread
throughout the whole tree and touching all of those struct file_operations
in order to make them file removal aware by means of checking the result of
debugfs_use_file_start() from within their methods is unfeasible.

Instead, wrap the struct file_operations by a lifetime managing proxy at
file open:
- In debugfs_create_file(), the original fops handed in has got stashed
  away in ->d_fsdata already.
- In debugfs_create_file(), install a proxy file_operations factory,
  debugfs_full_proxy_file_operations, at ->i_fop.

This proxy factory has got an ->open() method only. It carries out some
lifetime checks and if successful, dynamically allocates and sets up a new
struct file_operations proxy at ->f_op. Afterwards, it forwards to the
->open() of the original struct file_operations in ->d_fsdata, if any.

The dynamically set up proxy at ->f_op has got a lifetime managing wrapper
set for each of the methods defined in the original struct file_operations
in ->d_fsdata.

Its ->release()er frees the proxy again and forwards to the original
->release(), if any.

In order not to mislead the VFS layer, it is strictly necessary to leave
those fields blank in the proxy that have been NULL in the original
struct file_operations also, i.e. aren't supported. This is why there is a
need for dynamically allocated proxies. The choice made not to allocate a
proxy instance for every dentry at file creation, but for every
struct file object instantiated thereof is justified by the expected usage
pattern of debugfs, namely that in general very few files get opened more
than once at a time.

The wrapper methods set in the struct file_operations implement lifetime
managing by means of the SRCU protection facilities already in place for
debugfs:
They set up a SRCU read side critical section and check whether the dentry
is still alive by means of debugfs_use_file_start(). If so, they forward
the call to the original struct file_operation stored in ->d_fsdata, still
under the protection of the SRCU read side critical section.
This SRCU read side critical section prevents any pending debugfs_remove()
and friends to return to their callers. Since a file's private data must
only be freed after the return of debugfs_remove(), the ongoing proxied
call is guarded against any file removal race.

If, on the other hand, the initial call to debugfs_use_file_start() detects
that the dentry is dead, the wrapper simply returns -EIO and does not
forward the call. Note that the ->poll() wrapper is special in that its
signature does not allow for the return of arbitrary -EXXX values and thus,
POLLHUP is returned here.

In order not to pollute debugfs with wrapper definitions that aren't ever
needed, I chose not to define a wrapper for every struct file_operations
method possible. Instead, a wrapper is defined only for the subset of
methods which are actually set by any debugfs users.
Currently, these are:

  ->llseek()
  ->read()
  ->write()
  ->unlocked_ioctl()
  ->poll()

The ->release() wrapper is special in that it does not protect the original
->release() in any way from dead files in order not to leak resources.
Thus, any ->release() handed to debugfs must implement file lifetime
management manually, if needed.
For only 33 out of a total of 434 releasers handed in to debugfs, it could
not be verified immediately whether they access data structures that might
have been freed upon a debugfs_remove() return in the meanwhile.

Export debugfs_use_file_start() and debugfs_use_file_finish() in order to
allow any ->release() to manually implement file lifetime management.

For a set of common cases of struct file_operations implemented by the
debugfs_core itself, future patches will incorporate file lifetime
management directly within those in order to allow for their unproxied
operation. Rename the original, non-proxying "debugfs_create_file()" to
"debugfs_create_file_unsafe()" and keep it for future internal use by
debugfs itself. Factor out code common to both into the new
__debugfs_create_file().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange
9fd4dcece4 debugfs: prevent access to possibly dead file_operations at file open
Nothing prevents a dentry found by path lookup before a return of
__debugfs_remove() to actually get opened after that return. Now, after
the return of __debugfs_remove(), there are no guarantees whatsoever
regarding the memory the corresponding inode's file_operations object
had been kept in.

Since __debugfs_remove() is seldomly invoked, usually from module exit
handlers only, the race is hard to trigger and the impact is very low.

A discussion of the problem outlined above as well as a suggested
solution can be found in the (sub-)thread rooted at

  http://lkml.kernel.org/g/20130401203445.GA20862@ZenIV.linux.org.uk
  ("Yet another pipe related oops.")

Basically, Greg KH suggests to introduce an intermediate fops and
Al Viro points out that a pointer to the original ones may be stored in
->d_fsdata.

Follow this line of reasoning:
- Add SRCU as a reverse dependency of DEBUG_FS.
- Introduce a srcu_struct object for the debugfs subsystem.
- In debugfs_create_file(), store a pointer to the original
  file_operations object in ->d_fsdata.
- Make debugfs_remove() and debugfs_remove_recursive() wait for a
  SRCU grace period after the dentry has been delete()'d and before they
  return to their callers.
- Introduce an intermediate file_operations object named
  "debugfs_open_proxy_file_operations". It's ->open() functions checks,
  under the protection of a SRCU read lock, whether the dentry is still
  alive, i.e. has not been d_delete()'d and if so, tries to acquire a
  reference on the owning module.
  On success, it sets the file object's ->f_op to the original
  file_operations and forwards the ongoing open() call to the original
  ->open().
- For clarity, rename the former debugfs_file_operations to
  debugfs_noop_file_operations -- they are in no way canonical.

The choice of SRCU over "normal" RCU is justified by the fact, that the
former may also be used to protect ->i_private data from going away
during the execution of a file's readers and writers which may (and do)
sleep.

Finally, introduce the fs/debugfs/internal.h header containing some
declarations internal to the debugfs implementation.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Jaegeuk Kim
b32e4482aa fscrypto: don't let data integrity writebacks fail with ENOMEM
This patch fixes the issue introduced by the ext4 crypto fix in a same manner.
For F2FS, however, we flush the pending IOs and wait for a while to acquire free
memory.

Fixes: c9af28fdd4 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12 10:25:30 -07:00
Jaegeuk Kim
33b1395124 f2fs: use dget_parent and file_dentry in f2fs_file_open
This patch synced with the below two ext4 crypto fixes together.

In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes
overlayfs. To fix, now we need to use file_dentry() to access that field.

Fixes: c0a37d4878 ("ext4: use file_dentry()")
Fixes: 9dd78d8c9a ("ext4: use dget_parent() in ext4_file_open()")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12 10:24:22 -07:00
Jaegeuk Kim
d7d7535289 fscrypto: use dget_parent() in fscrypt_d_revalidate()
This patch updates fscrypto along with the below ext4 crypto change.

Fixes: 3d43bcfef5 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12 10:24:04 -07:00
Denys Vlasenko
a527b38e14 GFS2: fs/gfs2/glock.c: Deinline do_error, save 1856 bytes
This function compiles to 522 bytes of machine code.

Error paths are not very time critical.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-12 12:39:12 -04:00
David Howells
5ac7eace2d KEYS: Add a facility to restrict new links into a keyring
Add a facility whereby proposed new links to be added to a keyring can be
vetted, permitting them to be rejected if necessary.  This can be used to
block public keys from which the signature cannot be verified or for which
the signature verification fails.  It could also be used to provide
blacklisting.

This affects operations like add_key(), KEYCTL_LINK and KEYCTL_INSTANTIATE.

To this end:

 (1) A function pointer is added to the key struct that, if set, points to
     the vetting function.  This is called as:

	int (*restrict_link)(struct key *keyring,
			     const struct key_type *key_type,
			     unsigned long key_flags,
			     const union key_payload *key_payload),

     where 'keyring' will be the keyring being added to, key_type and
     key_payload will describe the key being added and key_flags[*] can be
     AND'ed with KEY_FLAG_TRUSTED.

     [*] This parameter will be removed in a later patch when
     	 KEY_FLAG_TRUSTED is removed.

     The function should return 0 to allow the link to take place or an
     error (typically -ENOKEY, -ENOPKG or -EKEYREJECTED) to reject the
     link.

     The pointer should not be set directly, but rather should be set
     through keyring_alloc().

     Note that if called during add_key(), preparse is called before this
     method, but a key isn't actually allocated until after this function
     is called.

 (2) KEY_ALLOC_BYPASS_RESTRICTION is added.  This can be passed to
     key_create_or_update() or key_instantiate_and_link() to bypass the
     restriction check.

 (3) KEY_FLAG_TRUSTED_ONLY is removed.  The entire contents of a keyring
     with this restriction emplaced can be considered 'trustworthy' by
     virtue of being in the keyring when that keyring is consulted.

 (4) key_alloc() and keyring_alloc() take an extra argument that will be
     used to set restrict_link in the new key.  This ensures that the
     pointer is set before the key is published, thus preventing a window
     of unrestrictedness.  Normally this argument will be NULL.

 (5) As a temporary affair, keyring_restrict_trusted_only() is added.  It
     should be passed to keyring_alloc() as the extra argument instead of
     setting KEY_FLAG_TRUSTED_ONLY on a keyring.  This will be replaced in
     a later patch with functions that look in the appropriate places for
     authoritative keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-04-11 22:37:37 +01:00
David Howells
dc44b3a09a rxrpc: Differentiate local and remote abort codes in structs
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.

Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).

Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code.  A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
David Howells
2f02f7aea7 afs: Wait for outstanding async calls before closing rxrpc socket
The afs filesystem needs to wait for any outstanding asynchronous calls
(such as FS.GiveUpCallBacks cleaning up the callbacks lodged with a server)
to complete before closing the AF_RXRPC socket when unloading the module.

This may occur if the module is removed too quickly after unmounting all
filesystems.  This will produce an error report that looks like:

	AFS: Assertion failed
	1 == 0 is false
	0x1 == 0x0 is false
	------------[ cut here ]------------
	kernel BUG at ../fs/afs/rxrpc.c:135!
	...
	RIP: 0010:[<ffffffffa004111c>] afs_close_socket+0xec/0x107 [kafs]
	...
	Call Trace:
	 [<ffffffffa004a160>] afs_exit+0x1f/0x57 [kafs]
	 [<ffffffff810c30a0>] SyS_delete_module+0xec/0x17d
	 [<ffffffff81610417>] entry_SYSCALL_64_fastpath+0x12/0x6b

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:34:40 -04:00
Al Viro
ce23e64013 ->getxattr(): pass dentry and inode as separate arguments
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-11 00:48:00 -04:00
Al Viro
b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Linus Torvalds
9f2394c9be Revert "ext4: allow readdir()'s of large empty directories to be interrupted"
This reverts commit 1028b55baf.

It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.

You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterorator where we know the position of the actual failing entry.

I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.

So my track record is not good enough, and the stars will have to align
better before that one gets committed.  And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.

IOW, let's just revert the commit that caused the problem for now.

Reported-by: Greg Thelen <gthelen@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-10 16:52:24 -07:00
Al Viro
79a628d14e reiserfs: switch to generic_{get,set,remove}xattr()
reiserfs_xattr_[sg]et() will fail with -EOPNOTSUPP for V1 inodes anyway,
and all reiserfs instances of ->[sg]et() call it and so does ->set_acl().

Checks for name length in the instances had been bogus; they should've
been "bugger off if it's _exactly_ the prefix" (as generic would
do on its own) and not "bugger off if it's shorter than the prefix" -
that can't happen.

xattr_full_name() is needed to adjust for the fact that generic instances
will skip the prefix in the name passed to ->[gs]et(); reiserfs homegrown
analogues didn't.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 19:31:09 -04:00
Al Viro
5fdccfef48 cifs: kill more bogus checks in ->...xattr() methods
none of that stuff can ever be called for NULL or negative
dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 17:12:03 -04:00
Al Viro
fc64005c93 don't bother with ->d_inode->i_sb - it's always equal to ->d_sb
... and neither can ever be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 17:11:51 -04:00
David S. Miller
ae95d71261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
Linus Torvalds
839a3f7657 Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are bug fixes, including a really old fsync bug, and a few trace
  points to help us track down problems in the quota code"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file/data loss caused by fsync after rename and new inode
  btrfs: Reset IO error counters before start of device replacing
  btrfs: Add qgroup tracing
  Btrfs: don't use src fd for printk
  btrfs: fallback to vmalloc in btrfs_compare_tree
  btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
  btrfs: Output more info for enospc_debug mount option
  Btrfs: fix invalid reference in replace_path
  Btrfs: Improve FL_KEEP_SIZE handling in fallocate
2016-04-09 10:41:34 -07:00
Linus Torvalds
6759212640 Orangefs: cleanups and a strncpy vulnerability fix.
Cleanups:
 
   - remove an unused variable from orangefs_readdir.
   - clean up printk wrapper used for ofs "gossip" debugging.
   - clean up truncate ctime and mtime setting in inode.c
   - remove a useless null check found by coccinelle.
   - optimize some memcpy/memset boilerplate code.
   - remove some useless sanity checks from xattr.c
 
 Fix:
 
   - fix a potential strncpy vulnerability.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJXCAvtAAoJEM9EDqnrzg2+U3kP/RcypLnTdnOLFI7GtTr7erpc
 4ys4UE3CdvOdIkDNgTeW+qTvy/b7qo3JBdaRqMAAmnWkxWn4cGXDGLhmfKD/6a3Q
 LL9vKxiczYN8gAYgnxoF5rNwoVScFVD7PospKkWMPedkRu00K0dqiJvomPXsvwM7
 MEl4ueewq3KMrXHu1GYSdoZhU8BWtku789Oa5leXRqEO2pRit0amX/qxB9KG6Urz
 Ebz0yCGMPP61nTN9A/Q7IM2imcJO5F0wc5uaSaWimjHeHb040ytMVqFa4GOh4Pky
 oQjjw5OYSsX/O7Hh66wKQ68YqYb762OKE1y5v8K43vRhWQtnQo9pdKpFVEK0An8Z
 wOSaUFd5mhPD2KENcKB/kSiH3eOyGdyD2QamptLJ6Opl/6UdPQMt++1EkuSuYEdW
 wnCFsJM5yam3Ot+REnQiYAVjLsDZ0XhPfNIuAp+d4LIV32JGTQPOBurKECwsJbj5
 fK0lsBNk7b8qgBwG41JjV7XyGlf5HWT2TwpuC2S5PysVXcxsvdfR3oVeUnzdL6aB
 fv0mp7bBqg2lNLKoeEYlxb+vbSiLutOCIPWSNptYjbCWa/q/bqnZMhXOUQjS5XRj
 HIG/91XeRUxOZk+y1gs5N5F8pCT9mfWRTvSCX5ex5x17rUFto/jhn2mQLAhxSyPD
 /AGQgNGz2VmmVBlaCngI
 =ZRlK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs fixes from Mike Marshall:
 "Orangefs cleanups and a strncpy vulnerability fix.

  Cleanups:
   - remove an unused variable from orangefs_readdir.
   - clean up printk wrapper used for ofs "gossip" debugging.
   - clean up truncate ctime and mtime setting in inode.c
   - remove a useless null check found by coccinelle.
   - optimize some memcpy/memset boilerplate code.
   - remove some useless sanity checks from xattr.c

  Fix:
   - fix a potential strncpy vulnerability"

* tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: remove unused variable
  orangefs: Add KERN_<LEVEL> to gossip_<level> macros
  orangefs: strncpy -> strscpy
  orangefs: clean up truncate ctime and mtime setting
  Orangefs: fix ifnullfree.cocci warnings
  Orangefs: optimize boilerplate code.
  Orangefs: xattr.c cleanup
2016-04-09 10:33:58 -07:00
Martin Brandenburg
e56f498142 orangefs: remove unused variable
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 15:50:44 -04:00
Joe Perches
1917a69328 orangefs: Add KERN_<LEVEL> to gossip_<level> macros
Emit the logging messages at the appropriate levels.

Miscellanea:

o Change format to fmt
o Use the more common ##__VA_ARGS__

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:10:45 -04:00
Martin Brandenburg
2eacea74cc orangefs: strncpy -> strscpy
It would have been possible for a rogue client-core to send in a symlink
target which is not NUL terminated. This returns EIO if the client-core
gives us corrupt data.

Leave debugfs and superblock code as is for now.

Other dcache.c and namei.c strncpy instances are safe because
ORANGEFS_NAME_MAX = NAME_MAX + 1; there is always enough space for a
name plus a NUL byte.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:10:34 -04:00
Martin Brandenburg
f83140c146 orangefs: clean up truncate ctime and mtime setting
The ctime and mtime are always updated on a successful ftruncate and
only updated on a successful truncate where the size changed.

We handle the ``if the size changed'' bit.

This matches FUSE's behavior.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:10:31 -04:00
kbuild test robot
2fa37fd713 Orangefs: fix ifnullfree.cocci warnings
fs/orangefs/orangefs-debugfs.c:130:2-26: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

 NULL check before some freeing functions is not needed.

 Based on checkpatch warning
 "kfree(NULL) is safe this check is probably not required"
 and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:08:38 -04:00
Mike Marshall
a9bb3ba81f Orangefs: optimize boilerplate code.
Suggested by David Binderman <dcb314@hotmail.com>
The former can potentially be a performance win over the latter.

memcpy(d, s, len);
memset(d+len, c, size-len);

memset(d, c, size);
memcpy(d, s, len);

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:08:27 -04:00
Mike Marshall
2d09a2ca6a Orangefs: xattr.c cleanup
1. It is nonsense to test for negative size_t, suggested by
   David Binderman <dcb314@hotmail.com>

2. By the time Orangefs gets called, the vfs has ensured that
   name != NULL, and that buffer and size are sane.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-04-08 14:08:27 -04:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Filipe Manana
56f23fdbb6 Btrfs: fix file/data loss caused by fsync after rename and new inode
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.

Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:

   # Scenario 1

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir -p /mnt/a/x
   echo "hello" > /mnt/a/x/foo
   echo "world" > /mnt/a/x/bar
   sync
   mv /mnt/a/x /mnt/a/y
   mkdir /mnt/a/x
   xfs_io -c fsync /mnt/a/x
   <power failure happens>

   The next time the fs is mounted, log tree replay happens and
   the directory "y" does not exist nor do the files "foo" and
   "bar" exist anywhere (neither in "y" nor in "x", nor the root
   nor anywhere).

   # Scenario 2

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/a
   echo "hello" > /mnt/a/foo
   sync
   mv /mnt/a/foo /mnt/a/bar
   echo "world" > /mnt/a/foo
   xfs_io -c fsync /mnt/a/foo
   <power failure happens>

   The next time the fs is mounted, log tree replay happens and the
   file "bar" does not exists anymore. A file with the name "foo"
   exists and it matches the second file we created.

Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:

   mkfs.btrfs -f /dev/sdc
   mount /dev/sdc /mnt
   mkdir /mnt/testdir
   btrfs subvolume snapshot /mnt /mnt/testdir/snap
   btrfs subvolume delete /mnt/testdir/snap
   rmdir /mnt/testdir
   mkdir /mnt/testdir
   xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
   <power failure>

   The next time the fs is mounted the log replay procedure fails because
   it attempts to delete the snapshot entry (which has dir item key type
   of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
   resulting in the following error that causes mount to fail:

   [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
   [52174.512570] ------------[ cut here ]------------
   [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
   [52174.514681] BTRFS: Transaction aborted (error -2)
   [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
   [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G        W       4.5.0-rc6-btrfs-next-27+ #1
   [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
   [52174.524053]  0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
   [52174.524053]  0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
   [52174.524053]  00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
   [52174.524053] Call Trace:
   [52174.524053]  [<ffffffff81264e93>] dump_stack+0x67/0x90
   [52174.524053]  [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
   [52174.524053]  [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
   [52174.524053]  [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
   [52174.524053]  [<ffffffff8118f5e9>] ? iput+0xb0/0x284
   [52174.524053]  [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
   [52174.524053]  [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
   [52174.524053]  [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
   [52174.524053]  [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
   [52174.524053]  [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
   [52174.524053]  [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
   [52174.524053]  [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
   [52174.524053]  [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
   [52174.524053]  [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
   [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
   [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
   [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
   [52174.524053]  [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
   [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
   [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
   [52174.524053]  [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
   [52174.524053]  [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
   [52174.524053]  [<ffffffff81195c65>] SyS_mount+0x77/0x9f
   [52174.524053]  [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
   [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---

Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).

Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:

  * fstests: generic test for fsync after renaming directory
    https://patchwork.kernel.org/patch/8694281/

  * fstests: generic test for fsync after renaming file
    https://patchwork.kernel.org/patch/8694301/

  * fstests: add btrfs test for fsync after snapshot deletion
    https://patchwork.kernel.org/patch/8670671/

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-04-06 17:01:44 -07:00
Abhi Das
611526756a gfs2: Use gfs2 wrapper to sync inode before calling generic_file_splice_read()
gfs2_file_splice_read() f_op grabs and releases the cluster-wide
inode glock to sync the inode size to the latest.

Without this, generic_file_splice_read() uses an older i_size value
and can return EOF for valid offsets in the inode.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-05 12:06:15 -04:00
Bob Peterson
73cc86252b GFS2: Get rid of dead code in inode_go_demote_ok
Function inode_go_demote_ok had some code that was only executed
if gl_holders was not empty. However, if gl_holders was not empty,
the only caller, demote_ok(), returns before inode_go_demote_ok
would ever be called. Therefore, it's dead code, so I removed it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2016-04-05 11:59:18 -04:00
Bob Copeland
8f6fd83c6c rhashtable: accept GFP flags in rhashtable_walk_init
In certain cases, the 802.11 mesh pathtable code wants to
iterate over all of the entries in the forwarding table from
the receive path, which is inside an RCU read-side critical
section.  Enable walks inside atomic sections by allowing
GFP_ATOMIC allocations for the walker state.

Change all existing callsites to pass in GFP_KERNEL.

Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Bob Copeland <me@bobcopeland.com>
[also adjust gfs2/glock.c and rhashtable tests]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05 10:56:32 +02:00
Linus Torvalds
e865f4965f Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fixes from Jan Kara:
 "Fixes for oopses when the new quotactl gets used with quotas disabled"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas
  quota: Handle Q_GETNEXTQUOTA when quota is disabled
2016-04-04 13:18:27 -07:00
Linus Torvalds
c7e82c6485 Merge tag 'f2fs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim.

* tag 'f2fs-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: retrieve IO write stat from the right place
  f2fs crypto: fix corrupted symlink in encrypted case
  f2fs: cover large section in sanity check of super
2016-04-04 13:00:39 -07:00
Linus Torvalds
4a2d057e4f Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
 "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
  ago with promise that one day it will be possible to implement page
  cache with bigger chunks than PAGE_SIZE.

  This promise never materialized.  And unlikely will.

  Let's stop pretending that pages in page cache are special.  They are
  not.

  The first patch with most changes has been done with coccinelle.  The
  second is manual fixups on top.

  The third patch removes macros definition"

[ I was planning to apply this just before rc2, but then I spaced out,
  so here it is right _after_ rc2 instead.

  As Kirill suggested as a possibility, I could have decided to only
  merge the first two patches, and leave the old interfaces for
  compatibility, but I'd rather get it all done and any out-of-tree
  modules and patches can trivially do the converstion while still also
  working with older kernels, so there is little reason to try to
  maintain the redundant legacy model.    - Linus ]

* PAGE_CACHE_SIZE-removal:
  mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
  mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
  mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04 10:50:24 -07:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Yauhen Kharuzhy
7ccefb98ce btrfs: Reset IO error counters before start of device replacing
If device replace entry was found on disk at mounting and its num_write_errors
stats counter has non-NULL value, then replace operation will never be
finished and -EIO error will be reported by btrfs_scrub_dev() because
this counter is never reset.

 # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs
 # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/
 ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error

Reset num_write_errors and num_uncorrectable_read_errors counters in the
dev_replace structure before start of replacing.

Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Mark Fasheh
0f5dcf8de9 btrfs: Add qgroup tracing
This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extents) and the accounting side. Taken together it allows us
to see what qgroup operations have happened, and what their result was.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Josef Bacik
c79b471330 Btrfs: don't use src fd for printk
The fd we pass in may not be on a btrfs file system, so don't try to do
BTRFS_I() on it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
David Sterba
8f282f71ea btrfs: fallback to vmalloc in btrfs_compare_tree
The allocation of node could fail if the memory is too fragmented for a
given node size, practically observed with 64k.

http://article.gmane.org/gmane.comp.file-systems.btrfs/54689

Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Mark Fasheh
918c2ee103 btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
create_pending_snapshot() will go readonly on _any_ error return from
btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by
just making a snapshot and asking it to inherit from an invalid qgroup. For
example:

$ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo

Will cause a transaction abort.

Fix this by only throwing errors in btrfs_qgroup_inherit() when we know
going readonly is acceptable.

The following xfstests test case reproduces this bug:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
  	cd /
  	rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # remove previous $seqres.full before test
  rm -f $seqres.full

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs
  _scratch_mount
  _run_btrfs_util_prog quota enable $SCRATCH_MNT
  # The qgroup '1/10' does not exist and should be silently ignored
  _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1

  _scratch_unmount

  echo "Silence is golden"

  status=0
  exit

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Qu Wenruo
0305bc2793 btrfs: Output more info for enospc_debug mount option
As one user in mail list report reproducible balance ENOSPC error, it's
better to add more debug info for enospc_debug mount option.

Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:22 +02:00
Liu Bo
264813acb1 Btrfs: fix invalid reference in replace_path
Dan Carpenter's static checker has found this error, it's introduced by
commit 64c043de46
("Btrfs: fix up read_tree_block to return proper error")

It's really supposed to 'break' the loop on error like others.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:29:18 +02:00
Davide Italiano
2a162ce932 Btrfs: Improve FL_KEEP_SIZE handling in fallocate
- We call inode_size_ok() only if FL_KEEP_SIZE isn't specified.
- As an optimisation we can skip the call if (off + len)
  isn't greater than the current size of the file. This operation
  is called under the lock so the less work we do, the better.
- If we call inode_size_ok() pass to it the correct value rather
  than a more conservative estimation.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-04 16:25:28 +02:00
Al Viro
7367ab9157 Merge branch 'for-linus' into work.iov_iter 2016-04-03 19:53:20 -04:00
Al Viro
03cc0789a6 do_splice_to(): cap the size before passing to ->splice_read()
pipe capacity won't exceed 2G anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-03 19:52:59 -04:00
Al Viro
bc61384dcd rw_verify_area(): saner calling conventions
Lift length capping into the few callers that care about it.  Most of
them treat all non-negatives as "success" and ignore the capped value,
and with good reasons.

Make rw_verify_area() return 0 on success.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-03 19:51:39 -04:00
Al Viro
2958ec177e aio: remove a pointless assignment
the value is never used after that point

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-03 19:51:33 -04:00
Theodore Ts'o
c325a67c72 ext4: ignore quota mount options if the quota feature is enabled
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present.  This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified.  This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-03 17:03:37 -04:00
Dan Carpenter
4762cc3fbb ext4 crypto: fix some error handling
We should be testing for -ENOMEM but the minus sign is missing.

Fixes: c9af28fdd4 ('ext4 crypto: don't let data integrity writebacks fail with ENOMEM')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-02 18:13:38 -04:00
Linus Torvalds
82d2a348bb Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has a few fixes Dave Sterba had queued up.  These are all pretty
  small, but since they were tested I decided against waiting for more"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: transaction_kthread() is not freezable
  btrfs: cleaner_kthread() doesn't need explicit freeze
  btrfs: do not write corrupted metadata blocks to disk
  btrfs: csum_tree_block: return proper errno value
2016-04-01 18:08:34 -05:00
Linus Torvalds
22fed39775 Two bugfixes for OrangeFS.
One is a reference counting bug and the other is a typo in client
 minimum version.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW/s+dAAoJEMyISHmv95KnmvwQALUnkjCgbe8d4621VfqX4yhS
 EEG6UtjaIw1C/lH0KS9dyw5yJwJmzEh5HBaLwxN48Gw75s+MznhCEy6ocBdqfblj
 3DGfIv2/gmlsAxn/wdmlAmjRu/UdQj4m/QrmjrE430XScieo+onaTAFojFa2z/a3
 Ux0mxSozog1b+XzTrXKu+64QNpasttlPYvpqV8nq/WBTsJBZgG8JjNnorhbZ0QH1
 PpgA3iJcOYtHiCtjwKLbWk//iGvdDJaiUvSR/GWLUPiesM4JkZ3eUUtpfHCaMEC7
 jZe5WeVMHXN8Vz/KBeE1TCU9n7MHtCSjri7CIjz/UzaXgnt+6JaUORfEzHr/1a5M
 Rtgp88VlIC23jUDgSQd6ncXr6DECCNUdQu63xeZB1aZg5GmkxhXRIzfj8if990hh
 o8J42X/D1q4iqMf85aPoie0mCIa0TcnEU8xJcDPSnkKMCrzRiu++xuLiNTfE4k0E
 FxQYylvBhNDoiwOREqw4SYcLYqxUhpAtN/qVLVtHwyXYwjIiHcl3ZJOD3EuKtk2D
 qwoNAzlCtgHDECbbr520en/kVdgI4zhKs5a3ufSwQm/wMVi02tw+HT+HeJyfJVjV
 e3WfwLsEM1f0fUizAJjosaXMwiE3bM7xor1kDAZYt1MQE+IdTE0LorC0ONVPTts+
 9kXoy9L/lvOhAJnpR4+Z
 =MLba
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://github.com/martinbrandenburg/linux

Pull OrangeFS fixes from Martin Brandenburg:
 "Two bugfixes for OrangeFS.

  One is a reference counting bug and the other is a typo in client
  minimum version"

* tag 'for-linus' of git://github.com/martinbrandenburg/linux:
  orangefs: minimum userspace version is 2.9.3
  orangefs: don't put readdir slot twice
2016-04-01 17:17:32 -05:00
Theodore Ts'o
8f0e8746b4 ext4: avoid calling dquot_get_next_id() if quota is not enabled
This should be fixed in the quota layer so we can test with the quota
mutex held, but for now, we need this to avoid tests from crashing the
kernel aborting the regression test suite.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-01 12:00:03 -04:00
Jan Kara
e84dfbe2bf ext4: retry block allocation for failed DIO and DAX writes
Currently if block allocation for DIO or DAX write fails due to ENOSPC,
we just returned it to userspace. However these ENOSPC errors can be
transient because the transaction freeing blocks has not yet committed.
This demonstrates as failures of generic/102 test when the filesystem is
mounted with 'dax' mount option.

Fix the problem by properly retrying the allocation in case of ENOSPC
error in get blocks functions used for direct IO.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
2016-04-01 02:07:22 -04:00
Theodore Ts'o
daf647d2dd ext4: add lockdep annotations for i_data_sem
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode.  This results in this complaint:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ei->i_data_sem);
                                lock(&s->s_dquot.dqio_mutex);
                                lock(&ei->i_data_sem);
   lock(&s->s_dquot.dqio_mutex);

Google-Bug-Id: 27907753

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-04-01 01:31:28 -04:00
Martin Brandenburg
878dfd3210 orangefs: minimum userspace version is 2.9.3
Version 2.9.4 isn't even released yet.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2016-03-31 12:06:00 -04:00