Commit Graph

3796 Commits

Author SHA1 Message Date
Guoqing Jiang
05cd0e5176 md-cluster: split recover_slot for future code reuse
Make recover_slot as a wraper to __recover_slot, since the
logic of __recover_slot can be reused for the condition
when other nodes need to take over the resync job.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:41 +02:00
Guoqing Jiang
b89f704a8d md-cluster: use %pU to print UUIDs
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:40:30 +02:00
Sasha Levin
25b2edfa3b md: setup safemode_timer before it's being used
We used to set up the safemode_timer timer in md_run. If md_run
would fail before the timer was set up we'd end up trying to modify
a timer that doesn't have a callback function when we access safe_delay_store,
which would trigger a BUG.

neilb: delete init_timer() call as setup_timer() does that.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:39:39 +02:00
NeilBrown
6cbd81487f md/raid5: handle possible race as reshape completes.
It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.

This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.

This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun.  But assuming MaxSector is a valid address only
leads to sorrow.

So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:59 +02:00
NeilBrown
5ed1df2eac md: sync sync_completed has correct value as recovery finishes.
There can be a small window between the moment that recovery
actually writes the last block and the time when various sysfs
and /proc/mdstat attributes report that it has finished.
During this time, 'sync_completed' can have the wrong value.
This can confuse monitoring software.

So:
 - don't set curr_resync_completed beyond the end of the devices,
 - set it correctly when resync/recovery has completed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:38:17 +02:00
NeilBrown
c5e19d906a md: be careful when testing resync_max against curr_resync_completed.
While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
   curr_resync_completed > resync_max

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:33 +02:00
NeilBrown
a4a3d26d87 md: set MD_RECOVERY_RECOVER when starting a degraded array.
This ensures that 'sync_action' will show 'recover' immediately the
array is started.  If there is no spare the status will change to
'idle' once that is detected.

Clear MD_RECOVERY_RECOVER for a read-only array to ensure this change
happens.

This allows scripts which monitor status not to get confused -
particularly my test scripts.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:37:03 +02:00
NeilBrown
c74c0d760e md/raid5: remove incorrect "min_t()" when calculating writepos.
This code is calculating:
  writepos, which is the furthest along address (device-space) that we
     *will* be writing to
  readpos, which is the earliest address that we *could* possible read
     from, and
  safepos, which is the earliest address in the 'old' section that we
     might read from after a crash when the reshape position is
     recovered from metadata.

  The first is a precise calculation, so clipping at zero doesn't
  make sense.  As the reshape position is now guaranteed to always be
  a multiple of reshape_sectors and as we already BUG_ON when
  reshape_progress is zero, there is no point in this min_t() call.

  The readpos and safepos are worst case - actual value depends on
  precise geometry.  That worst case could be negative, which is only
  a problem because we are storing the value in an unsigned.
  So leave the min_t() for those.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:36:06 +02:00
NeilBrown
05256d9884 md/raid5: strengthen check on reshape_position at run.
When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time.  So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.

This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:34:21 +02:00
NeilBrown
3cb5edf454 md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible
The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken.  So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.

So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.

This could result in corruption if new values were written at awkward
times.

Also use min() or max() rather than open-coding.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:48 +02:00
NeilBrown
5cac6bcb93 md/raid5: always set conf->prev_chunk_sectors and ->prev_algo
These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:25 +02:00
NeilBrown
02ec50265b md/raid10: fix a few typos in comments
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:32:09 +02:00
NeilBrown
92140480ed md/raid5: consider updating reshape_position at start of reshape.
md/raid5 only updates ->reshape_position (which is stored in
metadata and is authoritative) occasionally, but particularly
when getting closed to ->resync_max as it must be correct
when ->resync_max is reached.

When mdadm tries to stop an array which is reshaping it will:
 - freeze the reshape,
 - set resync_max to where the reshape has reached.
 - unfreeze the reshape.
When this happens, the reshape is aborted and then restarted.

The restart doesn't check that resync_max is close, and so doesn't
update ->reshape_position like it should.
This results in the reshape stopping, but ->reshape_position being
incorrect.

So on that first call to reshape_request, make sure ->reshape_position
is updated if needed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:31:20 +02:00
NeilBrown
985ca973b6 md: close some races between setting and checking sync_action.
When checking sync_action in a script, we want to be sure it is
as accurate as possible.
As resync/reshape etc doesn't always start immediately (a separate
thread is scheduled to do it), it is best if 'action_show'
checks if MD_RECOVER_NEEDED is set (which it does) and in that
case reports what is likely to start soon (which it only sometimes
does).

So:
 - report 'reshape' if reshape_position suggests one might start.
 - set MD_RECOVERY_RECOVER in raid1_reshape(), because that is very
   likely to happen next.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:30:40 +02:00
NeilBrown
f7851be736 md: Keep /proc/mdstat reporting recovery until fully DONE.
Currently when a recovery completes, mdstat shows that it has finished
before the new device is marked as a full member.  Because of this it
can appear to a script that the recovery finished but the array isn't
in sync.

So while MD_RECOVERY_DONE is still set, keep mdstat reporting "recovery".
Once md_reap_sync_thread() completes, the spare will be active and then
MD_RECOVERY_DONE will be cleared.

To ensure this is race-free, set MD_RECOVERY_DONE before clearning
curr_resync.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31 19:29:09 +02:00
NeilBrown
199dc6ed51 md/raid0: update queue parameter in a safer location.
When a (e.g.) RAID5 array is reshaped to RAID0, the updating
of queue parameters (e.g. max number of sectors per bio) is
done in the wrong place.
It should be part of ->run, but it is actually part of ->takeover.
This means it happens before level_store() calls:

	blk_set_stacking_limits(&mddev->queue->limits);

and so it ineffective.  This can lead to errors from underlying
devices.

So move all the relevant settings out of create_stripe_zones()
and into raid0_run().

As this can lead to a bug-on it is suitable for any -stable
kernel which supports reshape to RAID0.  So 2.6.35 or later.
As the bug has been present for five years there is no urgency,
so no need to rush into -stable.

Fixes: 9af204cf72 ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover")
Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release).
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
Benjamin Randazzo
25eafe1a81 md: simplify get_bitmap_file now that "file" is zeroed.
There is no point assigning '\0' to file->pathname[0] as
file is now zeroed out, so remove that branch and
simplify the code.

[Original patch combined this with the change to use
 kzalloc.  I split the two so that the change to kzalloc
 is easier to backport. - neilb]

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
NeilBrown
49895bcc7e md/raid5: don't let shrink_slab shrink too far.
I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.

This should not be allowed.

So add a test to keep max_nr_stripes above min_nr_stripes.

Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.


Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b66)
Reported-by: Tomas Papan <tomas.papan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:10:56 +10:00
Benjamin Randazzo
b6878d9e03 md: use kzalloc() when bitmap is disabled
In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".

5769         file = kmalloc(sizeof(*file), GFP_NOIO);
5770         if (!file)
5771                 return -ENOMEM;

This structure is copied to user space at the end of the function.

5786         if (err == 0 &&
5787             copy_to_user(arg, file, sizeof(*file)))
5788                 err = -EFAULT

But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.

5775         /* bitmap disabled, zero the first byte and copy out */
5776         if (!mddev->bitmap_info.file)
5777                 file->pathname[0] = '\0';

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 14:56:02 +10:00
NeilBrown
423f04d63c md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies
raid1_end_read_request() assumes that the In_sync bits are consistent
with the ->degaded count.
raid1_spare_active updates the In_sync bit before the ->degraded count
and so exposes an inconsistency, as does error()
So extend the spinlock in raid1_spare_active() and error() to hide those
inconsistencies.

This should probably be part of
  Commit: 34cab6f420 ("md/raid1: fix test for 'was read error from
  last working device'.")
as it addresses the same issue.  It fixes the same bug and should go
to -stable for same reasons.

Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 12:29:42 +10:00
Mike Snitzer
795e633a2d dm cache: fix device destroy hang due to improper prealloc_used accounting
Commit 665022d72f ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:

  [<ffffffff81651457>] schedule+0x37/0x80
  [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
  [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
  [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
  [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
  [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
  [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
  [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
  [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
  [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
  [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
  [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
  [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
  [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff811fd689>] SyS_ioctl+0x79/0x90
  [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
  [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after.  This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:09 -04:00
Mike Snitzer
3508e6590d Revert "dm cache: do not wake_worker() in free_migration()"
This reverts commit 386cb7cdee.

Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.

Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.

Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:08 -04:00
Baruch Siach
6ed443c07f dm crypt: update wiki page URL
Cryptsetup moved to gitlab.  This is a leftover from commit e44f23b32d
(dm crypt: update URLs to new cryptsetup project page, 2015-04-05).

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:16 -04:00
Colin Ian King
134bf30c06 dm cache policy smq: fix alloc_bitset check that always evaluates as false
static analysis by cppcheck has found a check on alloc_bitset that
always evaluates as false and hence never finds an allocation failure:

[drivers/md/dm-cache-policy-smq.c:1689]: (warning) Logical conjunction
  always evaluates to false: !EXPR && EXPR.

Fix this by removing the incorrect mq->cache_hit_bits check

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:15 -04:00
Mike Snitzer
0a927c2f02 dm thin: return -ENOSPC when erroring retry list due to out of data space
Otherwise -EIO would be returned when -ENOSPC should be used
consistently.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-26 17:39:19 -04:00
Linus Torvalds
aca105a697 Some md fixes for 4.2
Several are tagged for -stable.
 A few aren't because they are not very, serious or
 because they are in the 'experimental' cluster code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY
 DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U
 9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9
 3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih
 6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E
 xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF
 Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX
 LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw
 CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa
 Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ
 0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2
 YTz6qKpYPgB1VzK2PoBG
 =6ZCK
 -----END PGP SIGNATURE-----

Merge tag 'md/4.2-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Some md fixes for 4.2

  Several are tagged for -stable.
  A few aren't because they are not very, serious or because they are in
  the 'experimental' cluster code"

* tag 'md/4.2-fixes' of git://neil.brown.name/md:
  md/raid5: clear R5_NeedReplace when no longer needed.
  Fix read-balancing during node failure
  md-cluster: fix bitmap sub-offset in bitmap_read_sb
  md: Return error if request_module fails and returns positive value
  md: Skip cluster setup in case of error while reading bitmap
  md/raid1: fix test for 'was read error from last working device'.
  md: Skip cluster setup for dm-raid
  md: flush ->event_work before stopping array.
  md/raid10: always set reshape_safe when initializing reshape_position.
  md/raid5: avoid races when changing cache size.
2015-07-25 11:24:58 -07:00
NeilBrown
e6030cb06c md/raid5: clear R5_NeedReplace when no longer needed.
This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.

So clear it when it isn't need, which includes if the replacement
device has failed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:38:04 +10:00
Goldwyn Rodrigues
90382ed9af Fix read-balancing during node failure
During a node failure, We need to suspend read balancing so that the
reads are directed to the first device and stale data is not read.
Suspending writes is not required because these would be recorded and
synced eventually.

A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
area_resyncing() will respond true for the entire devices if this
flag is set and the request type is READ. The flag is cleared
in recover_done().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reported-By: David Teigland <teigland@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:59 +10:00
Goldwyn Rodrigues
33e38ac688 md-cluster: fix bitmap sub-offset in bitmap_read_sb
bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for
the first bitmap read. However, when multiple bitmaps need to be opened
by the same node, it ends up corrupting the offset. Fix it by using a
local variable.

Also, bitmap_read_sb is not required in bitmap_copy_from_slot since
it is called in bitmap_create. Remove bitmap_read_sb().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:55 +10:00
Goldwyn Rodrigues
b0c26a79d6 md: Return error if request_module fails and returns positive value
request_module() can return 256 (process exited) in some cases,
which is not as specified in the documentation before the
request_module() definition. Convert the error to -ENOENT.

The positive error number results in bitmap_create() returning
a value that is meant to be an error but doesn't look like one,
so it is dereferenced as a point and causes a crash.

(not needed for stable as this is "experimental" code)
Fixes: edb39c9ded ("Introduce md_cluster_operations to handle cluster functions")
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:51 +10:00
Goldwyn Rodrigues
f735727319 md: Skip cluster setup in case of error while reading bitmap
If the bitmap read fails, the error code set is -EINVAL. However,
we don't check for errors and go ahead with cluster_setup.
Skip the cluster setup in case of error.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:48 +10:00
NeilBrown
34cab6f420 md/raid1: fix test for 'was read error from last working device'.
When we get a read error from the last working device, we don't
try to repair it, and don't fail the device.  We simple report a
read error to the caller.

However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device.  However a spare which is rebuilding
would be non-faulty but so not the only working device.

So change the test from "!Faulty" to "In_sync".  If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.

This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:21 +10:00
Goldwyn Rodrigues
d3b178adb3 md: Skip cluster setup for dm-raid
There is a bug that the bitmap superblock isn't initialised properly for
dm-raid, so a new field can have garbage in new fields.
(dm-raid does initialisation in the kernel - md initialised the
 superblock in mdadm).

This means that for dm-raid we cannot currently trust the new ->nodes
field. So:
 - use __GFP_ZERO to initialise the superblock properly for all new
    arrays
 - initialise all fields in bitmap_info in bitmap_new_disk_sb
 - ignore ->nodes for dm arrays (yes, this is a hack)

This bug exposes dm-raid to bug in the (still experimental) md-cluster
code, so it is suitable for -stable.  It does cause crashes.

References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
Cc: stable@vger.kernel.org (v4.1)
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-23 09:22:00 +10:00
NeilBrown
ee5d004fd0 md: flush ->event_work before stopping array.
The 'event_work' worker used by dm-raid may still be running
when the array is stopped.  This can result in an oops.

So flush the workqueue on which it is run after detaching
and before destroying the device.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
Fixes: 9d09e663d5 ("dm: raid456 basic support")
2015-07-22 14:09:29 +10:00
NeilBrown
299b0685e3 md/raid10: always set reshape_safe when initializing reshape_position.
'reshape_position' tracks where in the reshape we have reached.
'reshape_safe' tracks where in the reshape we have safely recorded
in the metadata.

These are compared to determine when to update the metadata.
So it is important that reshape_safe is initialised properly.
Currently it isn't.  When starting a reshape from the beginning
it usually has the correct value by luck.  But when reducing the
number of devices in a RAID10, it has the wrong value and this leads
to the metadata not being updated correctly.
This can lead to corruption if the reshape is not allowed to complete.

This patch is suitable for any -stable kernel which supports RAID10
reshape, which is 3.5 and later.

Fixes: 3ea7daa5d7 ("md/raid10: add reshape support")
Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:08:24 +10:00
NeilBrown
2d5b569b66 md/raid5: avoid races when changing cache size.
Cache size can grow or shrink due to various pressures at
any time.  So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.

So introduce a mutex.  auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.

This bug can result in some stripes not being freed when an
array is stopped.  This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.

Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:04:15 +10:00
Linus Torvalds
3f8476fe89 - Revert a request-based DM core change that caused IO latency to
increase and adversely impact both throughput and system load
 
 - Fix for a use after free bug in DM core's device cleanup
 
 - A couple DM btree removal fixes (used by dm-thinp)
 
 - A DM thinp fix for order-5 allocation failure
 
 - A DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'
 
 - Fix a long-standing oversight in both dm-thinp and dm-cache by
   now exporting 'needs_check' in status if it was set in metadata
 
 - Fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVqUfSAAoJEMUj8QotnQNa+RUH+wXrHCGI6J7RHIXVd5igP9K0
 yFZGEnLZe6Ebt5CACLcKn/qN0g97iwCrlcxFt+1Gj/GbW1GIQzs7vg38La3PZxWZ
 jAkI3JMY816bP1x3VK1HtMsk2gRaE/hh0gxK5pPLB9a+ZdEsz9UML0rs+JseOdn3
 n+454dhwOyChwz7zFEbpn+mfjoruFScGX0Y2qaSHBV/xMhmExpthw9V1yFC2v2tW
 8cAHOMDLNLHhR5adF9YxjZH8wILbyYK9oPy3iGhj/TF/Dx7saWYG4UlnL5xIOLsB
 5WK9gRrJJ/Wf0FsDdN88AaY4Bdpj4esS2JeTZpvujxeBb7ZNeJoCUqyzggURv/c=
 =hCjo
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - revert a request-based DM core change that caused IO latency to
   increase and adversely impact both throughput and system load

 - fix for a use after free bug in DM core's device cleanup

 - a couple DM btree removal fixes (used by dm-thinp)

 - a DM thinp fix for order-5 allocation failure

 - a DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'

 - fix a long-standing oversight in both dm-thinp and dm-cache by now
   exporting 'needs_check' in status if it was set in metadata

 - fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device

* tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: avoid calls to prealloc_free_structs() if possible
  dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
  dm cache: do not wake_worker() in free_migration()
  dm cache: display 'needs_check' in status if it is set
  dm thin: display 'needs_check' in status if it is set
  dm thin: stay in out-of-data-space mode once no_space_timeout expires
  dm: fix use after free crash due to incorrect cleanup sequence
  Revert "dm: only run the queue on completion if congested or no requests pending"
  dm btree: silence lockdep lock inversion in dm_btree_del()
  dm thin: allocate the cell_sort_array dynamically
  dm btree remove: fix bug in redistribute3
2015-07-17 20:53:57 -07:00
Mike Snitzer
665022d72f dm cache: avoid calls to prealloc_free_structs() if possible
If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:08 -04:00
Mike Snitzer
e782eff591 dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:07 -04:00
Mike Snitzer
386cb7cdee dm cache: do not wake_worker() in free_migration()
All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:06 -04:00
Mike Snitzer
255eac2005 dm cache: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the cache status if it is set
in the cache metadata.

Also, update cache documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
e4c78e210d dm thin: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the thin-pool status if it is
set in the thinp metadata.

Also, update thinp documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
bcc696fac1 dm thin: stay in out-of-data-space mode once no_space_timeout expires
This fixes an issue where running out of data space would cause the
thin-pool's metadata to become read-only.  There was no reason to make
metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a
misguided way to error all queued and future write IOs.  We can
accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to
PM_OUT_OF_DATA_SPACE with error_if_no_space enabled.

Otherwise, the use of PM_READ_ONLY could cause a race where commit() was
started before the PM_READ_ONLY transition but dm_pool_commit_metadata()
would go on to fail because the block manager had transitioned to
read-only.  The return of -EPERM from dm_pool_commit_metadata(), due to
attempting to commit while in read-only mode, caused the thin-pool to
set 'needs_check' because a metadata_operation_failed().  This needless
cascade of failures makes life for users more difficult than needed.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:49 -04:00
Mikulas Patocka
b06075a98d dm: fix use after free crash due to incorrect cleanup sequence
Linux 4.2-rc1 Commit 0f20972f7b ("dm: factor out a common
cleanup_mapped_device()") moved a common cleanup code to a separate
function.  Unfortunately, that commit incorrectly changed the order of
cleanup, so that it destroys the mapped_device's srcu structure
'io_barrier' before destroying its workqueue.

The function that is executed on the workqueue (dm_wq_work) uses the srcu
structure, thus it may use it after being freed.  That results in a
crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 0f20972f7b ("dm: factor out a common cleanup_mapped_device()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-13 09:14:11 -04:00
Jens Axboe
77b5a08427 bcache: don't embed 'return' statements in closure macros
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-11 09:57:32 -06:00
Mike Snitzer
621739b00e Revert "dm: only run the queue on completion if congested or no requests pending"
This reverts commit 9a0e609e3f.
(Resolved a conflict during revert due to commit bfebd1cdb4 that came
after)

This revert is motivated by a couple failure reports on request-based DM
multipath testbeds:
1) Netapp reported that their multipath fault injection test under heavy
   IO load can stall longer than 300 seconds.
2) IBM reported elevated lock contention in their testbed (likely due to
   increased back pressure due to IO not being dispatched as quickly):
   https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-07-08 16:16:07 -04:00
Joe Thornber
1c7518794a dm btree: silence lockdep lock inversion in dm_btree_del()
Allocate memory using GFP_NOIO when deleting a btree.  dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-06 10:45:02 -04:00
Joe Thornber
a822c83e47 dm thin: allocate the cell_sort_array dynamically
Given the pool's cell_sort_array holds 8192 pointers it triggers an
order 5 allocation via kmalloc.  This order 5 allocation is prone to
failure as system memory gets more fragmented over time.

Fix this by allocating the cell_sort_array using vmalloc.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 21:59:37 -04:00
Dennis Yang
4c7e309340 dm btree remove: fix bug in redistribute3
redistribute3() shares entries out across 3 nodes.  Some entries were
being moved the wrong way, breaking the ordering.  This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.

For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html

Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 17:43:58 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00