Commit Graph

299893 Commits

Author SHA1 Message Date
NeilBrown
15702d7fb6 md/bitmap: use DIV_ROUND_UP instead of open-code
Also take the opportunity to simplify CHUNK_BLOCK_RATIO.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:25 +10:00
NeilBrown
40cffcc0e8 md/bitmap: create a 'struct bitmap_counts' substructure of 'struct bitmap'
The new "struct bitmap_counts" contains all the fields that are
related to counting the number of active writes in each bitmap chunk.

Having this separate will make it easier to change the chunksize
or overall size of a bitmap atomically.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:24 +10:00
NeilBrown
63c68268b2 md/bitmap: make bitmap bitops atomic.
This allows us to remove spinlock protection which is
more heavy-weight than simple atomics.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:23 +10:00
NeilBrown
bdfd114073 md/bitmap: make _page_attr bitops atomic.
Using e.g. set_bit instead of __set_bit and using test_and_clear_bit
allow us to remove some locking and contract other locked ranges.

It is rare that we set or clear a lot of these bits, so gain should
outweigh any cost.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:22 +10:00
NeilBrown
fae7d326cd md/bitmap: merge bitmap_file_unmap and bitmap_file_put.
There functions really do one thing together: release the
'bitmap_storage'.  So make them just one function.

Since we removed the locking (previous patch), we don't need to zero
any fields before freeing them, so it all becomes a bit simpler.


Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:21 +10:00
NeilBrown
62f82faace md/bitmap: remove async freeing of bitmap file.
There is no real value in freeing things the moment there is an error.
It is just as good to free the bitmap file and pages when the bitmap
is explicitly removed (and replaced?) or at shutdown.

With this gone, the bitmap will only disappear when the array is
quiescent, so we can remove some locking.

As the 'filemap' doesn't disappear now, include extra checks before
trying to write any of it out.
Also remove the check for "has it disappeared" in
bitmap_daemon_write().


Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:21 +10:00
NeilBrown
7466712347 md/bitmap: convert some spin_lock_irqsave to spin_lock_irq
All of these sites can only be called from process context with
irqs enabled, so using irqsave/irqrestore just adds noise.
Remove it.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:19 +10:00
NeilBrown
b405fe91e5 md/bitmap: use set_bit, test_bit, etc for operation on bitmap->flags.
We currently use '&' and '|' which isn't the norm in the kernel
and doesn't allow easy atomicity.
So change to bit numbers and {set,clear,test}_bit.
This allows us to remove a spinlock/unlock (which was dubious anyway)
and some other simplifications.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:15 +10:00
NeilBrown
84e923453e md/bitmap: remove single-bit manipulation on sb->state
Just do single-bit manipulations on bitmap->flags and copy whole
value between that and sb->state.

This will allow next patch which changes how bit manipulations are
performed on bitmap->flags.

This does result in BITMAP_STALE not being set in sb by
bitmap_read_sb, however as the setting is determined by other
information in the 'sb' we do not lose information this way.
Normally, bitmap_load will be called shortly which will clear
BITMAP_STALE anyway.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:14 +10:00
NeilBrown
edbb79df67 md/bitmap: remove bitmap_mask_state
This function isn't really needed.  It sets or clears a flag in both
bitmap->flags and sb->state.
However both times it is called, bitmap_update_sb is called soon
afterwards which copies bitmap->flags to sb->state.
So just make changes to bitmap->flags, and open-code those rather than
hiding in a function.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:13 +10:00
NeilBrown
bc9891a885 md/bitmap: move storage allocation from bitmap_load to bitmap_create.
We should allocate memory for the storage-bitmap at create-time, not
load time.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:12 +10:00
NeilBrown
d1244cb062 md/bitmap: separate bitmap file allocation to its own function.
This will allow allocation before swapping in a new bitmap.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:12 +10:00
NeilBrown
9b1215c102 md/bitmap: store bytes in file rather than just in last page.
This number is more generally useful, and bytes-in-last-page is
easily extracted from it.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:11 +10:00
NeilBrown
1ec885cdd0 md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct.
This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:10 +10:00
NeilBrown
d189122d4b md/bitmap: change *_page_attr() to take a page number, not a page.
Most often we have the page number, not the page.  And that is what
the  *_page_attr() functions really want.  So change the arguments to
take that number.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:09 +10:00
NeilBrown
27581e5ae0 md/bitmap: centralise allocation of bitmap file pages.
Instead of allocating pages in read_sb_page, read_page and
bitmap_read_sb, allocate them all in bitmap_init_from disk.

Also replace the hack of calling "attach_page_buffers(page, NULL)" to
ensure that free_buffer() won't complain, by putting a test for
PagePrivate in free_buffer().

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:08 +10:00
NeilBrown
ef99bf480d md/bitmap: allow a bitmap with no backing storage.
An md bitmap comprises two parts
 - internal counting of active writes per 'chunk'.
 - external storage of whether there are any active writes on
   each chunk

The second requires the first, but the first doesn't require the
second.

Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.

So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.

A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.

We don't actually activate these bitmaps yet - that will come
in a later patch.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:08 +10:00
NeilBrown
6409bb05a9 md/bitmap: add new 'space' attribute for bitmaps.
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.

So create an attribute to store this information and set appropriate
defaults.

It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:07 +10:00
NeilBrown
bf07bb7d5b md/bitmap: disentangle two different 'pending' flags.
There are two different 'pending' concepts in the handling of the
write intent bitmap.

Firstly, a 'page' from the bitmap (which container PAGE_SIZE*8 bits)
may have changes (bits cleared) that should be written in due course.
There is no hurry for these and the page will transition from
PENDING to NEEDWRITE and will then be written, though if it ever
becomes DIRTY it will be written much sooner and PENDING will be
cleared.

Secondly, a page of counters - which contains PAGE_SIZE/2 counters, one
for each bit, can usefully have a 'pending' flag which indicates if
any of the counters are low (2 or 1) and ready to be processed by
bitmap_daemon_work().  If this flag is clear we can skip the whole
page.

These two concepts are currently combined in the bitmap-file flag.
This causes a tighter connection between the counters and the bitmap
file than I would like - as I want to add some flexibility to the
bitmap file.

So introduce a new flag with the page-of-counters, and rewrite
bitmap_daemon_work() so that it handles the two different 'pending'
concepts separately.

This also allows us to clear BITMAP_PAGE_PENDING when we write out
a dirty page, which may occasionally reduce the number of times we
write a page.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:06 +10:00
Shaohua Li
bc0934f047 raid5: support sync request
REQ_SYNC is ignored in current raid5 code. Block layer does use it to do
policy,
for example ioscheduler. This patch adds it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:05 +10:00
Shaohua Li
cceeca43b5 raid5: remove unused variables
The two variables are useless.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:04 +10:00
majianpeng
5fdd2cf826 md/raid10: Fix memleak in r10buf_pool_alloc
If the allocation of rep1_bio fails, we currently don't free the 'bio'
of the same dev.

Reported by kmemleak.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:03 +10:00
majianpeng
da8840a747 md/raid1: allow fix_read_error to read from recovering device.
When attempting to fix a read error, it is acceptable to read from a
device that is recovering, provided the recovery has got past the
place we are reading from.  This makes the test for "can we read from
here" the same as the test in read_balance.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:03 +10:00
NeilBrown
4fa2f32768 md: move freeing of badblocks.page into md_rdev_clear
This ensures that it is always freed - there were case where
we failed to free the page.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:55:01 +10:00
NeilBrown
545c87957f md: dm-raid should call helper function to clear rdev.
dm-raid currently open-codes the freeing of some members of
and rdev.  It is more maintainable to have it call common code
from md.c which does this for all call-sites.

So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:30 +10:00
Jim Kukunas
96e67703e7 lib/raid6: cleanup gen_syndrome function selection
Reorders functions in raid6_algos as well as the preference check
to reduce the number of functions tested on initialization.

Also, creates symmetry between choosing the gen_syndrome functions
and choosing the recovery functions.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:24 +10:00
Jim Kukunas
2dbf708448 lib/raid6: update test program for recovery functions
Test each combination of recovery and syndrome generation
functions.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:23 +10:00
Jim Kukunas
048a8b8c89 lib/raid6: Add SSSE3 optimized recovery functions
Add SSSE3 optimized recovery functions, as well as a system
for selecting the most appropriate recovery functions to use.

Originally-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:18 +10:00
Jim Kukunas
f674ef7b43 lib/raid6: fix test program build
<linux/module.h> drags in headers which are not visible to userspace,
thus breaking the build for the test program.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:16 +10:00
Jim Kukunas
ea4d26ae24 raid5: add AVX optimized RAID5 checksumming
Optimize RAID5 xor checksumming by taking advantage of
256-bit YMM registers introduced in AVX.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:04 +10:00
Jim Kukunas
56a519913e crypto: disable preemption while benchmarking RAID5 xor checksumming
With CONFIG_PREEMPT=y, we need to disable preemption while benchmarking
RAID5 xor checksumming to ensure we're actually measuring what we think
we're measuring.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:04 +10:00
Jim Kukunas
6a328475cc crypto: wait for a full jiffy in do_xor_speed
In the existing do_xor_speed(), there is no guarantee that we actually
run do_2() for a full jiffy. We get the current jiffy, then run do_2()
until the next jiffy.

Instead, let's get the current jiffy, then wait until the next jiffy
to start our test.

Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:54:03 +10:00
NeilBrown
3ea7daa5d7 md/raid10: add reshape support
A 'near' or 'offset' lay RAID10 array can be reshaped to a different
'near' or 'offset' layout, a different chunk size, and a different
number of devices.
However the number of copies cannot change.

Unlike RAID5/6, we do not support having user-space backup data that
is being relocated during a 'critical section'.  Rather, the
data_offset of each device must change so that when writing any block
to a new location, it will not over-write any data that is still
'live'.

This means that RAID10 reshape is not supportable on v0.90 metadata.

The different between the old data_offset and the new_offset must be
at least the larger of the chunksize multiplied by offset copies of
each of the old and new layout. (for 'near' mode, offset_copies == 1).

A larger difference of around 64M seems useful for in-place reshapes
as more data can be moved between metadata updates.
Very large differences (e.g. 512M) seem to slow the process down due
to lots of long seeks (on oldish consumer graded devices at least).

Metadata needs to be updated whenever the place we are about to write
to is considered - by the current metadata - to still contain data in
the old layout.

[unbalanced locking fix from Dan Carpenter <dan.carpenter@oracle.com>]

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-22 13:53:47 +10:00
NeilBrown
deb200d085 md/raid10: split out interpretation of layout to separate function.
We will soon be interpreting the layout (and chunksize etc) from
multiple places to support reshape.  So split it out into separate
function.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown
f8c9e74ff0 md/raid10: Introduce 'prev' geometry to support reshape.
When RAID10 supports reshape it will need a 'previous' and a 'current'
geometry, so introduce that here.
Use the 'prev' geometry when before the reshape_position, and the
current 'geo' when beyond it.  At other times, use both as
appropriate.

For now, both are identical (And reshape_position is never set).

When we use the 'prev' geometry, we must use the old data_offset.
When we use the current (And a reshape is happening) we must use
the new_data_offset.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown
c804cdecea md: use resync_max_sectors for reshape as well as resync.
Some resync type operations need to act on the address space of the
device, others on the address space of the array.

This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only.  However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:33 +10:00
NeilBrown
1fdd6fc92f md: teach sync_page_io about new_data_offset.
Some code in raid1 and raid10 use sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.

So add that understanding.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:32 +10:00
NeilBrown
5cf00fcd3c md/raid10: collect some geometry fields into a dedicated structure.
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there.  Then we will more easily be able to add
a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:20 +10:00
NeilBrown
b5254dd5fd md/raid5: allow for change in data_offset while managing a reshape.
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.

To this end we find the minimum offset difference across all devices
and add that where appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:01 +10:00
NeilBrown
05616be5e1 md/raid5: Use correct data_offset for all IO.
As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown
c6563a8c38 md: add possibility to change data-offset for devices.
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previous check that all 'pad' fields were zero,
 we need a new FEATURE flag for this.
 A (belatedly) check that all remaining 'pad' fields are
 zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be.  At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors.  So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown
2c810cddc4 md: allow a reshape operation to be reversed.
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.

To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.

However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape.  Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
Shaohua Li
b5e1b8cee7 md: using GFP_NOIO to allocate bio for flush request
A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.

This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:26:59 +10:00
NeilBrown
b0d634d568 md/raid10: fix transcription error in calc_sectors conversion.
The old code was
		sector_div(stride, fc);
the new code was
		sector_dir(size, conf->near_copies);

'size' is right (the stride various wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-19 09:01:13 +10:00
Jonathan Brassow
0d9f4f135e MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
Use del_timer_sync to remove timer before mddev_suspend finishes.

We don't want a timer going off after an mddev_suspend is called.  This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend.  This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:38:24 +10:00
NeilBrown
6508fdbf40 md/raid10: set dev_sectors properly when resizing devices in array.
raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so need to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf.   So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:08:45 +10:00
NeilBrown
b16b1b6cd0 md/bitmap: fix calculation of 'chunks' - missing shift.
commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro"
replaced CHUNK_BLOCK_RATIO() by the same text that was
replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong.

The result is that 'chunks' is often too small by 1,
which can sometimes result in a crash (not sure how).

So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO
which is no longe used.

Reported-by: Karl Newman <siliconfiend@gmail.com>
Tested-by: Karl Newman <siliconfiend@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-04 17:03:18 +10:00
Linus Torvalds
69964ea4c7 Linux 3.4-rc5 2012-04-29 15:19:10 -07:00
Linus Torvalds
6cfdd02b88 Power management fixes for 3.4
Fix for an issue causing hibernation to hang on systems with highmem (that
 practically means i386) due to broken memory management (bug introduced in 3.2,
 so -stable material) and PM documentation update making the freezer
 documentation follow the code again after some recent updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJPnbeLAAoJEKhOf7ml8uNs4nkQAKwhfWWfbM7ZepPfT56A5NW1
 9vlfgO+1ibUgdjkO0hi1biCAbARTNVS5eCLyRJW0W/msGgL51nYleBeFmwwx5E6h
 m5Vwr/53cGeAeF0AOkrQkD45YKJaAlTmWF4/T2YWKLWNMgaVuLmGf7eyYZ6rP1NO
 rJxzUOMC6UrRjIA+S2anDU0CdMyqDHvV3OmY+InZBikFCk0YAtDYUYfNDNqQpEBG
 bzkG3SyaJeqnbQDkhme7U/uAPJCThSz2Z4gvvOxiXdB+I+yp6FhluhLSGxqMh/kj
 OUAJe9s6AAdKz+K62/OgowwucxvmeJRCyYWkN2ZEpsZLoqTEOqLNS4+eaUO6xS/2
 tq89LnfSIVFwRx23XeVr/oMfxUJZ8VKZENo5Pm6NjTAYykTeyD4ug/GAHqgXR0TT
 B+fvx8QmQ68R843aJsjR9h0AKsSeXfgCAROJt+x0ONYAvmJNV62nzs81broDEl4I
 BmWHpOWI7wlzMPt7bNWn4ev4K+WhbVsioFDS61he0Y47Rqt3yUJ8G2OfBq6JYndw
 As4ImoPOVGl0+TKcHJ9Y3bVPnsY7fJyF0GeG50NHxsVFsnTv+rYZ9K3GM9KExhO/
 5mCfoHNgkOJnhGHfZppbnQBHbmjH8EA3QUx57Abo+Q4wiPNNAVG9P1JpZeyGj8KF
 3YML5FjjGQHtYBWeH5WR
 =HaVL
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael J. Wysocki:
 "Fix for an issue causing hibernation to hang on systems with highmem
  (that practically means i386) due to broken memory management (bug
  introduced in 3.2, so -stable material) and PM documentation update
  making the freezer documentation follow the code again after some
  recent updates."

* tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Freezer / Docs: Update documentation about freezing of tasks
  PM / Hibernate: fix the number of pages used for hibernate/thaw buffering
2012-04-29 15:00:44 -07:00
Linus Torvalds
64f371bc31 autofs: make the autofsv5 packet file descriptor use a packetized pipe
The autofs packet size has had a very unfortunate size problem on x86:
because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
because the packet data was not 8-byte aligned, the size of the autofsv5
packet structure differed between 32-bit and 64-bit modes despite
looking otherwise identical (300 vs 304 bytes respectively).

We first fixed that up by making the 64-bit compat mode know about this
problem in commit a32744d4ab ("autofs: work around unhappy compat
problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
64-bit kernel because everything then worked the same way as on a 32-bit
kernel.

But it turned out that 'automount' had actually known and worked around
this problem in user space, so fixing the kernel to do the proper 32-bit
compatibility handling actually *broke* 32-bit automount on a 64-bit
kernel, because it knew that the packet sizes were wrong and expected
those incorrect sizes.

As a result, we ended up reverting that compatibility mode fix, and
thus breaking systemd again, in commit fcbf94b9de.

With both automount and systemd doing a single read() system call, and
verifying that they get *exactly* the size they expect but using
different sizes, it seemed that fixing one of them inevitably seemed to
break the other.  At one point, a patch I seriously considered applying
from Michael Tokarev did a "strcmp()" to see if it was automount that
was doing the operation.  Ugly, ugly.

However, a prettier solution exists now thanks to the packetized pipe
mode.  By marking the communication pipe as being packetized (by simply
setting the O_DIRECT flag), we can always just write the bigger packet
size, and if user-space does a smaller read, it will just get that
partial end result and the extra alignment padding will simply be thrown
away.

This makes both automount and systemd happy, since they now get the size
they asked for, and the kernel side of autofs simply no longer needs to
care - it could pad out the packet arbitrarily.

Of course, if there is some *other* user of autofs (please, please,
please tell me it ain't so - and we haven't heard of any) that tries to
read the packets with multiple writes, that other user will now be
broken - the whole point of the packetized mode is that one system call
gets exactly one packet, and you cannot read a packet in pieces.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-29 13:30:08 -07:00