During a recovery of reshape the early part of some devices might be
in-sync while the later parts are not.
We we know we are looking at an early part it is good to treat that
part as in-sync for stripe calculations.
This is particularly important for a reshape which suffers device
failure. Treating the data as in-sync can mean the difference between
data-safety and data-loss.
Signed-off-by: NeilBrown <neilb@suse.de>
When we are reshaping an array, the device failure combinations
that cause us to decide that the array as failed are more subtle.
In particular, any 'spare' will be fully in-sync in the section
of the array that has already been reshaped, thus failures that
affect only that section are less critical.
So encode this subtlety in a new function and call it as appropriate.
The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6
conversion where the last two devices failed.
This resulted in:
good good good good incomplete good good failed failed
while converting a 5-drive RAID6 to 8 drive RAID5
The incomplete device causes the whole array to look bad,
bad as it was actually good for the section that had been
converted to 8-drives, all the data was actually safe.
Reported-by: Terry Morris <tbmorris@tbmorris.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When an array is reshaped to have fewer devices, the reshape proceeds
from the end of the devices to the beginning.
If a device happens to be non-In_sync (which is possible but rare)
we would normally update the ->recovery_offset as the reshape
progresses. However that would be wrong as the recover_offset records
that the early part of the device is in_sync, while in fact it would
only be the later part that is in_sync, and in any case the offset
number would be measured from the wrong end of the device.
Relatedly, if after a reshape a spare is discovered to not be
recoverred all the way to the end, not allow spare_active
to incorporate it in the array.
This becomes relevant in the following sample scenario:
A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
operation.
The RAID5->RAID6 conversion will cause a 5 drive to be included as a
spare, then the 5drive -> 6drive reshape will effectively rebuild that
spare as it progresses. The 6th drive is treated as in_sync the whole
time as there is never any case that we might consider reading from
it, but must not because there is no valid data.
If we interrupt this reshape part-way through and reverse it to return
to a 5-drive RAID6 (or event a 4-drive RAID5), we don't want to update
the recovery_offset - as that would be wrong - and we don't want to
include that spare as active in the 5-drive RAID6 when the reversed
reshape completed and it will be mostly out-of-sync still.
Signed-off-by: NeilBrown <neilb@suse.de>
The entries in the stripe_cache maintained by raid5 are enlarged
when we increased the number of devices in the array, but not
shrunk when we reduce the number of devices.
So if entries are added after reducing the number of devices, we
much ensure to initialise the whole entry, not just the part that
is currently relevant. Otherwise if we enlarge the array again,
we will reference uninitialised values.
As grow_buffers/shrink_buffer now want to use a count that is stored
explicity in the raid_conf, they should get it from there rather than
being passed it as a parameter.
Signed-off-by: NeilBrown <neilb@suse.de>
Only level 5 with layout=PARITY_N can be taken over to raid0 now.
Lets allow level 4 either.
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
After takeover from raid5/10 -> raid0 mddev->layout is not cleared.
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover(). That
gets set in run()
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).
This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared to high.
Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.
So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspend.
It can also update some symlinks in sysfs which previously were not be
updated correctly.
Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Such NULL pointer dereference can occur when the driver was fixing the
read errors/bad blocks and the disk was physically removed
causing a system crash. This patch check if the
rcu_dereference() returns valid rdev before accessing it in fix_read_error().
Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit b821eaa572 broke partition
detection for md arrays.
The logic was almost right. However if revalidate_disk is called
when the device is not yet open, bdev->bd_disk won't be set, so the
flush_disk() Call will not set bd_invalidated.
So when md_open is called we still need to ensure that
->bd_invalidated gets set. This is easily done with a call to
check_disk_size_change in the place where the offending commit removed
check_disk_change. At the important times, the size will have changed
from 0 to non-zero, so check_disk_size_change will set bd_invalidated.
Tested-by: Duncan <1i5t5.duncan@cox.net>
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: NeilBrown <neilb@suse.de>
The block number comes from bulkstat based inode lookups to shortcut
the mapping calculations. We ar enot able to trust anything from
bulkstat, so drop the block number as well so that the correct
lookups and mappings are always done.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Inode numbers may come from somewhere external to the filesystem
(e.g. file handles, bulkstat information) and so are inherently
untrusted. Rename the flag we use for these lookups to make it
obvious we are doing a lookup of an untrusted inode number and need
to verify it completely before trying to read it from disk.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we decode a handle or do a bulkstat lookup, we are using an
inode number we cannot trust to be valid. If we are deleting inode
chunks from disk (default noikeep mode), then we cannot trust the on
disk inode buffer for any given inode number to correctly reflect
whether the inode has been unlinked as the di_mode nor the
generation number may have been updated on disk.
This is due to the fact that when we delete an inode chunk, we do
not write the clusters back to disk when they are removed - instead
we mark them stale to avoid them being written back potentially over
the top of something that has been subsequently allocated at that
location. The result is that we can have locations of disk that look
like they contain valid inodes but in reality do not. Hence we
cannot simply convert the inode number to a block number and read
the location from disk to determine if the inode is valid or not.
As a result, and XFS_IGET_BULKSTAT lookup needs to actually look the
inode up in the inode allocation btree to determine if the inode
number is valid or not.
It should be noted even on ikeep filesystems, there is the
possibility that blocks on disk may look like valid inode clusters.
e.g. if there are filesystem images hosted on the filesystem. Hence
even for ikeep filesystems we really need to validate that the inode
number is valid before issuing the inode buffer read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
As reported by Sergei, a couple of braces were missing after
the WARN removal patch.
[07/22] OMAP: hwmod: Replace WARN by pr_warning if clock lookup failed
https://patchwork.kernel.org/patch/100756/
Signed-off-by: Benoit Cousson <b-cousson@ti.com>
[paul@pwsan.com: fixed patch description per Anand's E-mail]
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Cc: Sergei Shtylyov <sshtylyov@mvista.com>
Cc: Anand Gadiyar <gadiyar@ti.com>
Because cgroup_fork() is ran before sched_fork() [ from copy_process() ]
and the child's pid is not yet visible the child is pinned to its
cgroup. Therefore we can silence this warning.
A nicer solution would be moving cgroup_fork() to right after
dup_task_struct() and exclude PF_STARTING from task_subsys_state().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
sky2_phy_reinit is called by the ethtool helpers sky2_set_settings,
sky2_nway_reset and sky2_set_pauseparam when netif_running.
However, at the end of sky2_phy_init GM_GP_CTRL has GM_GPCR_RX_ENA and
GM_GPCR_TX_ENA cleared. So, doing these commands causes the device to
stop working:
$ ethtool -r eth0
$ ethtool -A eth0 autoneg off
Fix this issue by enabling Rx/Tx after running sky2_phy_init in
sky2_phy_reinit.
Signed-off-by: Brandon Philips <bphilips@suse.de>
Tested-by: Brandon Philips <bphilips@suse.de>
Cc: stable@kernel.org
Tested-by: Mike McCormack <mikem@ring3k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are few places where ANI is started without checking
if it is right to start. This might lead to a case where ani
timer would be left undeleted and cause improper memory acccess
during module unload. This bug is clearly exposed with
paprd support where the driver detects tx hang and does a
chip reset. During this reset ani is (re)started without checking
if it needs to be started. This would leave a timer scheduled
even after all the resources are freed and cause a panic.
This patch introduces a bit in sc_flags to indicate if ani
needs to be started in sw_scan_start() and ath_reset().
This would fix the following panic. This issue is easily seen
with ar9003 + paprd.
BUG: unable to handle kernel paging request at 0000000000003f38
[<ffffffff81075391>] ? __queue_work+0x41/0x50
[<ffffffff8106afaa>] run_timer_softirq+0x17a/0x370
[<ffffffff81088be8>] ? tick_dev_program_event+0x48/0x110
[<ffffffff81061f69>] __do_softirq+0xb9/0x1f0
[<ffffffff810ba060>] ? handle_IRQ_event+0x50/0x160
[<ffffffff8100af5c>] call_softirq+0x1c/0x30
[<ffffffff8100c9f5>] do_softirq+0x65/0xa0
[<ffffffff81061e25>] irq_exit+0x85/0x90
[<ffffffff8155e095>] do_IRQ+0x75/0xf0
[<ffffffff815570d3>] ret_from_intr+0x0/0x11
<EOI>
[<ffffffff812fd67b>] ? acpi_idle_enter_simple+0xe4/0x119
[<ffffffff812fd674>] ? acpi_idle_enter_simple+0xdd/0x119
[<ffffffff81441c87>] cpuidle_idle_call+0xa7/0x140
[<ffffffff81008da3>] cpu_idle+0xb3/0x110
[<ffffffff81550722>] start_secondary+0x1ee/0x1f5
Signed-off-by: Vasanthakumar Thiagarajan <vasanth@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Disable statistics initialization for eth clients that do not support
statistics. This prevents memory corruption on bnx2x hw.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
commit aa2ea0586d (tcp: fix outsegs stat for TSO segments) incorrectly
assumed SNMP_ADD_STATS() was used from BH context.
Fix this using mib[!in_softirq()] instead of mib[0]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some programs like Skype trying to set capture volume automatically.
Normally it will tray, carefully step by step lover or higher, set the volume.
In real word it work not really well, because devises and vendors lie about
real audio settings.
For example most Logitech webcams have 6400 or 3500 steps for capture volume.
They do not tell that actual resolution is 384. So we have only 7 or 18 real
steps. In this patch I set real resolution only for tested devices.
Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The task_group() function returns a pointer that must be protected
by either RCU, the ->alloc_lock, or the cgroup lock (see the
rcu_dereference_check() in task_subsys_state(), which is invoked by
task_group()). The wake_affine() function currently does none of these,
which means that a concurrent update would be within its rights to free
the structure returned by task_group(). Because wake_affine() uses this
structure only to compute load-balancing heuristics, there is no reason
to acquire either of the two locks.
Therefore, this commit introduces an RCU read-side critical section that
starts before the first call to task_group() and ends after the last use
of the "tg" pointer returned from task_group(). Thanks to Li Zefan for
pointing out the need to extend the RCU read-side critical section from
that proposed by the original patch.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
virtio-pci resets the device at startup by writing to the status
register, but this does not clear the pci config space,
specifically msi enable status which affects register
layout.
This breaks things like kdump when they try to use e.g. virtio-blk.
Fix by forcing msi off at startup. Since pci.c already has
a routine to do this, we export and use it instead of duplicating code.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
The non-coherent bulkstat versionsthat look directly at the inode
buffers causes various problems with performance optimizations that
make increased use of just logging inodes. This patch makes bulkstat
always use iget, which should be fast enough for normal use with the
radix-tree based inode cache introduced a while ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This patch prevents user "foo" from using the SWAPEXT ioctl to swap
a write-only file owned by user "bar" into a file owned by "foo" and
subsequently reading it. It does so by checking that the file
descriptors passed to the ioctl are also opened for reading.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The code we had to clear the MSR_SE bit was not doing anything because
the caller (ultimately single_step_exception() in traps.c) had already
cleared. Instead of trying to leave MSR_SE set if the TIF_SINGLESTEP
flag is set (which indicates that the process is being single-stepped
by ptrace), we instead return NOTIFY_DONE in that case, which means
the caller will generate a SIGTRAP for the process.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code would accept an access to an address one byte past the end
of the requested range as legitimate, due to having a "<=" rather than
a "<". This fixes that and cleans up the code a bit.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cintiq 21UX2 added 8 more bits for the tool serial number and more
buttons for the expresskey. We did not enable them properly in the
last patch.
Signed-off-by: Ping Cheng <pingc@wacom.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Some of the recent X86_MRST additions make some "select"s
conditional on X86_MRST but missed some related kconfig symbols,
causing:
drivers/built-in.o: In function `ps2_end_command':
(.text+0x257ab2): undefined reference to `i8042_check_port_owner'
drivers/built-in.o: In function `ps2_end_command':
(.text+0x257ae1): undefined reference to `i8042_unlock_chip'
drivers/built-in.o: In function `ps2_begin_command':
(.text+0x257b40): undefined reference to `i8042_check_port_owner'
drivers/built-in.o: In function `ps2_begin_command':
(.text+0x257b6f): undefined reference to `i8042_lock_chip'
when SERIO_I8042=m, SERIO_LIBPS2=y, KEYBOARD_ATKBD=y.
We need to make i8042 dependant upon !X86_MRST and allow deselecting
atkbd on Moorestown even when !CONFIG_EMBEDDED.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Put the code that is common to both the referral and ordinary mount cases
into a common helper routine.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the attempt to read the calldir fails, then instead of storing the read
bytes, we currently discard them. This leads to a garbage final result when
upon re-entry to the same routine, we read the remaining bytes.
Fixes the regression in bugzilla number 16213. Please see
https://bugzilla.kernel.org/show_bug.cgi?id=16213
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
S_ISDIR(fsinfo.fattr->mode) checks the file type rather than the mode bits,
so we should be checking for the NFS_ATTR_FATTR_TYPE fattr property.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
This patch removes the setting of the low_latency flag.
tty_flip_buffer_push() is occasionally being called in irq context, which
causes a hang if the low_latency flag is set.
Removing the low_latency flag only seems to impact the flush to ldisc,
which will now be put on a workqueue.
Signed-off-by: Filip Aben <f.aben@option.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many a times, the requested breakpoint length can be less than the
fixed breakpoint length i.e. 8 bytes supported by PowerPC 64-bit
server (Book III S) processors. This could lead to extraneous
interrupts resulting in false breakpoint notifications. This
detects and discards such interrupts for non-ptrace requests.
We don't change ptrace behaviour to avoid breaking compatability.
[Suggestion from Paul Mackerras <paulus@samba.org> to add a new flag in
'struct arch_hw_breakpoint' to identify extraneous interrupts]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
A signal delivered between a hw_breakpoint_handler() and the
single_step_dabr_instruction() will not have the breakpoint active
while the signal handler is running -- the signal delivery will
set up a new MSR value which will not have MSR_SE set, so we
won't get the signal step interrupt until and unless the signal
handler returns (which it may never do).
To fix this, we restore the breakpoint when delivering a signal --
we clear the MSR_SE bit and set the DABR again. If the signal
handler returns, the DABR interrupt will occur again when the
instruction that we were originally trying to single-step gets
re-executed.
[Paul Mackerras <paulus@samba.org> pointed out the need to do this.]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
If an alignment interrupt occurs on an instruction that is being
single-stepped, the alignment interrupt handler currently handles
the single-step condition by unconditionally sending a SIGTRAP to
the process. Other synchronous interrupts that result in the
instruction being emulated do likewise.
With hw_breakpoint support, the hw_breakpoint code needs to be able
to intercept these single-step events as well as those where the
instruction executes normally and a trace interrupt happens.
Fix this by making emulate_single_step() use the existing
single_step_exception() function instead of calling _exception()
directly. We then make single_step_exception() use the abstracted
clear_single_step() rather than clearing bits in the MSR image
directly so that emulate_single_step() will continue to work
correctly on Book 3E processors.
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Implement perf-events based hw-breakpoint interfaces for PowerPC
64-bit server (Book III S) processors. This allows access to a
given location to be used as an event that can be counted or
profiled by the perf_events subsystem.
This is done using the DABR (data breakpoint register), which can
also be used for process debugging via ptrace. When perf_event
hw_breakpoint support is configured in, the perf_event subsystem
manages the DABR and arbitrates access to it, and ptrace then
creates a perf_event when it is requested to set a data breakpoint.
[Adopted suggestions from Paul Mackerras <paulus@samba.org> to
- emulate_step() all system-wide breakpoints and single-step only the
per-task breakpoints
- perform arch-specific cleanup before unregistration through
arch_unregister_hw_breakpoint()
]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Certain architectures (such as PowerPC) have a need to clean up data
structures before a breakpoint is unregistered. This introduces an
arch-specific hook in release_bp_slot() along with a weak definition
in the form of a stub function.
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This extends the emulate_step() function to handle a large proportion
of the Book I instructions implemented on current 64-bit server
processors. The aim is to handle all the load and store instructions
used in the kernel, plus all of the instructions that appear between
l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).
The new code can emulate user mode instructions, and checks the
effective address for a load or store if the saved state is for
user mode. It doesn't handle little-endian mode at present.
For floating-point, Altivec/VMX and VSX instructions, it checks
that the saved MSR has the enable bit for the relevant facility
set, and if so, assumes that the FP/VMX/VSX registers contain
valid state, and does loads or stores directly to/from the
FP/VMX/VSX registers, using assembly helpers in ldstfp.S.
Instructions supported now include:
* Loads and stores, including some but not all VMX and VSX instructions,
and lmw/stmw
* Atomic loads and stores (l[dw]arx, st[dw]cx.)
* Arithmetic instructions (add, subtract, multiply, divide, etc.)
* Compare instructions
* Rotate and mask instructions
* Shift instructions
* Logical instructions (and, or, xor, etc.)
* Condition register logical instructions
* mtcrf, cntlz[wd], exts[bhw]
* isync, sync, lwsync, ptesync, eieio
* Cache operations (dcbf, dcbst, dcbt, dcbtst)
The overflow-checking arithmetic instructions are not included, but
they appear not to be ever used in C code.
This uses decimal values for the minor opcodes in the switch statements
because that is what appears in the Power ISA specification, thus it is
easier to check that they are correct if they are in decimal.
If this is used to single-step an instruction where a data breakpoint
interrupt occurred, then there is the possibility that the instruction
is a lwarx or ldarx. In that case we have to be careful not to lose the
reservation until we get to the matching st[wd]cx., or we'll never make
forward progress. One alternative is to try to arrange that we can
return from interrupts and handle data breakpoint interrupts without
losing the reservation, which means not using any spinlocks, mutexes,
or atomic ops (including bitops). That seems rather fragile. The
other alternative is to emulate the larx/stcx and all the instructions
in between. This is why this commit adds support for a wide range
of integer instructions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix the following compile warning. kctl should be NULL-initialized.
sound/pci/hda/patch_realtek.c: In function ‘alc_build_controls’:
sound/pci/hda/patch_realtek.c:2550:23: warning: ‘kctl’ may be used uninitialized in this function
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The previous CMT fixup accidentally copied in the TMU shift value, reset
this back to its original value while preserving the TMU fix.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
This fixes a race between handle_reply finishing an mds request, signalling
completion, and then dropping the request structing and its dentry+inode
refs, and pre_umount function waiting for requests to finish before
letting the vfs tear down the dcache. If umount was delayed waiting for
mds requests, we could race and BUG in shrink_dcache_for_umount_subtree
because of a slow dput.
This delays umount until the msgr queue flushes, which means handle_reply
will exit and will have dropped the ceph_mds_request struct. I'm assuming
the VFS has already ensured that its calls have all completed and those
request refs have thus been dropped as well (I haven't seen that race, at
least).
Signed-off-by: Sage Weil <sage@newdream.net>