Commit Graph

257869 Commits

Shawn Guo
94cc6a8656 mmc: sdhci: merge two sdhci-pltfm.h into one
The structure sdhci_pltfm_data does not need to be in a public
header like include/linux/mmc/sdhci-pltfm.h, so the patch moves it
into drivers/mmc/host/sdhci-pltfm.h and eliminates the former.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
Reviewed-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Chris Ball <cjb@laptop.org>
2011-07-20 17:20:48 -04:00
Shawn Guo
38576af1f8 mmc: sdhci: make sdhci-of device drivers self registered
The patch turns the sdhci-of-core common stuff into helper functions
added to sdhci-pltfm.c, and makes the sdhci-of device drivers
self-registered using the same pair of .probe and .remove used by
sdhci-pltfm device drivers.

As a result, sdhci-of-core.c and sdhci-of.h can be eliminated with
those common things merged into sdhci-pltfm.c and sdhci-pltfm.h
respectively.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Reviewed-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Chris Ball <cjb@laptop.org>
2011-07-20 17:20:47 -04:00
Shawn Guo
e307148fd4 mmc: sdhci: eliminate sdhci_of_host and sdhci_of_data
The patch migrates the use of sdhci_of_host and sdhci_of_data to
sdhci_pltfm_host and sdhci_pltfm_data, so that the former pair can
be eliminated.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
Reviewed-by: Wolfram Sang <w.sang@pengutronix.de>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Signed-off-by: Chris Ball <cjb@laptop.org>
2011-07-20 17:20:38 -04:00
Shawn Guo
85d6509dc8 mmc: sdhci: make sdhci-pltfm device drivers self registered
The patch turns the common stuff in sdhci-pltfm.c into functions, and
gives the device drivers their own .probe and .remove which in turn
call into the common functions, so that the sdhci-pltfm device drivers
register themselves and keep all device-specific things out of the
common sdhci-pltfm file.

Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Reviewed-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Signed-off-by: Chris Ball <cjb@laptop.org>
2011-07-20 17:16:06 -04:00
Jan H. Schönherr
7f70893173 rcu: Fix wrong check in list_splice_init_rcu()
If the list to be spliced is empty, then list_splice_init_rcu() has
nothing to do.  Unfortunately, list_splice_init_rcu() does not check
the list to be spliced; it instead checks the list to be spliced into.
This results in memory leaks given current usage.  This commit
therefore fixes the empty-list check.
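
The corrected guard, sketched (same signature as include/linux/rculist.h;
splice body elided):

    static inline void list_splice_init_rcu(struct list_head *list,
                                            struct list_head *head,
                                            void (*sync)(void))
    {
        if (list_empty(list))   /* was list_empty(head): the bug */
            return;
        /* ... synchronize, splice the elements, reinit @list ... */
    }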

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20 14:10:20 -07:00
Paul E. McKenney
cefcb60201 net,rcu: Convert call_rcu(xt_rateest_free_rcu) to kfree_rcu()
The RCU callback xt_rateest_free_rcu() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().  This also allows us to dispense
with an rcu_barrier() call, speeding up unloading of this module.
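
The pattern behind this conversion and the string of conversions below,
sketched with a hypothetical struct foo whose rcu_head field is named rcu:

    /* Before: a callback that exists only to kfree() the object. */
    static void foo_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct foo, rcu));
    }
    ...
        call_rcu(&f->rcu, foo_free_rcu);

    /* After: no callback at all; name the rcu_head field instead. */
        kfree_rcu(f, rcu);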

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:19 -07:00
Paul E. McKenney
a95cded32d sysctl,rcu: Convert call_rcu(free_head) to kfree_rcu()
The RCU callback free_head just calls kfree(), so we can use kfree_rcu()
instead of call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:18 -07:00
Lai Jiangshan
22a3c7d188 vmalloc,rcu: Convert call_rcu(rcu_free_vb) to kfree_rcu()
The RCU callback rcu_free_vb() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:18 -07:00
Lai Jiangshan
14769de93f vmalloc,rcu: Convert call_rcu(rcu_free_va) to kfree_rcu()
The RCU callback rcu_free_va() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:17 -07:00
Lai Jiangshan
d4ee9aa33d ipc,rcu: Convert call_rcu(ipc_immediate_free) to kfree_rcu()
The RCU callback ipc_immediate_free() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:16 -07:00
Lai Jiangshan
693a8b6eec ipc,rcu: Convert call_rcu(free_un) to kfree_rcu()
The RCU callback free_un() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:16 -07:00
Lai Jiangshan
449a68cc65 security,rcu: Convert call_rcu(sel_netport_free) to kfree_rcu()
The RCU callback sel_netport_free() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:15 -07:00
Lai Jiangshan
9801c60e99 security,rcu: Convert call_rcu(sel_netnode_free) to kfree_rcu()
The RCU callback sel_netnode_free() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:14 -07:00
Lai Jiangshan
f218a7ee7a ia64,rcu: Convert call_rcu(sn_irq_info_free) to kfree_rcu()
The RCU callback sn_irq_info_free() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:13 -07:00
Lai Jiangshan
57bdfbf9ee block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()
The RCU callback disk_free_ptbl_rcu_cb() just calls kfree(), so we
can use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:13 -07:00
Lai Jiangshan
8497a24a43 scsi,rcu: Convert call_rcu(fc_rport_free_rcu) to kfree_rcu()
The RCU callback fc_rport_free_rcu() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Robert Love <robert.w.love@intel.com>
Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:12 -07:00
Lai Jiangshan
3b097c4696 audit_tree,rcu: Convert call_rcu(__put_tree) to kfree_rcu()
The RCU callback __put_tree() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:11 -07:00
Konrad Rzeszutek Wilk
3a6d28b11a Merge branch 'stable/xen-pciback-0.6.3' into stable/drivers
* stable/xen-pciback-0.6.3:
  xen/pciback: Have 'passthrough' option instead of XEN_PCIDEV_BACKEND_PASS and XEN_PCIDEV_BACKEND_VPCI
  xen/pciback: Remove the DEBUG option.
  xen/pciback: Drop two backends, squash and cleanup some code.
  xen/pciback: Print out the MSI/MSI-X (PIRQ) values
  xen/pciback: Don't set up a fake IRQ handler for SR-IOV devices.
  xen: rename pciback module to xen-pciback.
  xen/pciback: Fine-grain the spinlocks and fix BUG: scheduling while atomic cases.
  xen/pciback: Allocate IRQ handler for device that is shared with guest.
  xen/pciback: Disable MSI/MSI-X when resetting a device
  xen/pciback: guest SR-IOV support for PV guest
  xen/pciback: Register the owner (domain) of the PCI device.
  xen/pciback: Cleanup the driver based on checkpatch warnings and errors.
  xen/pciback: xen pci backend driver.

Conflicts:
	drivers/xen/Kconfig
2011-07-20 15:33:51 -04:00
Ingo Molnar
d1e9ae47a0 Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent
2011-07-20 20:59:26 +02:00
Lai Jiangshan
6034f7e603 security,rcu: Convert call_rcu(whitelist_item_free) to kfree_rcu()
The RCU callback whitelist_item_free() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 11:05:30 -07:00
Lai Jiangshan
b119cbab3a md,rcu: Convert call_rcu(free_conf) to kfree_rcu()
The RCU callback free_conf() just calls kfree(), so we can use
kfree_rcu() instead of call_rcu().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: NeilBrown <neilb@suse.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 11:05:29 -07:00
Paul E. McKenney
a841796f11 signal: align __lock_task_sighand() irq disabling and RCU
The __lock_task_sighand() function calls rcu_read_lock() with interrupts
and preemption enabled, but later calls rcu_read_unlock() with interrupts
disabled.  It is therefore possible that this RCU read-side critical
section will be preempted and later RCU priority boosted, which means that
rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but
with interrupts disabled. This results in lockdep splats, so this commit
nests the RCU read-side critical section within the interrupt-disabled
region of code.  This prevents the RCU read-side critical section from
being preempted, and thus prevents the attempt to deboost with interrupts
disabled.

It is quite possible that a better long-term fix is to make rt_mutex_unlock()
disable irqs when acquiring the rt_mutex structure's ->wait_lock.
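
A sketch of the resulting nesting (simplified from kernel/signal.c):

    struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
                                               unsigned long *flags)
    {
        struct sighand_struct *sighand;

        for (;;) {
            local_irq_save(*flags);     /* irqs off first ... */
            rcu_read_lock();            /* ... then the RCU read side */
            sighand = rcu_dereference(tsk->sighand);
            if (unlikely(sighand == NULL)) {
                rcu_read_unlock();
                local_irq_restore(*flags);
                break;
            }
            spin_lock(&sighand->siglock);
            if (likely(sighand == tsk->sighand)) {
                rcu_read_unlock();      /* unlock with irqs still off */
                break;
            }
            spin_unlock(&sighand->siglock);
            rcu_read_unlock();
            local_irq_restore(*flags);
        }
        return sighand;
    }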

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20 11:04:54 -07:00
Peter Zijlstra
ec433f0c51 softirq,rcu: Inform RCU of irq_exit() activity
The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level.  This fails because exit_irq()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level.  This situation
can result in failures as follows:

 $task			IRQ		SoftIRQ

 rcu_read_lock()

 /* do stuff */

 <preempt> |= UNLOCK_BLOCKED

 rcu_read_unlock()
   --t->rcu_read_lock_nesting

			irq_enter();
			/* do stuff, don't use RCU */
			irq_exit();
			  sub_preempt_count(IRQ_EXIT_OFFSET);
			  invoke_softirq()

					ttwu();
					  spin_lock_irq(&pi->lock)
					  rcu_read_lock();
					  /* do stuff */
					  rcu_read_unlock();
					    rcu_read_unlock_special()
					      rcu_report_exp_rnp()
					        ttwu()
					          spin_lock_irq(&pi->lock) /* deadlock */

   rcu_read_unlock_special(t);

Ed can trigger this easily because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq work
first, but even without that the above can happen.

Cure this by also excluding softirqs from the
rcu_read_unlock_special() handler and ensuring the force_irqthreads
ksoftirqd/# wakeup is done from full softirq context.

[ Alternatively, delaying the ->rcu_read_lock_nesting decrement
  until after the special handling would make the thing more robust
  in the face of interrupts as well.  And there is a separate patch
  for that. ]

Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: Ed Tomlinson <edt@aei.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20 10:50:12 -07:00
Peter Zijlstra
c5d753a55a sched: Add irq_{enter,exit}() to scheduler_ipi()
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.

Now that we actually do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things.

This affects things like timekeeping, NO_HZ and RCU, basically
everything with a hook in irq_enter/exit.

Explicit examples of things going wrong are:

  sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
                    a new reading from GTOD and TSC. Without this
                    callback, time is stuck in the past.

  RCU -- needs in_irq() to work in order to avoid some nasty deadlocks
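
The shape of the fix, sketched (a simplification; the fast-path test and
helper names are approximate):

    void scheduler_ipi(void)
    {
        if (!this_rq()->wake_list)  /* fast path: nothing queued */
            return;

        irq_enter();            /* run the NO_HZ/RCU/timekeeping hooks */
        sched_ttwu_pending();   /* the actual remote-wakeup work */
        irq_exit();
    }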

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20 10:50:11 -07:00
Paul E. McKenney
10f39bb1b2 rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
The addition of RCU read-side critical sections within runqueue and
priority-inheritance lock critical sections introduced some deadlock
cycles, for example, involving interrupts from __rcu_read_unlock()
where the interrupt handlers call wake_up().  This situation can cause
the instance of __rcu_read_unlock() invoked from interrupt to do some
of the processing that would otherwise have been carried out by the
task-level instance of __rcu_read_unlock().  When the interrupt-level
instance of __rcu_read_unlock() is called with a scheduler lock held
from interrupt-entry/exit situations where in_irq() returns false,
deadlock can result.

This commit resolves these deadlocks by using negative values of
the per-task ->rcu_read_lock_nesting counter to indicate that an
instance of __rcu_read_unlock() is in flight, which in turn prevents
instances from interrupt handlers from doing any special processing.
This patch is inspired by Steven Rostedt's earlier patch that similarly
made __rcu_read_unlock() guard against interrupt-mediated recursion
(see https://lkml.org/lkml/2011/7/15/326), but this commit refines
Steven's approach to avoid the need for preemption disabling on the
__rcu_read_unlock() fastpath and to also avoid the need for manipulating
a separate per-CPU variable.

This patch avoids the need for preempt_disable() by instead using negative
values of the per-task ->rcu_read_lock_nesting counter.  Note that nested
rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
never see ->rcu_read_lock_nesting go to zero, and will therefore never
invoke rcu_read_unlock_special(), thus preventing them from seeing the
RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
This patch also adds a check for ->rcu_read_lock_nesting being negative
in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
bit from being set should a scheduling-clock interrupt occur while
__rcu_read_unlock() is exiting from an outermost RCU read-side critical
section.

Of course, __rcu_read_unlock() can be preempted during the time that
->rcu_read_lock_nesting is negative.  This could result in the setting
of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
and would also result in this task being queued on the corresponding
rcu_node structure's blkd_tasks list.  Therefore, some later RCU read-side
critical section would enter rcu_read_unlock_special() to clean up --
which could result in deadlock if that critical section happened to be in
the scheduler where the runqueue or priority-inheritance locks were held.

This situation is dealt with by making rcu_preempt_note_context_switch()
check for negative ->rcu_read_lock_nesting, thus refraining from
queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are
already exiting from the outermost RCU read-side critical section (in
other words, we really are no longer actually in that RCU read-side
critical section).  In addition, rcu_preempt_note_context_switch()
invokes rcu_read_unlock_special() to carry out the cleanup in this case,
which clears out the ->rcu_read_unlock_special bits and dequeues the task
(if necessary), in turn avoiding needless delay of the current RCU grace
period and needless RCU priority boosting.

It is still illegal to call rcu_read_unlock() while holding a scheduler
lock if the prior RCU read-side critical section has ever had either
preemption or irqs enabled.  However, the common use case is legal,
namely where the entire RCU read-side critical section executes with
irqs disabled, for example, when the scheduler lock is held across the
entire lifetime of the RCU read-side critical section.
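
The core of the trick, sketched (simplified; the real code places its
barriers and ACCESS_ONCE() uses carefully):

    void __rcu_read_unlock(void)
    {
        struct task_struct *t = current;

        if (t->rcu_read_lock_nesting != 1) {
            --t->rcu_read_lock_nesting;         /* nested: nothing special */
        } else {
            t->rcu_read_lock_nesting = INT_MIN; /* mark unlock in flight */
            barrier();
            if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
                rcu_read_unlock_special(t);
            barrier();
            t->rcu_read_lock_nesting = 0;       /* really out now */
        }
    }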

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20 10:50:11 -07:00
Eric Dumazet
b56efcf0a4 slab: shrink sizeof(struct kmem_cache)
Reduce high-order allocations for some setups.
(NR_CPUS=4096 -> we need 64KB per kmem_cache struct)

We now allocate exactly the needed size (using nr_cpu_ids and
nr_node_ids).

This also makes the code a bit smaller on x86_64, since some field
offsets are now below the 127-byte limit:

Before patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22605  361665      32  384302   5dd2e mm/slab.o

After patch :
# size mm/slab.o
   text    data     bss     dec     hex filename
  22349  353473    8224  384046   5dc2e mm/slab.o
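
The mechanism, sketched (field layout approximate; the real struct has
many more members):

    struct kmem_cache {
        /* ... hot fields first, so their offsets stay small ... */
        struct kmem_list3 **nodelists;  /* now sized by nr_node_ids */
        struct array_cache *array[0];   /* flexible array: one slot per
                                         * nr_cpu_ids CPU, not NR_CPUS */
    };

    /* the allocation size then becomes roughly: */
    size = offsetof(struct kmem_cache, array[nr_cpu_ids]) +
           nr_node_ids * sizeof(struct kmem_list3 *);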

CC: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-20 20:27:56 +03:00
Peter Zijlstra
d110235d2c sched: Avoid creating superfluous NUMA domains on non-NUMA systems
When creating sched_domains, stop when we've covered the entire
target span instead of continuing to create domains, only to
later find they're redundant and throw them away again.

This keeps single-node systems from touching the funny NUMA
sched_domain creation code and reduces the risk posed by the new
SD_OVERLAP code.

Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: mahesh@linux.vnet.ibm.com
Cc: benh@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20 18:54:33 +02:00
Peter Zijlstra
e3589f6c81 sched: Allow for overlapping sched_domain spans
Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.

This is needed for machines with more than 16 nodes, because
sched_domain_node_span() will generate a node mask from the
16 nearest nodes without regard if these masks have any overlap.

Currently sched_domains have a sched_group that maps to their child
sched_domain span, and since there is no overlap we share the
sched_group between the sched_domains of the various CPUs. If however
there is overlap, we would need to link the sched_group list in
different ways for each cpu, and hence sharing isn't possible.

In order to solve this, allocate private sched_groups for each CPU's
sched_domain but have the sched_groups share a sched_group_power
structure such that we can uniquely track the power.

Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20 18:32:41 +02:00
Peter Zijlstra
9c3f75cbd1 sched: Break out cpu_power from the sched_group structure
In order to prepare for non-unique sched_groups per domain, we need to
carry the cpu_power elsewhere, so put a level of indirection in.
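
The indirection, sketched (fields approximate):

    struct sched_group_power {
        atomic_t ref;           /* shared by all groups covering it */
        unsigned int power;     /* was sched_group::cpu_power */
    };

    struct sched_group {
        struct sched_group *next;
        struct sched_group_power *sgp;  /* the new level of indirection */
        unsigned int group_weight;
        unsigned long cpumask[0];
    };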

Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20 18:32:40 +02:00
Axel Lin
2dcd9543a2 HID: emsff: properly handle emsff_init failure
emsff_init() may fail, let's properly handle the failure.

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-07-20 11:55:18 +02:00
Dave Chinner
12ad3ab661 superblock: move pin_sb_for_writeback() to fs/super.c
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
rename it to grab_super_passive(), and export it via fs/internal.h
for all the VFS code to be able to use.
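
A sketch of what grab_super_passive() does (close to, but not verbatim,
the fs/super.c code):

    bool grab_super_passive(struct super_block *sb)
    {
        spin_lock(&sb_lock);
        if (list_empty(&sb->s_instances)) {   /* being unmounted */
            spin_unlock(&sb_lock);
            return false;
        }
        sb->s_count++;                        /* passive reference */
        spin_unlock(&sb_lock);

        if (down_read_trylock(&sb->s_umount)) {
            if (sb->s_root)                   /* still usable */
                return true;
            up_read(&sb->s_umount);
        }
        put_super(sb);
        return false;
    }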

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:38 -04:00
Dave Chinner
09cc9fc7a7 inode: move to per-sb LRU locks
With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.
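
The per-sb fields involved, sketched (this patch adds the lock; the list
and counter come from the patch just below):

    struct super_block {
        /* ... */
        struct list_head  s_inode_lru;         /* per-sb unused inodes */
        int               s_nr_inodes_unused;  /* length of the list */
        spinlock_t        s_inode_lru_lock;    /* protects both fields */
        /* ... */
    };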

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:36 -04:00
Dave Chinner
98b745c647 inode: Make unused inode LRU per superblock
The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:35 -04:00
Dave Chinner
fcb94f72d3 inode: convert inode_stat.nr_unused to per-cpu counters
Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
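
One way this looks, sketched (names assumed by analogy with the existing
nr_inodes counter in fs/inode.c):

    static DEFINE_PER_CPU(unsigned int, nr_unused);

    static int get_nr_inodes_unused(void)
    {
        int i, sum = 0;
        for_each_possible_cpu(i)
            sum += per_cpu(nr_unused, i);
        return sum < 0 ? 0 : sum;
    }

    /* callers then do: */
        this_cpu_inc(nr_unused);   /* instead of inodes_stat.nr_unused++ */
        this_cpu_dec(nr_unused);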

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:34 -04:00
Dave Chinner
e9299f5058 vmscan: add customisable shrinker batch size
For shrinkers that have their own cond_resched* calls, having
shrink_slab break the work down into small batches is not
particularly efficient. Add a custom batch size field to the struct
shrinker so that shrinkers can use a larger batch size if they
desire.

A value of zero (uninitialised) means "use the default", so
behaviour is unchanged by this patch.
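
Usage is just one extra field in the shrinker (a sketch with a
hypothetical shrinker):

    static struct shrinker foo_shrinker = {
        .shrink = foo_shrink,
        .seeks  = DEFAULT_SEEKS,
        .batch  = 1024,     /* 0 would keep the default batch size */
    };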

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:32 -04:00
Dave Chinner
3567b59aa8 vmscan: reduce wind up shrinker->nr when shrinker can't do work
When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so for as long as the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
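
The resulting clamping logic, roughly (a sketch; max_pass is the cache
size the shrinker reports):

    total_scan += delta;                 /* wind up deferred work */
    /* ... */
    if (delta < max_pass / 4)            /* not a "large scan" ... */
        total_scan = min(total_scan, max_pass / 2);  /* ... cap the windup */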

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:31 -04:00
Dave Chinner
acf92b485c vmscan: shrinker->nr updates race and go wrong
shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.
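
The update pattern, sketched:

    /* claim all deferred work atomically */
    do {
        nr = shrinker->nr;
    } while (cmpxchg(&shrinker->nr, nr, 0) != nr);

    total_scan = nr + delta;
    /* ... scan using the local total_scan, not shrinker->nr ... */

    /* hand any unused scan count back, again without losing updates */
    do {
        nr = shrinker->nr;
        new_nr = nr + total_scan;
    } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);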

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:29 -04:00
Dave Chinner
095760730c vmscan: add shrink_slab tracepoints
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:27 -04:00
Al Viro
a9049376ee make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
... and simplify the living hell out of callers
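
The caller-side simplification this enables, sketched for a typical
->lookup() in a hypothetical filesystem:

    static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
                                     struct nameidata *nd)
    {
        ino_t ino = foo_find_entry(dir, &dentry->d_name); /* hypothetical */
        struct inode *inode = foo_iget(dir->i_sb, ino);   /* may be ERR_PTR */

        /* No IS_ERR(inode) dance needed any more, since
         * d_splice_alias(ERR_PTR(err), dentry) == ERR_PTR(err). */
        return d_splice_alias(inode, dentry);
    }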

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:26 -04:00
Al Viro
0c1aa9a952 deuglify squashfs_lookup()
d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL) plus
returning NULL, so there is no need for that if (inode) ... in there
(or for ERR_PTR(0), for that matter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:24 -04:00
Al Viro
5b4b299cc7 nfsd4_list_rec_dir(): don't bother with reopening rec_file
just rewind it to the beginning before vfs_readdir() and be
done with that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:23 -04:00
Al Viro
e7f5909707 kill useless checks for sb->s_op == NULL
never is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:21 -04:00
Al Viro
0ee5dc676a btrfs: kill magical embedded struct superblock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:20 -04:00
Al Viro
fb408e6ccc get rid of pointless checks for dentry->sb == NULL
it never is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:19 -04:00
Al Viro
a4464dbc0c Make ->d_sb assign-once and always non-NULL
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want).  d_alloc()
itself is left only for the parent != NULL case; it uses __d_alloc() and
inserts the result into the parent's list of children.

Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
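
Sketch of the split (close to, but not verbatim, the fs/dcache.c result):

    /* fs/internal.h */
    extern struct dentry *__d_alloc(struct super_block *sb,
                                    const struct qstr *name);

    /* d_alloc() keeps only the parent != NULL case: */
    struct dentry *d_alloc(struct dentry *parent, const struct qstr *name)
    {
        struct dentry *dentry = __d_alloc(parent->d_sb, name);

        if (!dentry)
            return NULL;

        spin_lock(&parent->d_lock);
        dentry->d_parent = dget_dlock(parent);
        list_add(&dentry->d_u.d_child, &parent->d_subdirs);
        spin_unlock(&parent->d_lock);

        return dentry;
    }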

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:17 -04:00
Al Viro
e3c3d9c838 unexport kern_path_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:16 -04:00
Al Viro
e0a0124936 switch vfs_path_lookup() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:14 -04:00
Al Viro
ed75e95de5 kill lookup_create()
folded into the only caller (kern_path_create())

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:12 -04:00
Al Viro
5da4e68944 devtmpfs: get rid of bogus mkdir in create_path()
We do _NOT_ want to mkdir the path itself - we are preparing to
mknod it, after all.  Normally it'll fail with -ENOENT and
just do nothing, but if somebody has created the parent in
the meanwhile, we'll get buggered...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:11 -04:00
Al Viro
69753a0f14 switch devtmpfs to kern_path_create()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:10 -04:00