Commit Graph

11658 Commits

Author SHA1 Message Date
Paul E. McKenney
27f4d28057 rcu: priority boosting for TREE_PREEMPT_RCU
Add priority boosting for TREE_PREEMPT_RCU, similar to that for
TINY_PREEMPT_RCU.  This is enabled by the default-off RCU_BOOST
kernel parameter.  The priority to which to boost preempted
RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter
(defaulting to real-time priority 1) and the time to wait before
boosting the readers who are blocking a given grace period is
controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to
500 milliseconds).

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:55 -07:00
Paul E. McKenney
a26ac2455f rcu: move TREE_RCU from softirq to kthread
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers.  Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked.  If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.

But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.

Also add comments and properly synchronized all accesses to
rcu_cpu_kthread_task, as suggested by Lai Jiangshan.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
12f5f524ca rcu: merge TREE_PREEPT_RCU blocked_tasks[] lists
Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the
rcu_node structure into a single ->blkd_tasks list with ->gp_tasks
and ->exp_tasks tail pointers.  This is in preparation for RCU priority
boosting, which will add a third dimension to the combinatorial explosion
in the ->blocked_tasks[] case, but simply a third pointer in the new
->blkd_tasks case.

Also update documentation to reflect blocked_tasks[] merge

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
e59fb3120b rcu: Decrease memory-barrier usage based on semi-formal proof
Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
invocations in rcu_process_callbacks() that are no longer needed, but
sheer paranoia prevented them from being removed.  This commit removes
them and provides a proof of correctness in their absence.  It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section.  This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.  (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.  For the short term, better safe than sorry.)

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock.  The proof is as follows:

1.	A given CPU declares a quiescent state under the protection of
	its leaf rcu_node's lock.

2.	If there is more than one level of rcu_node hierarchy, the
	last CPU to declare a quiescent state will also acquire the
	->lock of the next rcu_node up in the hierarchy,  but only
	after releasing the lower level's lock.  The acquisition of this
	lock clearly cannot occur prior to the acquisition of the leaf
	node's lock.

3.	Step 2 repeats until we reach the root rcu_node structure.
	Please note again that only one lock is held at a time through
	this process.  The acquisition of the root rcu_node's ->lock
	must occur after the release of that of the leaf rcu_node.

4.	At this point, we set the ->completed field in the rcu_state
	structure in rcu_report_qs_rsp().  However, if the rcu_node
	hierarchy contains only one rcu_node, then in theory the code
	preceding the quiescent state could leak into the critical
	section.  We therefore precede the update of ->completed with a
	memory barrier.  All CPUs will therefore agree that any updates
	preceding any report of a quiescent state will have happened
	before the update of ->completed.

5.	Regardless of whether a new grace period is needed, rcu_start_gp()
	will propagate the new value of ->completed to all of the leaf
	rcu_node structures, under the protection of each rcu_node's ->lock.
	If a new grace period is needed immediately, this propagation
	will occur in the same critical section that ->completed was
	set in, but courtesy of the memory barrier in #4 above, is still
	seen to follow any pre-quiescent-state activity.

6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
	aware of the end of the old grace period and therefore makes
	any RCU callbacks that were waiting on that grace period eligible
	for invocation.

	If this CPU is the same one that detected the end of the grace
	period, and if there is but a single rcu_node in the hierarchy,
	we will still be in the single critical section.  In this case,
	the memory barrier in step #4 guarantees that all callbacks will
	be seen to execute after each CPU's quiescent state.

	On the other hand, if this is a different CPU, it will acquire
	the leaf rcu_node's ->lock, and will again be serialized after
	each CPU's quiescent state for the old grace period.

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Paul E. McKenney
a00e0d714f rcu: Remove conditional compilation for RCU CPU stall warnings
The RCU CPU stall warnings can now be controlled using the
rcu_cpu_stall_suppress boot-time parameter or via the same parameter
from sysfs.  There is therefore no longer any reason to have
kernel config parameters for this feature.  This commit therefore
removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE
kernel config parameters.  The RCU_CPU_STALL_TIMEOUT parameter remains
to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter
remains to allow task-stall information to be suppressed if desired.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05 23:16:54 -07:00
Ingo Molnar
4d70230bb4 Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into perf/urgent 2011-05-06 08:11:28 +02:00
john stultz
e05b2efb82 clocksource: Install completely before selecting
Christian Hoffmann reported that the command line clocksource override
with acpi_pm timer fails:

 Kernel command line: <SNIP> clocksource=acpi_pm
 hpet clockevent registered
 Switching to clocksource hpet
 Override clocksource acpi_pm is not HRT compatible.
 Cannot switch while in HRT/NOHZ mode.

The watchdog code is what enables CLOCK_SOURCE_VALID_FOR_HRES, but we
actually end up selecting the clocksource before we enqueue it into
the watchdog list, so that's why we see the warning and fail to switch
to acpi_pm timer as requested. That's particularly bad when we want to
debug timekeeping related problems in early boot.

Put the selection call last.

Reported-by: Christian Hoffmann <email@christianhoffmann.info>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: stable@kernel.org # 32...
Link: http://lkml.kernel.org/r/%3C1304558210.2943.24.camel%40work-vm%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-05 15:23:26 +02:00
Ingo Molnar
98bb318864 Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/urgent 2011-05-04 20:33:42 +02:00
Vladimir Davydov
931aeeda0d sched: Remove unused 'this_best_prio arg' from balance_tasks()
It's passed across multiple functions but is never really used, so
remove it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1304447467-29200-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-04 09:07:21 +02:00
Ingo Molnar
e7e7ee2eab perf events: Clean up definitions and initializers, update copyrights
Fix a few inconsistent style bits that were added over the past few
months.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-yv4hwf9yhnzoada8pcpb3a97@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-04 08:49:24 +02:00
Borislav Petkov
48dbb6dc86 hw breakpoints: Move to kernel/events/
As part of the events sybsystem unification, relocate hw_breakpoint.c
into its new destination.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2011-05-03 15:26:43 +02:00
Borislav Petkov
fae85b7c8b perf: Start the restructuring
mv kernel/perf_event.c -> kernel/events/core.c. From there, all further
sensible splitting can happen. The idea is that due to perf_event.c
becoming pretty sizable and with the advent of the marriage with ftrace,
splitting functionality into its logical parts should help speeding up
the unification and to manage the complexity of the subsystem.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2011-05-03 12:59:43 +02:00
Mike Frysinger
942c3c5c32 hrtimer: Make lookup table const
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Link: http://lkml.kernel.org/r/%3C1304364267-14489-1-git-send-email-vapier%40gentoo.org%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-02 21:37:57 +02:00
Thomas Gleixner
3687a2c0d8 Merge branch 'linus' into timers/core
Reason: Pick up the hrtimer_clock_to_base_table fix from mainline

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-02 21:37:08 +02:00
John Stultz
472647dcd7 timers: Fix alarmtimer build issues when CONFIG_RTC_CLASS=n
Ingo pointed out that the alarmtimers won't build if CONFIG_RTC_CLASS=n.
This patch adds proper ifdefs to the alarmtimer code to disable the rtc
usage if it is not built in.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-02 21:36:57 +02:00
Geert Uytterhoeven
94b2c363dc genirq: Fix typo CONFIG_GENIRC_IRQ_SHOW_LEVEL
commit ab7798ffcf ("genirq: Expand generic
show_interrupts()") added the Kconfig option GENERIC_IRQ_SHOW_LEVEL to
accomodate PowerPC, but this doesn't actually enable the functionality due
to a typo in the #ifdef check.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linux/PPC Development <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/%3Calpine.DEB.2.00.1104302251370.19068%40ayla.of.borg%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-02 21:16:37 +02:00
Thomas Gleixner
c42321c76b genirq: Make generic irq chip depend on CONFIG_GENERIC_IRQ_CHIP
Only compile it in when there are users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
2011-05-02 18:16:22 +02:00
Ingo Molnar
ac0a3260f3 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core 2011-05-01 19:11:42 +02:00
Ingo Molnar
809435ff4f Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core 2011-05-01 19:09:39 +02:00
Linus Torvalds
3fd9952df4 Merge branch 'fixes-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'fixes-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix deadlock in worker_maybe_bind_and_lock()
  workqueue: Document debugging tricks

Fix up trivial spelling conflict in kernel/workqueue.c
2011-04-30 09:15:40 -07:00
Steven Rostedt
b9df92d2a9 ftrace: Consolidate the function match routines for normal and mods
The code used for matching functions is almost identical between normal
selecting of functions and using the :mod: feature of set_ftrace_notrace.

Consolidate the two users into one function.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:53:14 -04:00
Steven Rostedt
491d0dcfb9 ftrace: Consolidate updating of ftrace_trace_function
There are three locations that perform almost identical functions in order
to update the ftrace_trace_function (the ftrace function variable that gets
called by mcount).

Consolidate these into a single function called update_ftrace_function().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:53:11 -04:00
Steven Rostedt
996e87be7f ftrace: Move record update for normal and modules into a separate function
The updating of a function record is moved to a single function. This will allow
us to add specific changes in one location for both modules and kernel
functions.

Later patches will determine if the function record itself needs to be updated
(which enables the mcount caller), or just the ftrace_ops needs the update.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:53:08 -04:00
Steven Rostedt
d2c8c3eafb ftrace: Remove FTRACE_FL_CONVERTED flag
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_CONVERTED flag is pointless.

The FTRACE_FL_CONVERTED flag was used to denote records that were
successfully converted from mcount calls into nops. But if a single
record fails, all of ftrace is disabled.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:53:04 -04:00
Steven Rostedt
45a4a2372b ftrace: Remove FTRACE_FL_FAILED flag
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_FAILED flag is pointless.

Removing this flag simplifies some of the code, but some ftrace_disabled
checks needed to be added or move around a little.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:53:01 -04:00
Steven Rostedt
3499e46114 ftrace: Remove failures file
The failures file in the debugfs tracing directory would list the
functions that failed to convert when the old dead ftrace daemon
tried to update code but failed. Since this code is now dead along
with the daemon the failures file is useless. Remove it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:52:58 -04:00
Steven Rostedt
8ab2b7efd3 ftrace: Remove unnecessary disabling of irqs
The disabling of interrupts around ftrace_update_code() was used
to protect against the evil ftrace daemon from years past. But that
daemon has long been killed. It is safe to keep interrupts enabled
while updating the initial mcount into nops.

The ftrace_mutex is also held which keeps other users at bay.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:52:55 -04:00
Steven Rostedt
0778d9ad33 ftrace: Make FTRACE_WARN_ON() work in if condition
Let FTRACE_WARN_ON() be used as a stand alone statement or
inside a conditional: if (FTRACE_WARN_ON(x))

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:52:52 -04:00
Steven Rostedt
058e297d34 ftrace: Only update the function code on write to filter files
If function tracing is enabled, a read of the filter files will
cause the call to stop_machine to update the function trace sites.
It should only call stop_machine on write.

Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-29 22:42:59 -04:00
Rafael J. Wysocki
85eb8c8d0b PM / Runtime: Generic clock manipulation rountines for runtime PM (v6)
Many different platforms and subsystems may want to disable device
clocks during suspend and enable them during resume which is going to
be done in a very similar way in all those cases.  For this reason,
provide generic routines for the manipulation of device clocks during
suspend and resume.

Convert the ARM shmobile platform to using the new routines.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-04-30 00:25:44 +02:00
Linus Torvalds
40a963502c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf, x86, nmi: Move LVT un-masking into irq handlers
  perf events, x86: Work around the Nehalem AAJ80 erratum
  perf, x86: Fix BTS condition
  ftrace: Build without frame pointers on Microblaze
2011-04-29 15:08:53 -07:00
Linus Torvalds
fcc4dc7151 Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically
  rtc: max8925: Call dev_set_drvdata before rtc_device_register
2011-04-29 15:08:31 -07:00
Tejun Heo
5035b20fa5 workqueue: fix deadlock in worker_maybe_bind_and_lock()
If a rescuer and stop_machine() bringing down a CPU race with each
other, they may deadlock on non-preemptive kernel.  The CPU won't
accept a new task, so the rescuer can't migrate to the target CPU,
while stop_machine() can't proceed because the rescuer is holding one
of the CPU retrying migration.  GCWQ_DISASSOCIATED is never cleared
and worker_maybe_bind_and_lock() retries indefinitely.

This problem can be reproduced semi reliably while the system is
entering suspend.

 http://thread.gmane.org/gmane.linux.kernel/1122051

A lot of kudos to Thilo-Alexander for reporting this tricky issue and
painstaking testing.

stable: This affects all kernels with cmwq, so all kernels since and
        including v2.6.36 need this fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Cc: stable@kernel.org
2011-04-29 18:08:37 +02:00
Thomas Gleixner
ce31332d3c hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically
Sedat and Bruno reported RCU stalls which turned out to be caused by
the following;

sched_init() calls init_rt_bandwidth() which calls hrtimer_init()
_BEFORE_ hrtimers_init() is called. While not entirely correct this
worked because hrtimer_init() only accessed statically initialized
data (hrtimer_bases.clock_base[CLOCK_MONOTONIC])

Commit e06383db9 (hrtimers: extend hrtimer base code to handle more
then 2 clockids) added an indirection to the hrtimer_bases.clock_base
lookup to avoid gap handling in the hot path. The table which is used
for the translataion from CLOCK_ID to HRTIMER_BASE index is
initialized at runtime in hrtimers_init(). So the early call of the
scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME.

Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is
armed and the wall clock time is set (e.g. ntpdate in the early boot
process - which also gives the problem deterministic behaviour
i.e. magic recovery after N hours), then the timer ends up with an
expiry time far into the future. That breaks the RT throttler
mechanism as rt runtime is accumulated and never cleared, so the rt
throttler detects a false cpu hog condition and blocks all RT tasks
until the timer finally expires. That in turn stalls the RCU thread of
TINYRCU which leads to an huge amount of RCU callbacks piling up.

Make the translation table statically initialized, so we are back to
the status of <= 2.6.39.

Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: John stultz <johnstul@us.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3E
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-29 10:57:11 +02:00
John Stultz
7068b7a162 timers: Remove delayed irqwork from alarmtimers implementation
Thomas asked about the delayed irq work in the alarmtimers code,
and I realized that it was a legacy from when the alarmtimer base
lock was a mutex (due to concerns that we'd be interacting with
the RTC device, which is protected by mutexes).

Since the alarmtimer base is now protected by a spinlock, we can
simply execute alarmtimer functions directly from the hrtimer
callback. Should any future alarmtimer functions sleep, they can
simply manage scheduling any delayed work themselves.

CC: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-04-28 13:39:18 -07:00
John Stultz
180bf812ce timers: Improve alarmtimer comments and minor fixes
This patch addresses a number of minor comment improvements and
other minor issues from Thomas' review of the alarmtimers code.

CC: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-04-28 13:39:17 -07:00
Hillf Danton
1409f141ac kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog
In corner cases where softlockup watchdog is not setup successfully, the
relevant nmi perf event for hardlockup watchdog could be disabled, then
the status of the underlying hardware remains unchanged.

Also, if the kthread doesn't start then the hrtimer won't run and the
hardlockup detector will falsely fire.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-28 11:28:21 -07:00
Jeff Mahoney
e11feaa119 watchdog, hung_task_timeout: Add Kconfig configurable default
This patch allows the default value for sysctl_hung_task_timeout_secs
to be set at build time. The feature carries virtually no overhead,
so it makes sense to keep it enabled. On heavily loaded systems, though,
it can end up triggering stack traces when there is no bug other than
the system being underprovisioned. We use this patch to keep the hung task
facility available but disabled at boot-time.

The default of 120 seconds is preserved. As a note, commit e162b39a may
have accidentally reverted commit fb822db4, which raised the default from
120 seconds to 480 seconds.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Link: http://lkml.kernel.org/r/4DB8600C.8080000@suse.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-28 09:13:17 +02:00
Ingo Molnar
32673822e4 Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
Conflicts:
	include/linux/perf_event.h

Merge reason: pick up the latest jump-label enhancements, they are cooked ready.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-27 10:40:21 +02:00
Ingo Molnar
6c8a721327 Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent 2011-04-27 10:31:29 +02:00
John Stultz
9a7adcf5c6 timers: Posix interface for alarm-timers
This patch exposes alarm-timers to userland via the posix clock
and timers interface, using two new clockids: CLOCK_REALTIME_ALARM
and CLOCK_BOOTTIME_ALARM. Both clockids behave identically to
CLOCK_REALTIME and CLOCK_BOOTTIME, respectively, but timers
set against the _ALARM suffixed clockids will wake the system if
it is suspended.

Some background can be found here:
	https://lwn.net/Articles/429925/

The concept for Alarm-timers was inspired by the Android Alarm
driver (by Arve Hjønnevåg) found in the Android kernel tree.

See: http://android.git.kernel.org/?p=kernel/common.git;a=blob;f=drivers/rtc/alarm.c;h=1250edfbdf3302f5e4ea6194847c6ef4bb7beb1c;hb=android-2.6.36

While the in-kernel interface is pretty similar between
alarm-timers and Android alarm driver, the user-space interface
for the Android alarm driver is via ioctls to a new char device.
As mentioned above, I've instead chosen to export this functionality
via the posix interface, as it seemed a little simpler and avoids
creating duplicate interfaces to things like CLOCK_REALTIME and
CLOCK_MONOTONIC under alternate names (ie:ANDROID_ALARM_RTC and
ANDROID_ALARM_SYSTEMTIME).

The semantics of the Android alarm driver are different from what
this posix interface provides. For instance, threads other then
the thread waiting on the Android alarm driver are able to modify
the alarm being waited on. Also this interface does not allow
the same wakelock semantics that the Android driver provides
(ie: kernel takes a wakelock on RTC alarm-interupt, and holds it
through process wakeup, and while the process runs, until the
process either closes the char device or calls back in to wait
on a new alarm).

One potential way to implement similar semantics may be via
the timerfd infrastructure, but this needs more research.

There may also need to be some sort of sysfs system level policy
hooks that allow alarm timers to be disabled to keep them
from firing at inappropriate times (ie: laptop in a well insulated
bag, mid-flight).

CC: Arve Hjønnevåg <arve@android.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-04-26 14:01:46 -07:00
John Stultz
ff3ead96d1 timers: Introduce in-kernel alarm-timer interface
This provides the in kernel interface and infrastructure for
alarm-timers.

Alarm-timers are a hybrid style timer, similar to hrtimers,
but when the system is suspended, the RTC device is set to
fire and wake the system for when the soonest alarm-timer
expires.

The concept for Alarm-timers was inspired by the Android Alarm
driver (by Arve Hjønnevåg) found in the Android kernel tree.

See: http://android.git.kernel.org/?p=kernel/common.git;a=blob;f=drivers/rtc/alarm.c;h=1250edfbdf3302f5e4ea6194847c6ef4bb7beb1c;hb=android-2.6.36

This in-kernel interface should be fairly compatible with the
Android alarm driver in-kernel interface, but has the advantage
of utilizing the new RTC timerqueue code instead of doing direct
RTC manipulation.

CC: Arve Hjønnevåg <arve@android.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-04-26 14:01:44 -07:00
John Stultz
304529b1b6 time: Add timekeeping_inject_sleeptime
Some platforms cannot implement read_persistent_clock, as
their RTC devices are only accessible when interrupts are enabled.
This keeps them from being used by the timekeeping code on resume
to measure the time in suspend.

The RTC layer tries to work around this, by calling do_settimeofday
on resume after irqs are reenabled to set the time properly. However,
this only corrects CLOCK_REALTIME, and does not properly adjust
the sleep time value. This causes btime in /proc/stat to be incorrect
as well as making the new CLOCK_BOTTTIME inaccurate.

This patch resolves the issue by introducing a new timekeeping hook
to allow the RTC layer to inject the sleep time on resume.

The code also checks to make sure that read_persistent_clock is
nonfunctional before setting the sleep time, so that should the RTC's
HCTOSYS option be configured in on a system that does support
read_persistent_clock we will not increase the total_sleep_time twice.

CC: Arve Hjønnevåg <arve@android.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-04-26 14:01:41 -07:00
Hillf Danton
1437f5bca3 sched: Remove noop in alloc_rt_sched_group()
The rq varible, though computed for each possible cpu, has nothing to
do in the function, so it can be removed.

This also eliminates a build warning.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTin-FfQfqW5ym1iuEmrk8s777Y1LAg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 13:34:08 +02:00
Jonathan Cameron
e7e09cd667 params.c: Use new strtobool function to process boolean inputs
No functional changes.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-25 16:04:52 -07:00
Frederic Weisbecker
bf26c01849 ptrace: Prepare to fix racy accesses on task breakpoints
When a task is traced and is in a stopped state, the tracer
may execute a ptrace request to examine the tracee state and
get its task struct. Right after, the tracee can be killed
and thus its breakpoints released.
This can happen concurrently when the tracer is in the middle
of reading or modifying these breakpoints, leading to dereferencing
a freed pointer.

Hence, to prepare the fix, create a generic breakpoint reference
holding API. When a reference on the breakpoints of a task is
held, the breakpoints won't be released until the last reference
is dropped. After that, no more ptrace request on the task's
breakpoints can be serviced for the tracer.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: v2.6.33.. <stable@kernel.org>
Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
2011-04-25 17:28:24 +02:00
Jonathan Corbet
625f2a378e sched: Get rid of lock_depth
Neil Brown pointed out that lock_depth somehow escaped the BKL
removal work.  Let's get rid of it now.

Note that the perf scripting utilities still have a bunch of
code for dealing with common_lock_depth in tracepoints; I have
left that in place in case anybody wants to use that code with
older kernels.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-24 13:18:38 +02:00
Linus Torvalds
686c4cbb10 Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Add missing syscore_suspend() and syscore_resume() calls
  PM: Fix error code paths executed after failing syscore_suspend()
2011-04-23 22:35:16 -07:00
Thomas Gleixner
cfefd21e69 genirq: Add chip suspend and resume callbacks
These callbacks are only called in the syscore suspend/resume code on
interrupt chips which have been registered via the generic irq chip
mechanism. Calling those callbacks per irq would be rather icky, but
with the generic irq chip mechanism we can call this per registered
chip.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
2011-04-23 15:56:24 +02:00
Thomas Gleixner
7d82806247 genirq: Implement a generic interrupt chip
Implement a generic interrupt chip, which is configurable and is able
to handle the most common irq chip implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by; Kevin Hilman <khilman@ti.com>
2011-04-23 15:56:24 +02:00
Paul Mundt
7f1b1244e1 genirq: Support per-IRQ thread disabling.
This adds support for disabling threading on a per-IRQ basis via the IRQ
status instead of the IRQ flow, which is necessary for interrupts that
don't follow the natural IRQ flow channels, such as those that are
virtually created.

The new APIs added are simply:

	irq_set_thread()
	irq_set_nothread()

which follow the rest of the IRQ status routines.

Chained handlers also have IRQ_NOTHREAD set on them automatically, making
the lack of threading explicit rather than implicit. Subsequently, the
nothread flag can be viewed through the standard genirq debugging
facilities.

[ tglx: Fixed cleanup fallout ]

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Link: http://lkml.kernel.org/r/%3C20110406210135.GF18426%40linux-sh.org%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-23 15:56:24 +02:00
Steven Rostedt
e0944ee63f lockdep: Remove cmpxchg to update nr_chain_hlocks
For some reason nr_chain_hlocks is updated with cmpxchg, but
this is performed inside of the lockdep global "grab_lock()",
which also makes simple modification of this variable atomic.

Remove the cmpxchg logic for updating nr_chain_hlocks and
simplify the code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014300.727863282@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:59 +02:00
Steven Rostedt
282b5c2f6f lockdep: Print a nicer description for simple irq lock inversions
Lockdep output can be pretty cryptic, having nicer output
can save a lot of head scratching. When a simple irq inversion
scenario is detected by lockdep (lock A taken in interrupt
context but also in thread context without disabling interrupts)
we now get the following (hopefully more informative) output:

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(lockA);
  <Interrupt>
    lock(lockA);

 *** DEADLOCK ***

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014300.436140880@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:59 +02:00
Steven Rostedt
6be8c3935b lockdep: Replace "Bad BFS generated tree" message with something less cryptic
The message of "Bad BFS generated tree" is a bit confusing.
Replace it with a more sane error message.

Thanks to Peter Zijlstra for helping me come up with a better
message.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014300.135521252@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:59 +02:00
Steven Rostedt
dad3d7435e lockdep: Print a nicer description for irq inversion bugs
Irq inversion and irq dependency bugs are only subtly
different. The diffenerence lies where the interrupt occurred.

For irq dependency:

	irq_disable
	lock(A)
	lock(B)
	unlock(B)
	unlock(A)
	irq_enable

	lock(B)
	unlock(B)

 	<interrupt>
	  lock(A)

The interrupt comes in after it has been established that lock A
can be held when taking an irq unsafe lock. Lockdep detects the
problem when taking lock A in interrupt context.

With the irq_inversion the irq happens before it is established
and lockdep detects the problem with the taking of lock B:

 	<interrupt>
	  lock(A)

	irq_disable
	lock(A)
	lock(B)
	unlock(B)
	unlock(A)
	irq_enable

	lock(B)
	unlock(B)

Since the problem with the locking logic for both of these issues
is in actuality the same, they both should report the same scenario.
This patch implements that and prints this:

other info that might help us debug this:

Chain exists of:
  &rq->lock --> lockA --> lockC

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(lockC);
                               local_irq_disable();
                               lock(&rq->lock);
                               lock(lockA);
  <Interrupt>
    lock(&rq->lock);

 *** DEADLOCK ***

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014259.910720381@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:58 +02:00
Steven Rostedt
48702ecf30 lockdep: Print a nicer description for simple deadlocks
Lockdep output can be pretty cryptic, having nicer output
can save a lot of head scratching. When a simple deadlock
scenario is detected by lockdep (lock A -> lock A) we now
get the following new output:

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(lock)->rlock);
  lock(&(lock)->rlock);

 *** DEADLOCK ***

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014259.643930104@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:58 +02:00
Steven Rostedt
f4185812aa lockdep: Print a nicer description for normal deadlocks
The lockdep output can be pretty cryptic, having nicer output
can save a lot of head scratching. When a normal deadlock
scenario is detected by lockdep (lock A -> lock B and there
exists a place where lock B -> lock A) we now get the following
new output:

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(lockB);
                               lock(lockA);
                               lock(lockB);
  lock(lockA);

 *** DEADLOCK ***

On cases where there's a deeper chair, it shows the partial
chain that can cause the issue:

Chain exists of:
  lockC --> lockA --> lockB

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(lockB);
                               lock(lockA);
                               lock(lockB);
  lock(lockC);

 *** DEADLOCK ***

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014259.380621789@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:57 +02:00
Steven Rostedt
3003eba313 lockdep: Print a nicer description for irq lock inversions
Locking order inversion due to interrupts is a subtle problem.

When an irq lockiinversion discovered by lockdep it currently
reports something like:

[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]

... and then prints out the locks that are involved, as back traces.

Judging by lkml feedback developers were routinely confused by what
a HARDIRQ->safe to unsafe issue is all about, and sometimes even
blew it off as a bug in lockdep.

It is not obvious when lockdep prints this message about a lock that
is never taken in interrupt context.

After explaining the problems that lockdep is reporting, I
decided to add a description of the problem in visual form. Now
the following is shown:

 ---
other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(lockA);
                               local_irq_disable();
                               lock(&rq->lock);
                               lock(lockA);
  <Interrupt>
    lock(&rq->lock);

 *** DEADLOCK ***

 ---

The above is the case when the unsafe lock is taken while
holding a lock taken in irq context. But when a lock is taken
that also grabs a unsafe lock, the call chain is shown:

 ---
other info that might help us debug this:

Chain exists of:
  &rq->lock --> lockA --> lockC

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(lockC);
                               local_irq_disable();
                               lock(&rq->lock);
                               lock(lockA);
  <Interrupt>
    lock(&rq->lock);

 *** DEADLOCK ***

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110421014259.132728798@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 11:06:57 +02:00
Michal Simek
d20ac25282 ftrace: Build without frame pointers on Microblaze
Microblaze doesn't need/support FRAME_POINTERS in order to have a working
function tracer.

The patch remove Kconfig warning.

Warning log:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP &&
FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct
dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 ||
SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)

Signed-off-by: Michal Simek <monstr@monstr.eu>
Link: http://lkml.kernel.org/r/1301908812-8119-2-git-send-email-monstr@monstr.eu
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-21 09:06:24 -04:00
Rakib Mullick
d3bf52e998 sched: Remove obsolete comment from scheduler_tick()
scheduler_tick() is no longer called by fork code - this got discarded
a long time ago by commit bc947631d1 ("sched: improve efficiency
of sched_fork()").

So, remove the comment which still claims otherwise.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimO4iGP0QpaHO1HHF1QOnVcQpc0cw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21 11:41:36 +02:00
Ingo Molnar
42ac9e87fd Merge commit 'v2.6.39-rc4' into sched/core
Merge reason: Pick up upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21 11:39:28 +02:00
Ludwig Nussel
088ab0b4d8 kernel/ksysfs.c: expose file_caps_enabled in sysfs
A kernel booted with no_file_caps allows to install fscaps on a binary
but doesn't actually honor the fscaps when running the binary. Userspace
currently has no sane way to determine whether installing fscaps
actually has any effect. Since parsing /proc/cmdline is fragile this
patch exposes the current setting (1 or 0) via /sys/kernel/fscaps

Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-19 16:45:51 -07:00
Rafael J. Wysocki
19234c0819 PM: Add missing syscore_suspend() and syscore_resume() calls
Device suspend/resume infrastructure is used not only by the suspend
and hibernate code in kernel/power, but also by APM, Xen and the
kexec jump feature.  However, commit 40dc166cb5
(PM / Core: Introduce struct syscore_ops for core subsystems PM)
failed to add syscore_suspend() and syscore_resume() calls to that
code, which generally leads to breakage when the features in question
are used.

To fix this problem, add the missing syscore_suspend() and
syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c
and drivers/xen/manage.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
2011-04-20 00:36:11 +02:00
Linus Torvalds
4ae0ff16ef Merge branch 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  RTC: rtc-omap: Fix a leak of the IRQ during init failure
  posix clocks: Replace mutex with reader/writer semaphore
2011-04-19 10:56:46 -07:00
Peter Zijlstra
057f3fadb3 sched: Fix sched_domain iterations vs. RCU
Vladis Kletnieks reported a new RCU debug warning in the scheduler.

Since commit dce840a087 ("sched: Dynamically allocate sched_domain/
sched_group data-structures") the sched_domain trees are protected by
RCU instead of RCU-sched.

This means that we need to include rcu_read_lock() protection when we
iterate them since disabling preemption doesn't suffice anymore.

Reported-by: Valdis.Kletnieks@vt.edu
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302882741.2388.241.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:56:54 +02:00
Venkatesh Pallipadi
2f36825b17 sched: Next buddy hint on sleep and preempt path
When a task in a taskgroup sleeps, pick_next_task starts all the way back at
the root and picks the task/taskgroup with the min vruntime across all
runnable tasks.

But when there are many frequently sleeping tasks across different taskgroups,
it makes better sense to stay with same taskgroup for its slice period (or
until all tasks in the taskgroup sleeps) instead of switching cross taskgroup
on each sleep after a short runtime.

This helps specifically where taskgroups corresponds to a process with
multiple threads. The change reduces the number of CR3 switches in this case.

Example:

Two taskgroups with 2 threads each which are running for 2ms and
sleeping for 1ms. Looking at sched:sched_switch shows:

BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
      cpu-soaker-5004  [003]  3683.391089
      cpu-soaker-5016  [003]  3683.393106
      cpu-soaker-5005  [003]  3683.395119
      cpu-soaker-5017  [003]  3683.397130
      cpu-soaker-5004  [003]  3683.399143
      cpu-soaker-5016  [003]  3683.401155
      cpu-soaker-5005  [003]  3683.403168
      cpu-soaker-5017  [003]  3683.405170

AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
      cpu-soaker-21890 [003]   865.895494
      cpu-soaker-21935 [003]   865.897506
      cpu-soaker-21934 [003]   865.899520
      cpu-soaker-21935 [003]   865.901532
      cpu-soaker-21934 [003]   865.903543
      cpu-soaker-21935 [003]   865.905546
      cpu-soaker-21891 [003]   865.907548
      cpu-soaker-21890 [003]   865.909560
      cpu-soaker-21891 [003]   865.911571
      cpu-soaker-21890 [003]   865.913582
      cpu-soaker-21891 [003]   865.915594
      cpu-soaker-21934 [003]   865.917606

Similar problem is there when there are multiple taskgroups and say a task A
preempts currently running task B of taskgroup_1. On schedule, pick_next_task
can pick an unrelated task on taskgroup_2. Here it would be better to give some
preference to task B on pick_next_task.

A simple (may be extreme case) benchmark I tried was tbench with 2 tbench
client processes with 2 threads each running on a single CPU. Avg throughput
across 5 50 sec runs was:

 BEFORE: 105.84 MB/sec
 AFTER:  112.42 MB/sec

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:08:38 +02:00
Venkatesh Pallipadi
69c80f3e9d sched: Make set_*_buddy() work on non-task entities
Make set_*_buddy() work on non-task sched_entity, to facilitate the
use of next_buddy to cache a group entity in cases where one of the
tasks within that entity sleeps or gets preempted.

set_skip_buddy() was incorrectly comparing the policy of task that is
yielding to be not equal to SCHED_IDLE. Yielding should happen even
when task yielding is SCHED_IDLE. This change removes the policy check
on the yielding task.

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:08:37 +02:00
Rafael J. Wysocki
2ca6f62f59 PM: Fix error code paths executed after failing syscore_suspend()
If syscore_suspend() fails in suspend_enter(), create_image() or
resume_target_kernel(), it is necessary to call sysdev_resume(),
because sysdev_suspend() has been called already and succeeded
and we are going to abort the transition.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-18 23:58:59 +02:00
Linus Torvalds
c78193e9c7 next_pidmap: fix overflow condition
next_pidmap() just quietly accepted whatever 'last' pid that was passed
in, which is not all that safe when one of the users is /proc.

Admittedly the proc code should do some sanity checking on the range
(and that will be the next commit), but that doesn't mean that the
helper functions should just do that pidmap pointer arithmetic without
checking the range of its arguments.

So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
doesn't really matter, the for-loop does check against the end of the
pidmap array properly (it's only the actual pointer arithmetic overflow
case we need to worry about, and going one bit beyond isn't going to
overflow).

[ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]

Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Analyzed-by: Robert Święcki <robert@swiecki.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-18 10:35:30 -07:00
Ingo Molnar
6ddafdaab3 Merge branch 'sched/locking' into sched/core
Merge reason: the rq locking changes are stable,
              propagate them into the .40 queue.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-18 14:53:33 +02:00
Richard Cochran
1791f88143 posix clocks: Replace mutex with reader/writer semaphore
A dynamic posix clock is protected from asynchronous removal by a mutex.
However, using a mutex has the unwanted effect that a long running clock
operation in one process will unnecessarily block other processes.

For example, one process might call read() to get an external time stamp
coming in at one pulse per second. A second process calling clock_gettime
would have to wait for almost a whole second.

This patch fixes the issue by using a reader/writer semaphore instead of
a mutex.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-18 10:39:38 +02:00
Linus Torvalds
d733ed6c34 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: make unplug timer trace event correspond to the schedule() unplug
  block: let io_schedule() flush the plug inline
2011-04-16 10:33:41 -07:00
Linus Torvalds
fdfc552abe Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()
  perf: Fix a build error with some GCC versions

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix erroneous all_pinned logic
  sched: Fix sched-domain avg_load calculation

* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  RTC: rtc-mrst: follow on to the change of rtc_device_register()
  RTC: add missing "return 0" in new alarm func for rtc-bfin.c
  RTC: Fix s3c compile error due to missing s3c_rtc_setpie
  RTC: Fix early irqs caused by calling rtc_set_alarm too early

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, amd: Disable GartTlbWlkErr when BIOS forgets it
  x86, NUMA: Fix fakenuma boot failure
  x86/mrst: Fix boot crash caused by incorrect pin to irq mapping
  x86/ce4100: Add reg property to bridges
2011-04-16 09:45:08 -07:00
Jens Axboe
49cac01e1f block: make unplug timer trace event correspond to the schedule() unplug
It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-16 13:51:05 +02:00
Jens Axboe
a237c1c5bc block: let io_schedule() flush the plug inline
Linus correctly observes that the most important dispatch cases
are now done from kblockd, this isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the "accidental"
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-16 13:27:55 +02:00
Linus Torvalds
5853b4f06f Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: only force kblockd unplugging from the schedule() path
  block: cleanup the block plug helper functions
  block, blk-sysfs: Use the variable directly instead of a function call
  block: move queue run on unplug to kblockd
  block: kill queue_sync_plugs()
  block: readd plug trace event
  block: add callback function for unplug notification
  block: add comment on why we save and disable interrupts in flush_plug_list()
  block: fixup block IO unplug trace call
  block: remove block_unplug_timer() trace point
  block: splice plug list to local context
2011-04-15 08:01:13 -07:00
Darren Hart
0cd9c6494e futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup
The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to
restart futex_wait() without a timeout after a signal.

Commit b41277dc7a in 2.6.38 introduced the regression by accidentally
removing the the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup
of the restart block. Restore the originaly behavior.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922

Reported-by: Tim Smith <tsmith201104@yahoo.com>
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-15 16:34:32 +02:00
Peter Zijlstra
bd8e7dded8 sched: Remove need_migrate_task()
Oleg noticed that need_migrate_task() doesn't need the ->on_cpu check
now that ttwu() doesn't do remote enqueues for !->on_rq && ->on_cpu,
so remove the helper and replace the single instance with a direct
->on_rq test.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.556674812@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:41 +02:00
Peter Zijlstra
317f394160 sched: Move the second half of ttwu() to the remote cpu
Now that we've removed the rq->lock requirement from the first part of
ttwu() and can compute placement without holding any rq->lock, ensure
we execute the second half of ttwu() on the actual cpu we want the
task to run on.

This avoids having to take rq->lock and doing the task enqueue
remotely, saving lots on cacheline transfers.

As measured using: http://oss.oracle.com/~mason/sembench.c

  $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done
  $ echo 4096 32000 64 128 > /proc/sys/kernel/sem
  $ ./sembench -t 2048 -w 1900 -o 0

  unpatched: run time 30 seconds 647278 worker burns per second
  patched:   run time 30 seconds 816715 worker burns per second

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl
2011-04-14 08:52:41 +02:00
Peter Zijlstra
c05fbafba1 sched: Restructure ttwu() some more
Factor our helper functions to make the inner workings of try_to_wake_up()
more obvious, this also allows for adding remote queues.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.475848012@chello.nl
2011-04-14 08:52:40 +02:00
Peter Zijlstra
23f41eeb42 sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
The ttwu_post_activation() code does the core wakeup, it sets TASK_RUNNING
and performs wakeup-preemption, so give is a more descriptive name.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.434609705@chello.nl
2011-04-14 08:52:40 +02:00
Peter Zijlstra
b84cb5df1f sched: Remove rq argument from ttwu_stat()
In order to call ttwu_stat() without holding rq->lock we must remove
its rq argument. Since we need to change rq stats, account to the
local rq instead of the task rq, this is safe since we have IRQs
disabled.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.394638826@chello.nl
2011-04-14 08:52:40 +02:00
Peter Zijlstra
e4a52bcb9a sched: Remove rq->lock from the first half of ttwu()
Currently ttwu() does two rq->lock acquisitions, once on the task's
old rq, holding it over the p->state fiddling and load-balance pass.
Then it drops the old rq->lock to acquire the new rq->lock.

By having serialized ttwu(), p->sched_class, p->cpus_allowed with
p->pi_lock, we can now drop the whole first rq->lock acquisition.

The p->pi_lock serializing concurrent ttwu() calls protects p->state,
which we will set to TASK_WAKING to bridge possible p->pi_lock to
rq->lock gaps and serialize set_task_cpu() calls against
task_rq_lock().

The p->pi_lock serialization of p->sched_class allows us to call
scheduling class methods without holding the rq->lock, and the
serialization of p->cpus_allowed allows us to do the load-balancing
bits without races.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.354401150@chello.nl
2011-04-14 08:52:39 +02:00
Peter Zijlstra
8f42ced974 sched: Drop rq->lock from sched_exec()
Since we can now call select_task_rq() and set_task_cpu() with only
p->pi_lock held, and sched_exec() load-balancing has always been
optimistic, drop all rq->lock usage.

Oleg also noted that need_migrate_task() will always be true for
current, so don't bother calling that at all.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.314204889@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:39 +02:00
Peter Zijlstra
ab2515c4b9 sched: Drop rq->lock from first part of wake_up_new_task()
Since p->pi_lock now protects all things needed to call
select_task_rq() avoid the double remote rq->lock acquisition and rely
on p->pi_lock.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.273362517@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:38 +02:00
Peter Zijlstra
0122ec5b02 sched: Add p->pi_lock to task_rq_lock()
In order to be able to call set_task_cpu() while either holding
p->pi_lock or task_rq(p)->lock we need to hold both locks in order to
stabilize task_rq().

This makes task_rq_lock() acquire both locks, and have
__task_rq_lock() validate that p->pi_lock is held. This increases the
locking overhead for most scheduler syscalls but allows reduction of
rq->lock contention for some scheduler hot paths (ttwu).

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.232781355@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:38 +02:00
Peter Zijlstra
2acca55ed9 sched: Also serialize ttwu_local() with p->pi_lock
Since we now serialize ttwu() using p->pi_lock, we also need to
serialize ttwu_local() using that, otherwise, once we drop the
rq->lock from ttwu() it can race with ttwu_local().

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.192366907@chello.nl
2011-04-14 08:52:37 +02:00
Peter Zijlstra
a8e4f2eaec sched: Delay task_contributes_to_load()
In prepratation of having to call task_contributes_to_load() without
holding rq->lock, we need to store the result until we do and can
update the rq accounting accordingly.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152729.151523907@chello.nl
2011-04-14 08:52:37 +02:00
Peter Zijlstra
3fe1698b7f sched: Deal with non-atomic min_vruntime reads on 32bits
In order to avoid reading partial updated min_vruntime values on 32bit
implement a seqcount like solution.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:37 +02:00
Peter Zijlstra
74f8e4b233 sched: Remove rq argument to sched_class::task_waking()
In preparation of calling this without rq->lock held, remove the
dependency on the rq argument.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:36 +02:00
Peter Zijlstra
7608dec2ce sched: Drop the rq argument to sched_class::select_task_rq()
In preparation of calling select_task_rq() without rq->lock held, drop
the dependency on the rq argument.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:36 +02:00
Peter Zijlstra
013fdb8086 sched: Serialize p->cpus_allowed and ttwu() using p->pi_lock
Currently p->pi_lock already serializes p->sched_class, also put
p->cpus_allowed and try_to_wake_up() under it, this prepares the way
to do the first part of ttwu() without holding rq->lock.

By having p->sched_class and p->cpus_allowed serialized by p->pi_lock,
we prepare the way to call select_task_rq() without holding rq->lock.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152728.990364093@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:35 +02:00
Peter Zijlstra
fd2f4419b4 sched: Provide p->on_rq
Provide a generic p->on_rq because the p->se.on_rq semantics are
unfavourable for lockless wakeups but needed for sched_fair.

In particular, p->on_rq is only cleared when we actually dequeue the
task in schedule() and not on any random dequeue as done by things
like __migrate_task() and __sched_setscheduler().

This also allows us to remove p->se usage from !sched_fair code.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.949545047@chello.nl
2011-04-14 08:52:35 +02:00
Peter Zijlstra
d7c01d27ab sched: Clean up ttwu() stats
Collect all ttwu() stat code into a single function and ensure its
always called for an actual wakeup (changing p->state to
TASK_RUNNING).

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.908177058@chello.nl
2011-04-14 08:52:34 +02:00
Peter Zijlstra
893633817f sched: Change the ttwu() success details
try_to_wake_up() would only return a success when it would have to
place a task on a rq, change that to every time we change p->state to
TASK_RUNNING, because that's the real measure of wakeups.

This results in that success is always true for the tracepoints.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.866866929@chello.nl
2011-04-14 08:52:34 +02:00
Peter Zijlstra
c2f7115e2e sched: Move wq_worker_waking to the correct site
wq_worker_waking_up() needs to match wq_worker_sleeping(), since the
latter is only called on deactivate, move the former near activate.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/n/top-t3m7n70n9frmv4pv2n5fwmov@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:33 +02:00
Peter Zijlstra
c6eb3dda25 mutex: Use p->on_cpu for the adaptive spin
Since we now have p->on_cpu unconditionally available, use it to
re-implement mutex_spin_on_owner.

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.826338173@chello.nl
2011-04-14 08:52:33 +02:00
Peter Zijlstra
3ca7a440da sched: Always provide p->on_cpu
Always provide p->on_cpu so that we can determine if its on a cpu
without having to lock the rq.

Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152728.785452014@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:52:32 +02:00
Ingo Molnar
a4c98f8bbe Merge branch 'linus' into sched/locking
Merge reason: Pick up this upstream commit:

  6631e635c6: block: don't flush plugged IO on forced preemtion scheduling

As it modifies the scheduler and we'll queue up dependent patches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14 08:51:07 +02:00
Linus Torvalds
6631e635c6 block: don't flush plugged IO on forced preemtion scheduling
We really only want to unplug the pending IO when the process actually
goes to sleep.  So move the test for flushing the plug up to the place
where we actually deactivate the task - where we have properly checked
for preemption and for the process really sleeping.

Acked-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-13 08:08:20 -07:00
Jens Axboe
94b5eb28b4 block: fixup block IO unplug trace call
It was removed with the on-stack plugging, readd it and track the
depth of requests added when flushing the plug.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:12:19 +02:00
Jens Axboe
d9c9783317 block: remove block_unplug_timer() trace point
We no longer have an unplug timer running, so no point in keeping
the trace point.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-04-12 10:06:33 +02:00
Shriram Rajagopalan
d419e4c0f7 fix XEN_SAVE_RESTORE Kconfig dependencies
Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-04-11 22:54:48 +02:00
Rafael J. Wysocki
1f112cee07 PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS
Xen save/restore is going to use hibernate device callbacks for
quiescing devices and putting them back to normal operations and it
would need to select CONFIG_HIBERNATION for this purpose.  However,
that also would cause the hibernate interfaces for user space to be
enabled, which might confuse user space, because the Xen kernels
don't support hibernation.  Moreover, it would be wasteful, as it
would make the Xen kernels include a substantial amount of code that
they would never use.

To address this issue introduce new power management Kconfig option
CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code
that is necessary for the hibernate device callbacks to work and make
CONFIG_HIBERNATION select it.  Then, Xen save/restore will be able to
select CONFIG_HIBERNATE_CALLBACKS without dragging the entire
hibernate code along with it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
2011-04-11 22:54:42 +02:00
Peter Zijlstra
60495e7760 sched: Dynamic sched_domain::level
Remove the SD_LV_ enum and use dynamic level assignments.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.969433965@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:32 +02:00
Peter Zijlstra
54ab4ff431 sched: Move sched domain storage into the topology list
In order to remove the last dependency on the statid domain levels,
move the sd_data storage into the topology structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.924926412@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:31 +02:00
Peter Zijlstra
d069b916f7 sched: Reverse the topology list
In order to get rid of static sched_domain::level assignments, reverse
the topology iteration.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.876506131@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:29 +02:00
Peter Zijlstra
2c402dc3bb sched: Unify the sched_domain build functions
Since all the __build_$DOM_sched_domain() functions do pretty much the
same thing, unify them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.826347257@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:27 +02:00
Peter Zijlstra
eb7a74e6cd sched: Stuff the sched_domain creation in a data-structure
In order to make the topology contruction fully dynamic, remove the
still hard-coded list of possible domains and stick them in a
data-structure.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.770335383@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:26 +02:00
Peter Zijlstra
d3081f52f2 sched: Create proper cpu_$DOM_mask() functions
In order to unify the sched domain creation more, create proper
cpu_$DOM_mask() functions for those domains that didn't already have
one.

Use the sched_domains_tmpmask for the weird NUMA domain span.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.717702108@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:09:24 +02:00
Peter Zijlstra
4cb988395d sched: Avoid allocations in sched_domain_debug()
Since we're all serialized by sched_domains_mutex we can use
sched_domains_tmpmask and avoid having to do allocations. This means
we can use sched_domains_debug() for cpu_attach_domain() again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.664347467@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 14:05:00 +02:00
Peter Zijlstra
f96225fd51 sched: Create persistent sched_domains_tmpmask
Since sched domain creation is fully serialized by the
sched_domains_mutex we can create a single persistent tmpmask to use
during domain creation.

This removes the need for s_data::send_covered.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.607287405@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:23 +02:00
Peter Zijlstra
7dd04b7307 sched: Remove some dead code
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.553814623@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:22 +02:00
Peter Zijlstra
bf28b25326 sched: Remove nodemask allocation
There's only one nodemask user left so remove it with a direct
computation and save some memory and reduce some code-flow
complexity.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.505608966@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:22 +02:00
Peter Zijlstra
3bd65a80af sched: Simplify NODE/ALLNODES domain creation
Don't treat ALLNODES/NODE different for difference's sake. Simply
always create the ALLNODES domain and let the sd_degenerate() checks
kill it when its redundant. This simplifies the code flow.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.455464579@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:21 +02:00
Peter Zijlstra
a6c75f2f8d sched: Avoid using sd->level
Don't use sd->level for identifying properties of the domain.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:20 +02:00
Peter Zijlstra
822ff793c3 sched: Simplify the free path some
If we check the root_domain reference count we can see if its been
used or not, use this observation to simplify some of the return
paths.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.298339503@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:20 +02:00
Peter Zijlstra
dce840a087 sched: Dynamically allocate sched_domain/sched_group data-structures
Instead of relying on static allocations for the sched_domain and
sched_group trees, dynamically allocate and RCU free them.

Allocating this dynamically also allows for some build_sched_groups()
simplification since we can now (like with other simplifications) rely
on the sched_domain tree instead of hard-coded knowledge.

One tricky to note is that detach_destroy_domains() needs to hold
rcu_read_lock() over the entire tear-down, per-cpu is not sufficient
since that can lead to partial sched_group existance (could possibly
be solved by doing the tear-down backwards but this is much more
robust).

A concequence of the above is that we can no longer print the
sched_domain debug stuff from cpu_attach_domain() since that might now
run with preemption disabled (due to classic RCU etc.) and
sched_domain_debug() does some GFP_KERNEL allocations.

Another thing to note is that we now fully rely on normal RCU and not
RCU-sched, this is because with the new and exiting RCU flavours we
grew over the years BH doesn't necessarily hold off RCU-sched grace
periods (-rt is known to break this). This would in fact already cause
us grief since we do sched_domain/sched_group iterations from softirq
context.

This patch is somewhat larger than I would like it to be, but I didn't
find any means of shrinking/splitting this.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:19 +02:00
Peter Zijlstra
a9c9a9b6bf sched: Simplify sched_groups_power initialization
Again, instead of relying on knowing the possible domains and their
order, simply rely on the sched_domain tree and whatever domains are
present in there to initialize the sched_group cpu_power.

Note: we need to iterate the CPU mask backwards because of the
cpumask_first() condition for iterating up the tree. By iterating the
mask backwards we ensure all groups of a domain are set-up before
starting on the parent groups that rely on its children to be
completely done.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.187335414@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:19 +02:00
Peter Zijlstra
21d42ccfd6 sched: Simplify finding the lowest sched_domain
Instead of relying on knowing the build order and various CONFIG_
flags simply remember the bottom most sched_domain when we created the
domain hierarchy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.134511046@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:19 +02:00
Peter Zijlstra
1cf5190254 sched: Simplify sched_group creation
Instead of calling build_sched_groups() for each possible sched_domain
we might have created, note that we can simply iterate the
sched_domain tree and call it for each sched_domain present.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.077862519@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:18 +02:00
Peter Zijlstra
3739494e08 sched: Clean up some ALLNODES code
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122942.025636011@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:18 +02:00
Peter Zijlstra
cd4ea6ae39 sched: Change NODE sched_domain group creation
The NODE sched_domain is 'special' in that it allocates sched_groups
per CPU, instead of sharing the sched_groups between all CPUs.

While this might have some benefits on large NUMA and avoid remote
memory accesses when iterating the sched_groups, this does break
current code that assumes sched_groups are shared between all
sched_domains (since the dynamic cpu_power patches).

So refactor the NODE groups to behave like all other groups.

(The ALLNODES domain again shared its groups across the CPUs for some
reason).

If someone does measure a performance decrease due to this change we
need to revisit this and come up with another way to have both dynamic
cpu_power and NUMA work nice together.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.978111700@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:17 +02:00
Peter Zijlstra
a06dadbec5 sched: Simplify build_sched_groups()
Notice that the mask being computed is the same as the domain span we
just computed. By using the domain_span we can avoid some mask
allocations and computations.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.925028189@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:17 +02:00
Peter Zijlstra
d274cb30f4 sched: Simplify ->cpu_power initialization
The code in update_group_power() does what init_sched_groups_power()
does and more, so remove the special init_ code and call the generic
code instead.

Also move the sd->span_weight initialization because
update_group_power() needs it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.875856012@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:16 +02:00
Peter Zijlstra
c4a8849af9 sched: Remove obsolete arch_ prefixes
Non weak static functions clearly are not arch specific, so remove the
arch_ prefix.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110407122941.820460566@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 12:58:16 +02:00
Shaohua Li
f4ad9bd208 sched: Eliminate dead code from wakeup_gran()
calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.

Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 11:08:55 +02:00
Ken Chen
b30aef17f7 sched: Fix erroneous all_pinned logic
The scheduler load balancer has specific code to deal with cases of
unbalanced system due to lots of unmovable tasks (for example because of
hard CPU affinity). In those situation, it excludes the busiest CPU that
has pinned tasks for load balance consideration such that it can perform
second 2nd load balance pass on the rest of the system.

This all works as designed if there is only one cgroup in the system.

However, when we have multiple cgroups, this logic has false positives and
triggers multiple load balance passes despite there are actually no pinned
tasks at all.

The reason it has false positives is that the all pinned logic is deep in
the lowest function of can_migrate_task() and is too low level:

load_balance_fair() iterates each task group and calls balance_tasks() to
migrate target load. Along the way, balance_tasks() will also set a
all_pinned variable. Given that task-groups are iterated, this all_pinned
variable is essentially the status of last group in the scanning process.
Task group can have number of reasons that no load being migrated, none
due to cpu affinity. However, this status bit is being propagated back up
to the higher level load_balance(), which incorrectly think that no tasks
were moved.  It kick off the all pinned logic and start multiple passes
attempt to move load onto puller CPU.

To fix this, move the all_pinned aggregation up at the iterator level.
This ensures that the status is aggregated over all task-groups, not just
last one in the list.

Signed-off-by: Ken Chen <kenchen@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 11:08:54 +02:00
Ken Chen
b0432d8f16 sched: Fix sched-domain avg_load calculation
In function find_busiest_group(), the sched-domain avg_load isn't
calculated at all if there is a group imbalance within the domain. This
will cause erroneous imbalance calculation.

The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
will dump entire sds->max_load into imbalance variable, which is used
later on to migrate entire load from busiest CPU to the puller CPU.

This has two really bad effect:

1. stampede of task migration, and they won't be able to break out
   of the bad state because of positive feedback loop: large load
   delta -> heavier load migration -> larger imbalance and the cycle
   goes on.

2. severe imbalance in CPU queue depth.  This causes really long
   scheduling latency blip which affects badly on application that
   has tight latency requirement.

The fix is to have kernel calculate domain avg_load in both cases. This
will ensure that imbalance calculation is always sensible and the target
is usually half way between busiest and puller CPU.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 11:08:54 +02:00
Stephane Eranian
e566b76ed3 perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()
There is a bug in perf_event_enable_on_exec() when cgroup events are
active on a CPU: the cgroup events may be scheduled twice causing event
state corruptions which eventually may lead to kernel panics.

The reason is that the function needs to first schedule out the cgroup
events, just like for the per-thread events. The cgroup event are
scheduled back in automatically from the perf_event_context_sched_in()
function.

The patch also adds a WARN_ON_ONCE() is perf_cgroup_switch() to catch any
bogus state.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 11:07:55 +02:00
Randy Dunlap
f9fa0bc1fa signal.c: fix erroneous syscall kernel-doc
Fix erroneous syscall kernel-doc comments in kernel/signal.c.

Reported-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-08 11:05:24 -07:00
Linus Torvalds
8b9686ff4d Merge branches 'x86-fixes-for-linus', 'sched-fixes-for-linus', 'timers-fixes-for-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86-32, fpu: Fix FPU exception handling on non-SSE systems
  x86, hibernate: Initialize mmu_cr4_features during boot
  x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
  x86: visws: Fixup irq overhaul fallout

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Clean up rebalance_domains() load-balance interval calculation

* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
  rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Fix cpumask leak in __setup_irq()

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf probe: Fix listing incorrect line number with inline function
  perf probe: Fix to find recursively inlined function
  perf probe: Fix multiple --vars options behavior
  perf probe: Fix to remove redundant close
  perf probe: Fix to ensure function declared file
2011-04-07 12:12:58 -07:00
Linus Torvalds
42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Peter Zijlstra
49c022e657 sched: Clean up rebalance_domains() load-balance interval calculation
Instead of the possible multiple-evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case since it's implemented with a Hamming weight
function over a cpu bitmask which can be darn expensive for those
with big iron.

This also makes it cleaner, smaller and documents the code.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-05 10:29:36 +02:00
Randy Dunlap
41c57892a2 kernel/signal.c: add kernel-doc notation to syscalls
Add kernel-doc to syscalls in signal.c.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-04 17:51:46 -07:00
Randy Dunlap
5aba085ede kernel/signal.c: fix typos and coding style
General coding style and comment fixes; no code changes:

 - Use multi-line-comment coding style.
 - Put some function signatures completely on one line.
 - Hyphenate some words.
 - Spell Posix as POSIX.
 - Correct typos & spellos in some comments.
 - Drop trailing whitespace.
 - End sentences with periods.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-04 17:51:46 -07:00
Jason Baron
d430d3d7e6 jump label: Introduce static_branch() interface
Introduce:

static __always_inline bool static_branch(struct jump_label_key *key);

instead of the old JUMP_LABEL(key, label) macro.

In this way, jump labels become really easy to use:

Define:

        struct jump_label_key jump_key;

Can be used as:

        if (static_branch(&jump_key))
                do unlikely code

enable/disale via:

        jump_label_inc(&jump_key);
        jump_label_dec(&jump_key);

that's it!

For the jump labels disabled case, the static_branch() becomes an
atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
atomic_dec() operations. We show testing results for this change below.

Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.

Since we now require a 'struct jump_label_key *key', we can store a pointer into
the jump table addresses. In this way, we can enable/disable jump labels, in
basically constant time. This change allows us to completely remove the previous
hashtable scheme. Thanks to Peter Zijlstra for this re-write.

Testing:

I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
configurations, where tracepoints were disabled.

jump label configured in
avg: 815.6

jump label *not* configured in (using atomic reads)
avg: 800.1

jump label *not* configured in (regular reads)
avg: 803.4

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110316212947.GA8792@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Suggested-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-04 12:48:08 -04:00
Jiri Olsa
ee5e51f51b tracing: Avoid soft lockup in trace_pipe
running following commands:

  # enable the binary option
  echo 1 > ./options/bin
  # disable context info option
  echo 0 > ./options/context-info
  # tracing only events
  echo 1 > ./events/enable
  cat trace_pipe

plus forcing system to generate many tracing events,
is causing lockup (in NON preemptive kernels) inside
tracing_read_pipe function.

The issue is also easily reproduced by running ltp stress test.
(ftrace_stress_test.sh)

The reasons are:
 - bin/hex/raw output functions for events are set to
   trace_nop_print function, which prints nothing and
   returns TRACE_TYPE_HANDLED value
 - LOST EVENT trace do not handle trace_seq overflow

These reasons force the while loop in tracing_read_pipe
function never to break.

The attached patch fixies handling of lost event trace, and
changes trace_nop_print to print minimal info, which is needed
for the correct tracing_read_pipe processing.

v2 changes:
 - omit the cond_resched changes by trace_nop_print changes
 - WARN changed to WARN_ONCE and added info to be able
   to find out the culprit

v3 changes:
 - make more accurate patch comment

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-04 12:18:24 -04:00
Steven Rostedt
1813dc3776 tracing: Print trace_bprintk() formats for modules too
The file debugfs/tracing/printk_formats maps the addresses
to the formats that are used by trace_bprintk() so that userspace
tools can read the buffer and be able to decode trace_bprintk events
to get the format saved when reading the ring buffer directly.

This is because trace_bprintk() does not store the format into the
buffer, but just the address of the format, which is hidden in
the kernel memory.

But currently it only exports trace_bprintk()s from the kernel core
and not for modules. The modules need their formats exported
as well.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-04 12:18:24 -04:00
Steven Rostedt
0588fa30db tracing: Convert trace_printk() formats for module to const char *
The trace_printk() formats for modules do not show up in the
debugfs/tracing/printk_formats file. Only the formats that are
for trace_printk()s that are in the kernel core.

To facilitate the change to add trace_printk() formats from modules
into that file as well, we need to convert the structure that
holds the formats from char fmt[], into const char *fmt,
and allocate them separately.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-04 12:18:24 -04:00
Linus Torvalds
148086bb64 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix rebalance interval calculation
  sched, doc: Beef up load balancing description
  sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
2011-04-04 08:36:58 -07:00
Linus Torvalds
4da7e90e65 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Fix task_struct reference leak
  perf: Fix task context scheduling
  perf: mmap 512 kiB by default
  perf: Rebase max unprivileged mlock threshold on top of page size
  perf tools: Fix NO_NEWT=1 python build error
  perf symbols: Properly align symbol_conf.priv_size
  perf tools: Emit clearer message for sys_perf_event_open ENOENT return
  perf tools: Fixup exit path when not able to open events
  perf symbols: Fix vsyscall symbol lookup
  oprofile, x86: Allow setting EDGE/INV/CMASK for counter events
2011-04-04 08:36:40 -07:00
Richard Cochran
4352d9d44b ntp: fix non privileged system time shifting
The ADJ_SETOFFSET bit added in commit 094aa188 ("ntp: Add ADJ_SETOFFSET
mode bit") also introduced a way for any user to change the system time.
Sneaky or buggy calls to adjtimex() could set

    ADJ_OFFSET_SS_READ | ADJ_SETOFFSET

which would result in a successful call to timekeeping_inject_offset().
This patch fixes the issue by adding the capability check.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-04 08:31:23 -07:00
Xiaotian Feng
4f5058c3b7 genirq: Fix cpumask leak in __setup_irq()
The allocated cpumask should be freed in __setup_irq().

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1301744375-6812-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-02 21:26:20 +02:00
Anton Blanchard
c0bb9e45f3 kdump: Allow shrinking of kdump region to be overridden
On ppc64 the crashkernel region almost always overlaps an area of firmware.
This works fine except when using the sysfs interface to reduce the kdump
region. If we free the firmware area we are guaranteed to crash.

Rename free_reserved_phys_range to crash_free_reserved_phys_range and make
it a weak function so we can override it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-01 16:14:30 +11:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Peter Zijlstra
fd1edb3aa2 perf: Fix task_struct reference leak
sys_perf_event_open() had an imbalance in the number of task refs it
took causing memory leakage

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org # .37+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-31 13:02:56 +02:00
Frederic Weisbecker
20443384fe perf: Rebase max unprivileged mlock threshold on top of page size
Ensure we allow 512 kiB + 1 page for user control without
assuming a 4096 bytes page size.

Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org>
LKML-Reference: <1301535209-9679-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-31 13:02:54 +02:00
Sisir Koppaka
3436ae1298 sched: Fix rebalance interval calculation
The interval for checking scheduling domains if they are due to be
balanced currently depends on boot state NR_CPUS, which may not
accurately reflect the number of online CPUs at the time of check.

Thus replace NR_CPUS with num_online_cpus().

 (ed: Should only affect those who set NR_CPUS really high, such as 4096
      or so :-)

Signed-off-by: Sisir Koppaka <sisir.koppaka@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTikqHWid2Q93F5U5Qw5snJH8C5PXoa7J6=6hYO94@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-31 13:00:37 +02:00
Dario Faggioli
a51e919818 sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks
sched_setscheduler() (in sched.c) is called in order of changing the
scheduling policy and/or the real-time priority of a task. Thus,
if we find out that neither of those are actually being modified, it
is possible to return earlier and save the overhead of a full
deactivate+activate cycle of the task in question.

Beside that, if we have more than one SCHED_FIFO task with the same
priority on the same rq (which means they share the same priority queue)
having one of them changing its position in the priority queue because of
a sched_setscheduler (as it happens by means of the deactivate+activate)
that does not actually change the priority violates POSIX which states,
for SCHED_FIFO:

  "If a thread whose policy or priority has been modified by
   pthread_setschedprio() is a running thread or is runnable, the effect on
   its position in the thread list depends on the direction of the
   modification, as follows: a. <...> b. If the priority is unchanged, the
   thread does not change position in the thread list. c. <...>"

     http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html

 (ed: And the POSIX specification here does, briefly and somewhat unexpectedly,
      match what common sense tells us as well. )

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1300971618.3960.82.camel@Palantir>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-31 13:00:34 +02:00