linux-kernel-test/arch/i386/kernel
Ingo Molnar 0888f06ac9 [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
until current -rt kernels. The latencies were serious and unjustified by
system load, often in the milliseconds range.

After a patient and heroic multi-month effort of Fernando, where he
tested dozens of kernels, tried various configs, boot options,
test-patches of mine and provided latency traces of those incidents, the
following 'smoking gun' trace was captured by him:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided to not send an IPI that no CPU - due to
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

hence smp_mb__after_clear_bit() is defined as a simple compile barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! So the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit - and both now
reside in different memory addresses.

CPU idle wakeup very much depends on ordered memory ops, the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]

Fernando's dual-core Athlon64 system has a sufficiently advanced memory
ordering model so that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my testsystems is that i routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
..
acpi Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 2006-12-07 08:59:11 -08:00
cpu [CPUFREQ] longhaul compile fix. 2006-12-17 19:09:59 -05:00
.gitignore [PATCH] x86: gitignore some autogenerated files for i386 2006-02-14 16:09:35 -08:00
alternative.c [PATCH] paravirt: Patch inline replacements for paravirt intercepts 2006-12-07 02:14:08 +01:00
apic.c [PATCH] x86: Regard MSRs in lapic_suspend()/lapic_resume() 2006-12-07 02:14:11 +01:00
apm.c [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code 2006-12-22 08:55:51 -08:00
asm-offsets.c [PATCH] paravirt: header and stubs for paravirtualisation 2006-12-07 02:14:07 +01:00
bootflag.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
cpuid.c [PATCH] i386: change uses of f_{dentry, vfsmnt} to use f_path 2006-12-08 08:28:42 -08:00
crash_dump.c [PATCH] kdump: read previous kernel's memory 2006-01-10 08:01:28 -08:00
crash.c [PATCH] Kexec / Kdump: Unify elf note code 2006-12-07 08:39:46 -08:00
doublefault.c [PATCH] i386: cpu_relax() in crash.c and doublefault.c 2006-06-25 10:00:55 -07:00
e820.c [PATCH] compile error of register_memory() 2006-12-22 08:55:49 -08:00
early_printk.c
efi_stub.S [PATCH] x86: remove unused include from efi_stub.S 2006-09-26 08:48:56 -07:00
efi.c [PATCH] i386: call efi_get_time during suspend 2006-12-07 02:14:11 +01:00
entry.S Remove stack unwinder for now 2006-12-15 08:47:51 -08:00
head.S [PATCH] paravirt: Add startup infrastructure for paravirtualization 2006-12-07 02:14:08 +01:00
hpet.c [PATCH] i386: add missing iounmap in i386 hpet clocksource code 2006-12-07 02:14:02 +01:00
i386_ksyms.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
i387.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
i8237.c [PATCH] mmc (mainly): add "or later" clause to licence statement. 2006-10-01 00:39:23 -07:00
i8253.c [PATCH] i386 Time: Avoid PIT SMP lockups 2006-10-17 08:18:42 -07:00
i8259.c [PATCH] paravirt: header and stubs for paravirtualisation 2006-12-07 02:14:07 +01:00
init_task.c [PATCH] nsproxy: move init_nsproxy into kernel/nsproxy.c 2006-10-02 07:57:20 -07:00
io_apic.c [PATCH] i386: Fix io_apic.c warning 2006-12-09 21:33:36 +01:00
ioport.c [PATCH] i386: use thread_info flags for debug regs and IO bitmaps 2006-07-09 18:47:12 -07:00
irq.c [PATCH] genirq: clean up irq-flow-type naming 2006-10-17 08:18:45 -07:00
kprobes.c [PATCH] kprobes: enable booster on the preemptible kernel 2006-12-07 08:39:38 -08:00
ldt.c [PATCH] i386: remove default_ldt, and simplify ldt-setting. 2006-12-07 02:14:01 +01:00
machine_kexec.c [PATCH] i386: Avoid overwriting the current pgd (V4, i386) 2006-09-26 10:52:38 +02:00
Makefile [PATCH] paravirt: Add startup infrastructure for paravirtualization 2006-12-07 02:14:08 +01:00
mca.c [PATCH] i386: replace kmalloc+memset with kzalloc 2006-12-07 02:14:19 +01:00
microcode.c [PATCH] microcode: fix mc_cpu_notifier section warning 2006-12-22 08:55:50 -08:00
module.c [PATCH] Generic BUG for i386 2006-12-08 08:28:39 -08:00
mpparse.c [PATCH] x86-64: remove remaining pc98 code 2006-12-07 02:14:19 +01:00
msr.c [PATCH] i386: change uses of f_{dentry, vfsmnt} to use f_path 2006-12-08 08:28:42 -08:00
nmi.c [PATCH] x86: Fix boot hang due to nmi watchdog init code 2006-12-09 21:33:35 +01:00
numaq.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
paravirt.c [PATCH] paravirt: Add MMU virtualization to paravirt_ops 2006-12-07 02:14:08 +01:00
pci-dma.c [PATCH] i386: replace kmalloc+memset with kzalloc 2006-12-07 02:14:19 +01:00
process.c [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code 2006-12-22 08:55:51 -08:00
ptrace.c [PATCH] ptrace: Fix EFL_OFFSET value according to i386 pda changes 2006-12-22 08:55:51 -08:00
quirks.c [PATCH] x86: Fix verify_quirk_intel_irqbalance() 2006-12-09 21:33:35 +01:00
reboot_fixups.c [PATCH] i386: Remove printk about reboot fixups at reboot 2006-04-09 11:53:53 -07:00
reboot.c [PATCH] arch/i386/kernel/reboot.c should #include <linux/reboot.h> 2006-12-07 08:39:44 -08:00
relocate_kernel.S [PATCH] i386: Avoid overwriting the current pgd (V4, i386) 2006-09-26 10:52:38 +02:00
scx200.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
setup.c [PATCH] compile error of register_memory() 2006-12-22 08:55:49 -08:00
sigframe.h [PATCH] __user annotations for pointers in i386 sigframe 2005-09-09 10:31:59 -07:00
signal.c [PATCH] i386: Use %gs as the PDA base-segment in the kernel 2006-12-07 02:14:02 +01:00
smp.c Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 2006-12-07 08:59:11 -08:00
smpboot.c [PATCH] arch/i386/kernel/smpboot.c: remove unneeded ifdef 2006-12-13 09:05:46 -08:00
srat.c [PATCH] convert i386 Summit subarch to use SRAT info for apicid_to_node calls 2006-09-29 09:18:03 -07:00
summit.c
sys_i386.c [PATCH] provide kernel_execve on all architectures 2006-10-02 07:57:23 -07:00
syscall_table.S [PATCH] epoll_pwait() 2006-10-11 11:14:21 -07:00
sysenter.c Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 2006-12-07 08:59:11 -08:00
time_hpet.c [PATCH] i386: Add iounmap in error paths in hpet code 2006-12-07 02:14:02 +01:00
time.c [PATCH] paravirt: header and stubs for paravirtualisation 2006-12-07 02:14:07 +01:00
topology.c [PATCH] i386: change the 'no_control' field to 'hotpluggable' in the struct cpu 2006-12-07 02:14:10 +01:00
trampoline.S
traps.c Remove stack unwinder for now 2006-12-15 08:47:51 -08:00
tsc.c Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6 2006-12-07 08:59:11 -08:00
vm86.c [PATCH] i386: Update sys_vm86 to cope with changed pt_regs and %gs usage 2006-12-07 02:14:03 +01:00
vmlinux.lds.S [PATCH] x86: Work around gcc 4.2 over aggressive optimizer 2006-12-09 21:33:36 +01:00
vsyscall-int80.S
vsyscall-note.S
vsyscall-sigreturn.S [PATCH] Mark unwind info for signal trampolines in vDSOs 2006-03-31 12:18:52 -08:00
vsyscall-sysenter.S [PATCH] vdso: randomize the i386 vDSO by moving it into a vma 2006-06-27 17:32:38 -07:00
vsyscall.lds.S [PATCH] vDSO hash-style fix 2006-07-31 13:28:43 -07:00
vsyscall.S