Commit Graph

12932 Commits

Author SHA1 Message Date
Ingo Molnar
301120396b perf events, x86: Add Westmere stalled-cycles-frontend/backend events
Extend the Intel Westmere PMU driver with definitions for generic front-end and
back-end stall events.

( These are only approximations. )

Reported-by: David Ahern <dsahern@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n008io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-29 16:22:37 +02:00
Ingo Molnar
91fc4cc000 perf, x86: Add new stalled cycles events for Intel and AMD CPUs
Extend the Intel and AMD event definitions with generic front-end and
back-end stall events.

( These are only approximations - suggestions are welcome for better events. )

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n001io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-29 14:24:15 +02:00
Ingo Molnar
8f62242246 perf events: Add generic front-end and back-end stalled cycle event definitions
Add two generic hardware events: front-end and back-end stalled cycles.

These events measure conditions when the CPU is executing code but its
capabilities are not fully utilized. Understanding such situations and
analyzing them is an important sub-task of code optimization workflows.

Both events limit performance: most front end stalls tend to be caused
by branch misprediction or instruction fetch cachemisses, backend
stalls can be caused by various resource shortages or inefficient
instruction scheduling.

Front-end stalls are the more important ones: code cannot run fast
if the instruction stream is not being kept up.

An over-utilized back-end can cause front-end stalls and thus
has to be kept an eye on as well.

The exact composition is very program logic and instruction mix
dependent.

We use the terms 'stall', 'front-end' and 'back-end' loosely and
try to use the best available events from specific CPUs that
approximate these concepts.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-29 14:23:58 +02:00
Ingo Molnar
8a850cadca perf event, x86: Use better stalled cycles metric
Use the UOPS_EXECUTED.*,c=1,i=1 event on Intel CPUs - it is a rather
good indicator of CPU execution stalls, more sensitive and more inclusive
than the 0xa2 resource stalls event (which does not count nearly as many
stall types).

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n1eqio7hjpn2dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-28 08:39:33 +02:00
Ingo Molnar
5c543e3c44 perf events, x86: Mark constrant tables read mostly
Various constraint tables were not marked read-mostly.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-wpqwwvmhxucy5e718wnamjiv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 20:04:53 +02:00
Ingo Molnar
94403f8863 perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES
The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-fue11vymwqsoo5to72jxxjyl@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 20:04:53 +02:00
Ingo Molnar
7bd5fafeb4 Merge branch 'perf/urgent' into perf/stat
Merge reason: We want to queue up dependent changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 19:36:17 +02:00
Ingo Molnar
ec75a71634 perf events, x86: Work around the Nehalem AAJ80 erratum
On Nehalem CPUs the retired branch-misses event can be completely bogus,
when there are no branch-misses occuring. When there are a lot of branch
misses then the count is pretty accurate. Still, this leaves us with an
event that over-counts a lot.

Detect this erratum and work it around by using BR_MISP_EXEC.ANY events.
These will also count speculated branches but still it's a lot more
precise in practice than the architectural event.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-yyfg0bxo9jsqxd6a0ovfny27@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 19:34:34 +02:00
Peter Zijlstra
18a073a3ac perf, x86: Fix BTS condition
Currently the x86 backend incorrectly assumes that any BRANCH_INSN
with sample_period==1 is a BTS request. This is not true when we do
frequency driven profiling such as 'perf record -e branches'.

Solves this error:

  $ perf record -e branches ./array
  Error: sys_perf_event_open() syscall returned with 95 (Operation not supported).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: "Metzger, Markus T" <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-rd2y4ct71hjawzz6fpvsy9hg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-26 13:34:34 +02:00
Justin P. Mattock
fa7b69475a perf events, x86, P4: Fix typo in comment
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: trivial@kernel.org
Link: http://lkml.kernel.org/r/1303492132-3004-1-git-send-email-justinmattock@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-24 13:16:04 +02:00
Linus Torvalds
686c4cbb10 Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM: Add missing syscore_suspend() and syscore_resume() calls
  PM: Fix error code paths executed after failing syscore_suspend()
2011-04-23 22:35:16 -07:00
Peter Zijlstra
f4929bd372 perf, x86: Update/fix Intel Nehalem cache events
Change the Nehalem cache events to use retired memory instruction counters
(similar to Westmere), this greatly improves the provided stats.

Using:

main ()
{
        int i;

        for (i = 0; i < 1000000000; i++) {
                asm("mov (%%rsp), %%rbx;"
                    "mov %%rbx, (%%rsp);" : : : "rbx");
        }
}

We find:

 $ perf stat --repeat 10 -e instructions:u -e l1-dcache-loads:u -e l1-dcache-stores:u ./loop_1b_loads+stores
  Performance counter stats for './loop_1b_loads+stores' (10 runs):
      4,000,081,056 instructions:u           #      0.000 IPC ( +-   0.000% )
      4,999,502,846 l1-dcache-loads:u          ( +-   0.008% )
      1,000,034,832 l1-dcache-stores:u         ( +-   0.000% )
         1.565184942  seconds time elapsed   ( +-   0.005% )

The 5b is surprising - we'd expect 1b:

 $ perf stat --repeat 10 -e instructions:u -e r10b:u -e l1-dcache-stores:u ./loop_1b_loads+stores
  Performance counter stats for './loop_1b_loads+stores' (10 runs):
      4,000,081,054 instructions:u           #      0.000 IPC ( +-   0.000% )
      1,000,021,961 r10b:u                     ( +-   0.000% )
      1,000,030,951 l1-dcache-stores:u         ( +-   0.000% )
         1.565055422  seconds time elapsed   ( +-   0.003% )

Which this patch thus fixes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/n/tip-q9rtru7b7840tws75xzboapv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 13:50:27 +02:00
Cyrill Gorcunov
1ea5a6afd9 perf, x86: P4 PMU - Don't forget to clear cpuc->active_mask on overflow
It's not enough to simply disable event on overflow the
cpuc->active_mask should be cleared as well otherwise counter
may stall in "active" even in real being already disabled (which
potentially may lead to the situation that user may not use this
counter further).

Don pointed out that:

 " I also noticed this patch fixed some unknown NMIs
   on a P4 when I stressed the box".

Tested-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Link: http://lkml.kernel.org/r/1303398203-2918-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 10:21:34 +02:00
Cyrill Gorcunov
103b393481 perf, x86: P4 PMU -- Use perf_sample_data_init helper
Instead of opencoded assignments better to use
perf_sample_data_init helper.

Tested-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Link: http://lkml.kernel.org/r/1303398203-2918-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 10:20:04 +02:00
Ingo Molnar
eff430de53 Merge branch 'linus' into perf/core
Merge reason: Pick up upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 10:19:30 +02:00
Ingo Molnar
b52c55c6a2 x86, perf event: Turn off unstructured raw event access to offcore registers
Andi Kleen pointed out that the Intel offcore support patches were merged
without user-space tool support to the functionality:

 |
 | The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
 | user space bits were not. This made it impossible to set the extra mask
 | and actually do the OFFCORE profiling
 |

Andi submitted a preliminary patch for user-space support, as an
extension to perf's raw event syntax:

 |
 | Some raw events -- like the Intel OFFCORE events -- support additional
 | parameters. These can be appended after a ':'.
 |
 | For example on a multi socket Intel Nehalem:
 |
 |    perf stat -e r1b7:20ff -a sleep 1
 |
 | Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
 | that measures any access to DRAM on another socket.
 |

But this kind of usability is absolutely unacceptable - users should not
be expected to type in magic, CPU and model specific incantations to get
access to useful hardware functionality.

The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which specific
CPU model they are using, they can use the conceptual event and not some
model specific quirky hexa number.

We already have such generalization in place for CPU cache events,
and it's all very extensible.

"Offcore" events measure general DRAM access patters along various
parameters. They are particularly useful in NUMA systems.

We want to support them via generalized DRAM events: either as the
fourth level of cache (after the last-level cache), or as a separate
generalization category.

That way user-space support would be very obvious, memory access
profiling could be done via self-explanatory commands like:

  perf record -e dram ./myapp
  perf record -e dram-remote ./myapp

... to measure DRAM accesses or more expensive cross-node NUMA DRAM
accesses.

These generalized events would work on all CPUs and architectures that
have comparable PMU features.

( Note, these are just examples: actual implementation could have more
  sophistication and more parameter - as long as they center around
  similarly simple usecases. )

Now we do not want to revert *all* of the current offcore bits, as they
are still somewhat useful for generic last-level-cache events, implemented
in this commit:

  e994d7d23a: perf: Fix LLC-* events on Intel Nehalem/Westmere

But we definitely do not yet want to expose the unstructured raw events
to user-space, until better generalization and usability is implemented
for these hardware event features.

( Note: after generalization has been implemented raw offcore events can be
  supported as well: there can always be an odd event that is marginally
  useful but not useful enough to generalize. DRAM profiling is definitely
  *not* such a category so generalization must be done first. )

Furthermore, PERF_TYPE_RAW access to these registers was not intended
to go upstream without proper support - it was a side-effect of the above
e994d7d23a commit, not mentioned in the changelog.

As v2.6.39 is nearing release we go for the simplest approach: disable
the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
kernel and becomes an ABI.

Once proper structure is implemented for these hardware events and users
are offered usable solutions we can revisit this issue.

Reported-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 10:02:53 +02:00
Andi Kleen
b2508e828d perf: Support Xeon E7's via the Westmere PMU driver
There's a new model number public, 47, for Xeon E7 (aka Westmere EX).

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: a.p.zijlstra@chello.nl
Link: http://lkml.kernel.org/r/1303429715-10202-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-22 08:27:29 +02:00
David Rientjes
7a6c654782 x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS
The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y
when NUMA emulation is enabled is currently broken because it does
not iterate through every emulated node and bind cpus that have
affinity to it.

NUMA emulation should bind each cpu to every local node to
accurately represent the true NUMA topology of the underlying
machine.

debug_cpumask_set_cpu() needs to be fixed at the same time so
that the debugging information that it emits shows the new
cpumask of the node being assigned when the cpu is being added
or removed.

It can now take responsibility of setting or clearing the cpu
itself to remove the need for duplicate code.

Also change its last parameter, "enable", to have the correct bool
type since it can only be true or false.

 -v2: Fix the return statements, by Kosaki Motohiro

Acked-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918470.12634@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21 11:31:00 +02:00
David Rientjes
37f8527dbf Revert "x86, NUMA: Fix fakenuma boot failure"
Andreas Herrmann reported that 7d6b46707f ("x86, NUMA: Fix fakenuma
boot failure") causes certain physical NUMA topologies (for example
AMD Magny-Cours) to move sibling cpus to a single node when in reality
they are in separate domains.

This may result in some nodes being completely void of cpus, which
doesn't accurately represent the correct topology. The system will
boot, but will have suboptimal NUMA performance.

This commit was intended as a fix for NUMA emulation, but should
not cause a regression for real NUMA machines as a side effect.

( There will be a separate fix for the numa-debug code, which
  will not affect physical topologies. )

Reported-by: Andreas Herrmann <herrmann.der.user@googlemail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918110.12634@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21 11:30:59 +02:00
Linus Torvalds
8653b3f1d5 Merge branch 'stable/bug-fixes-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug-fixes-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: mask_rw_pte: do not apply the early_ioremap checks on x86_32
  xen: do not create the extra e820 region at an addr lower than 4G
2011-04-20 17:40:25 -07:00
Stefano Stabellini
ee176455e2 xen: mask_rw_pte: do not apply the early_ioremap checks on x86_32
The two "is_early_ioremap_ptep" checks in mask_rw_pte are only used on
x86_64, in fact early_ioremap is not used at all to setup the initial
pagetable on x86_32.
Moreover on x86_32 the two checks are wrong because the range
pgt_buf_start..pgt_buf_end initially should be mapped RW because
the pages in the range are not pagetable pages yet and haven't been
cleared yet. Afterwards considering the pgt_buf_start..pgt_buf_end is
part of the initial mapping, xen_alloc_pte is capable of turning
the ptes RO when they become pagetable pages.

Fix the issue and improve the readability of the code providing two
different implementation of mask_rw_pte for x86_32 and x86_64.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-04-20 09:43:13 -04:00
Stefano Stabellini
24bdb0b62c xen: do not create the extra e820 region at an addr lower than 4G
Do not add the extra e820 region at a physical address lower than 4G
because it breaks e820_end_of_low_ram_pfn().

It is OK for us to move the xen_extra_mem_start up and down because this
is the index of the memory that can be ballooned in/out - it is memory
not available to the kernel during bootup.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-04-20 09:04:40 -04:00
Rafael J. Wysocki
19234c0819 PM: Add missing syscore_suspend() and syscore_resume() calls
Device suspend/resume infrastructure is used not only by the suspend
and hibernate code in kernel/power, but also by APM, Xen and the
kexec jump feature.  However, commit 40dc166cb5
(PM / Core: Introduce struct syscore_ops for core subsystems PM)
failed to add syscore_suspend() and syscore_resume() calls to that
code, which generally leads to breakage when the features in question
are used.

To fix this problem, add the missing syscore_suspend() and
syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c
and drivers/xen/manage.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
2011-04-20 00:36:11 +02:00
Linus Torvalds
9d914b3ef3 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, gart: Make sure GART does not map physmem above 1TB
  x86, gart: Set DISTLBWALKPRB bit always
  x86, gart: Convert spaces to tabs in enable_gart_translation
2011-04-19 10:58:13 -07:00
Linus Torvalds
96ad999918 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf, x86: Fix AMD family 15h FPU event constraints
  perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
  perf evsel: Fix use of inherit
  perf hists browser: Fix seg fault when annotate null symbol
2011-04-19 10:56:02 -07:00
Robert Richter
c8e5910edf perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
an extra pointer chase and data cache hit.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302913676-14352-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:08:12 +02:00
Robert Richter
855357a217 perf, x86: Fix AMD family 15h FPU event constraints
Depending on the unit mask settings some FPU events may be scheduled
only on cpu counter #3. This patch fixes this.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@googlemail.com>
Link: http://lkml.kernel.org/r/1302913676-14352-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:07:55 +02:00
Andre Przywara
83112e688f perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
With AMD cpu family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to proper count data cache misses.

Now we set this bit for all families and models. In case a PMU does
not implement a unit mask for event 0x041 the bit is ignored.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302913676-14352-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 10:07:54 +02:00
Ingo Molnar
68d2cf25d3 Merge branch 'perf/urgent' into perf/core
Merge reason: we'll be queueing up dependent changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-19 07:56:17 +02:00
Joerg Roedel
665d3e2af8 x86, gart: Make sure GART does not map physmem above 1TB
The GART can only map physical memory below 1TB. Make sure
the gart driver in the kernel does not try to map memory
above 1TB.

Cc: <stable@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-5-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-18 09:26:49 -07:00
Joerg Roedel
c34151a742 x86, gart: Set DISTLBWALKPRB bit always
The DISTLBWALKPRB bit must be set for the GART because the
gatt table is mapped UC. But the current code does not set
the bit at boot when the BIOS setup the aperture correctly.
Fix that by setting this bit when enabling the GART instead
of the other places.

Cc: <stable@kernel.org>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-4-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-18 09:26:48 -07:00
Joerg Roedel
af289bfe15 x86, gart: Convert spaces to tabs in enable_gart_translation
Probably by copy&paste this function was indented by spaces.
Convert this to tabs.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-3-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-18 09:26:48 -07:00
Linus Torvalds
fdfc552abe Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', 'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec()
  perf: Fix a build error with some GCC versions

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Fix erroneous all_pinned logic
  sched: Fix sched-domain avg_load calculation

* 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  RTC: rtc-mrst: follow on to the change of rtc_device_register()
  RTC: add missing "return 0" in new alarm func for rtc-bfin.c
  RTC: Fix s3c compile error due to missing s3c_rtc_setpie
  RTC: Fix early irqs caused by calling rtc_set_alarm too early

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, amd: Disable GartTlbWlkErr when BIOS forgets it
  x86, NUMA: Fix fakenuma boot failure
  x86/mrst: Fix boot crash caused by incorrect pin to irq mapping
  x86/ce4100: Add reg property to bridges
2011-04-16 09:45:08 -07:00
Joerg Roedel
5bbc097d89 x86, amd: Disable GartTlbWlkErr when BIOS forgets it
This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do is (or is just too old). Letting
these errors enabled can cause a sync-flood on the CPU
causing a reboot.

The AMD BKDG recommends disabling GART TLB Wlk Error completely.

This patch is the fix for

	https://bugzilla.kernel.org/show_bug.cgi?id=33012

on my machine.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/20110415131152.GJ18463@8bytes.org
Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-04-15 16:03:16 -07:00
KOSAKI Motohiro
7d6b46707f x86, NUMA: Fix fakenuma boot failure
Currently, numa=fake boot parameter is broken. If it's used,
kernel may panic due to devide by zero error depending on CPU
configuration

Call Trace:
 [<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
 [<ffffffff81086aff>] ? local_clock+0x6f/0x80
 [<ffffffff81050533>] load_balance+0xa3/0x600
 [<ffffffff81050f53>] idle_balance+0xf3/0x180
 [<ffffffff81550092>] schedule+0x722/0x7d0
 [<ffffffff81550538>] ? wait_for_common+0x128/0x190
 [<ffffffff81550a65>] schedule_timeout+0x265/0x320
 [<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
 [<ffffffff81550538>] ? wait_for_common+0x128/0x190
 [<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
 [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff81550540>] wait_for_common+0x130/0x190
 [<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
 [<ffffffff8155067d>] wait_for_completion+0x1d/0x20
 [<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
 [<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
 [<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
 [<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
 [<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
 [<ffffffff81df3d07>] kernel_init+0xc3/0x182
 [<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
 [<ffffffff81df3c44>] ? start_kernel+0x400/0x400
 [<ffffffff8155d5e0>] ? gs_change+0x13/0x13

The divede by zero is caused by the following line,
group->cpu_power==0:

 kernel/sched_fair.c::update_sg_lb_stats()
        /* Adjust by relative CPU power of the group */
        sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

This regression was caused by commit e23bba6044 ("x86-64, NUMA: Unify
emulated distance mapping") because it changes cpu -> node
mapping in the process of dropping fake_physnodes().

  old) all cpus are assinged node 0
  now) cpus are assigned round robin
       (the logic is implemented by numa_init_array())

  Note: The change in behavior only happens if the system doesn't
        have neither ACPI SRAT table nor AMD northbridge NUMA
	information.

Round robin assignment doesn't work because init_numa_sched_groups_power()
assumes all logical cpus in the same physical cpu share the same node
(then it only accounts for group_first_cpu()), and the simple round robin
breaks the above assumption.

Thus, this patch implements a reassignment of node-ids if buggy firmware
or numa emulation makes wrong cpu node map. Tt enforce all logical cpus
in the same physical cpu share the same node.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20110415203928.1303.A69D9226@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-15 20:28:19 +02:00
Linus Torvalds
aaa119a3d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  fix XEN_SAVE_RESTORE Kconfig dependencies
  PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS
2011-04-12 17:18:05 -07:00
Jacob Pan
9d90e49da5 x86/mrst: Fix boot crash caused by incorrect pin to irq mapping
Moorestown systems crash on boot because the secondary CPU
clockevent (apbt1) will fail to request irq#1, which does not
have ioapic chip in its irq_desc[] entry.

Background:

Moorestown platform does not have ISA bus nor legacy IRQs. It
reuses the range of legacy IRQs for regular device interrupts.
The routing information of early system device IRQs (timers) are
obtained from firmware provided SFI tables. We reuse/fake MP
configuration table to facilitate IRQ setup with IOAPIC.

Maintaining a 1:1 mapping of IOAPIC pin (RTE entry) and IRQ#
makes routing information clean and easy to understand on
Moorestown. Though optional.

This patch allows SFI timer and vRTC IRQ to be treated as ISA
IRQ so that pin2irq mapping will be 1:1.

Also fixed MP table type and use macros to clearly set MP IRQ
entries. As a result, apbt timer and RTC interrupts on
Moorestown are within legacy IRQ range:

 # cat /proc/interrupts
            CPU0       CPU1
   0:      11249          0   IO-APIC-edge      apbt0
   1:          0      12271   IO-APIC-edge      apbt1
   8:        887          0   IO-APIC-fasteoi   dw_spi
  13:          0          0   IO-APIC-fasteoi   INTEL_MID_DMAC2
  14:          0          0   IO-APIC-fasteoi   rtc0

Further discussion of this patch can be found at:

  https://lkml.org/lkml/2010/6/10/70

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1302286980-21139-1-git-send-email-jacob.jun.pan@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-12 08:38:52 +02:00
Linus Torvalds
16ad56972c Merge branch 'stable/bug-fixes-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug-fixes-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: Allow PV-OPS kernel to detect whether XSAVE is supported
  xen: just completely disable XSAVE
  xen/debug: Don't be so verbose with WARN on 1-1 mapping errors.
  xen: events: fix error checks in bind_*_to_irqhandler()
2011-04-11 20:01:04 -07:00
Shriram Rajagopalan
d419e4c0f7 fix XEN_SAVE_RESTORE Kconfig dependencies
Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-04-11 22:54:48 +02:00
Sebastian Andrzej Siewior
30d746c680 x86/ce4100: Add reg property to bridges
without the reg property Ben's new code won't find the PCI & ISA
bridge and the devices won't get the DT-node attached.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: monstr@monstr.eu
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20110407121315.GA9204@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11 17:37:02 +02:00
Linus Torvalds
8b9686ff4d Merge branches 'x86-fixes-for-linus', 'sched-fixes-for-linus', 'timers-fixes-for-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86-32, fpu: Fix FPU exception handling on non-SSE systems
  x86, hibernate: Initialize mmu_cr4_features during boot
  x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
  x86: visws: Fixup irq overhaul fallout

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: Clean up rebalance_domains() load-balance interval calculation

* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
  rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  genirq: Fix cpumask leak in __setup_irq()

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf probe: Fix listing incorrect line number with inline function
  perf probe: Fix to find recursively inlined function
  perf probe: Fix multiple --vars options behavior
  perf probe: Fix to remove redundant close
  perf probe: Fix to ensure function declared file
2011-04-07 12:12:58 -07:00
Feng Tang
09552b2696 x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
The sfi_mrtc_array[] only gets initialized when the sfi mrtc
table is parsed, so the vrtc_paddr should be initalized after it
too.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1302140389-27603-1-git-send-email-feng.tang@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-07 11:27:42 +02:00
Hans Rosenfeld
f994d99cf1 x86-32, fpu: Fix FPU exception handling on non-SSE systems
On 32bit systems without SSE (that is, they use FSAVE/FRSTOR for FPU
context switches), FPU exceptions in user mode cause Oopses, BUGs,
recursive faults and other nasty things:

fpu exception: 0000 [#1]
last sysfs file: /sys/power/state
Modules linked in: psmouse evdev pcspkr serio_raw [last unloaded: scsi_wait_scan]

Pid: 1638, comm: fxsave-32-excep Not tainted 2.6.35-07798-g58a992b-dirty #633 VP3-596B-DD/VT82C597
EIP: 0060:[<c1003527>] EFLAGS: 00010202 CPU: 0
EIP is at math_error+0x1b4/0x1c8
EAX: 00000003 EBX: cf9be7e0 ECX: 00000000 EDX: cf9c5c00
ESI: cf9d9fb4 EDI: c1372db3 EBP: 00000010 ESP: cf9d9f1c
DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
Process fxsave-32-excep (pid: 1638, ti=cf9d8000 task=cf9be7e0 task.ti=cf9d8000)
Stack:
00000000 00000301 00000004 00000000 00000000 cf9d3000 cf9da8f0 00000001
<0> 00000004 cf9b6b60 c1019a6b c1019a79 00000020 00000242 000001b6 cf9c5380
<0> cf806b40 cf791880 00000000 00000282 00000282 c108a213 00000020 cf9c5380
Call Trace:
[<c1019a6b>] ? need_resched+0x11/0x1a
[<c1019a79>] ? should_resched+0x5/0x1f
[<c108a213>] ? do_sys_open+0xbd/0xc7
[<c108a213>] ? do_sys_open+0xbd/0xc7
[<c100353b>] ? do_coprocessor_error+0x0/0x11
[<c12d5965>] ? error_code+0x65/0x70
Code: a8 20 74 30 c7 44 24 0c 06 00 03 00 8d 54 24 04 89 d9 b8 08 00 00 00 e8 9b 6d 02 00 eb 16 8b 93 5c 02 00 00 eb 05 e9 04 ff ff ff <9b> dd 32 9b e9 16 ff ff ff 81 c4 84 00 00 00 5b 5e 5f 5d c3 c6
EIP: [<c1003527>] math_error+0x1b4/0x1c8 SS:ESP 0068:cf9d9f1c

This usually continues in slight variations until the system is reset.

This bug was introduced by commit 58a992b9cb:
	x86-32, fpu: Rewrite fpu_save_init()

Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Link: http://lkml.kernel.org/r/1302106003-366952-1-git-send-email-hans.rosenfeld@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2011-04-06 16:53:01 -07:00
H. Peter Anvin
4da9484bde x86, hibernate: Initialize mmu_cr4_features during boot
Restore the initialization of mmu_cr4_features during boot, which was
removed without comment in checkin e5f15b45dd

x86: Cleanup highmap after brk is concluded

thereby breaking resume from hibernate.  This restores previous
functionality in approximately the same place, and corrects the
reading of %cr4 on pre-CPUID hardware (%cr4 exists if and only if
CPUID is supported.)

However, part of the problem is that the hibernate suspend/resume
sequence should manage the save/restore of %cr4 explicitly.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <201104020154.57136.rjw@sisk.pl>
2011-04-06 13:10:02 -07:00
Shan Haitao
947ccf9c3c xen: Allow PV-OPS kernel to detect whether XSAVE is supported
Xen fails to mask XSAVE from the cpuid feature, despite not historically
supporting guest use of XSAVE.  However, now that XSAVE support has been
added to Xen, we need to reliably detect its presence.

The most reliable way to do this is to look at the OSXSAVE feature in
cpuid which is set iff the OS (Xen, in this case), has set
CR4.OSXSAVE.

[ Cleaned up conditional a bit. - Jeremy ]

Signed-off-by: Shan Haitao <haitao.shan@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-04-06 08:31:13 -04:00
Jeremy Fitzhardinge
61f4237d5b xen: just completely disable XSAVE
Some (old) versions of Xen just kill the domain if it tries to set any
unknown bits in CR4, so we can't reliably probe for OSXSAVE in
CR4.

Since Xen doesn't support XSAVE for guests at the moment, and no such
support is being worked on, there's no downside in just unconditionally
masking XSAVE support.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-04-06 08:31:00 -04:00
Andre Przywara
bd22f5cfcf KVM: move and fix substitue search for missing CPUID entries
If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
it's failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Beside that this rule only applies to CPUID intercepts, which is not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and call it from kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-04-06 13:15:56 +03:00
Andre Przywara
20800bc940 KVM: fix XSAVE bit scanning
When KVM scans the 0xD CPUID leaf for propagating the XSAVE save area
leaves, it assumes that the leaves are contigious and stops at the
first zero one. On AMD hardware there is a gap, though, as LWP uses
leaf 62 to announce it's state save area.
So lets iterate through all 64 possible leaves and simply skip zero
ones to also cover later features.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-04-06 13:15:55 +03:00
Konrad Rzeszutek Wilk
d88885d092 xen/debug: Don't be so verbose with WARN on 1-1 mapping errors.
There are valid situations in which this error is not
a warning. Mainly when QEMU maps a guest memory and uses
the VM_IO flag to set the MFNs. For right now make the
WARN be WARN_ONCE. In the future we will:

 1). Remove the VM_IO code handling..
 2). .. which will also remove this debug facility.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-04-04 14:48:20 -04:00
Linus Torvalds
d7c764c4c7 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, UV: Fix kdump reboot
  x86, amd-nb: Rename CPU PCI id define for F4
  sound: Add delay.h to sound/soc/codecs/sn95031.c
  x86, mtrr, pat: Fix one cpu getting out of sync during resume
  x86, microcode: Unregister syscore_ops after microcode unloaded
  x86: Stop including <linux/delay.h> in two asm header files
2011-04-04 08:37:45 -07:00