Commit Graph

1458 Commits

Author SHA1 Message Date
Avi Kivity
e48bb497b9 KVM: MMU: Fix memory leak on guest demand faults
While backporting 72dc67a696, a gfn_to_page()
call was duplicated instead of moved (due to an unrelated patch not being
present in mainline).  This caused a page reference leak, resulting in a
fairly massive memory leak.

Fix by removing the extraneous gfn_to_page() call.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
707a18a51d KVM: VMX: convert init_rmode_tss() to slots_lock
init_rmode_tss was forgotten during the conversion from mmap_sem to
slots_lock.

INFO: task qemu-system-x86:3748 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff8053d100>] __down_read+0x86/0x9e
 [<ffffffff8053fb43>] do_page_fault+0x346/0x78e
 [<ffffffff8053d235>] trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8053dcad>] error_exit+0x0/0xa9
 [<ffffffff8035a7a7>] copy_user_generic_string+0x17/0x40
 [<ffffffff88099a8a>] :kvm:kvm_write_guest_page+0x3e/0x5f
 [<ffffffff880b661a>] :kvm_intel:init_rmode_tss+0xa7/0xf9
 [<ffffffff880b7d7e>] :kvm_intel:vmx_vcpu_reset+0x10/0x38a
 [<ffffffff8809b9a5>] :kvm:kvm_arch_vcpu_setup+0x20/0x53
 [<ffffffff8809a1e4>] :kvm:kvm_vm_ioctl+0xad/0x1cf
 [<ffffffff80249dea>] __lock_acquire+0x4f7/0xc28
 [<ffffffff8028fad9>] vfs_ioctl+0x21/0x6b
 [<ffffffff8028fd75>] do_vfs_ioctl+0x252/0x26b
 [<ffffffff8028fdca>] sys_ioctl+0x3c/0x5e
 [<ffffffff8020b01b>] system_call_after_swapgs+0x7b/0x80

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
15aaa819e2 KVM: MMU: handle page removal with shadow mapping
Do not assume that a shadow mapping will always point to the same host
frame number.  Fixes crash with madvise(MADV_DONTNEED).

[avi: move after first printk(), add another printk()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Avi Kivity
4b1a80fa65 KVM: MMU: Fix is_rmap_pte() with io ptes
is_rmap_pte() doesn't take into account io ptes, which have the avail bit set.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Avi Kivity
5dc8326282 KVM: VMX: Restore tss even on x86_64
The vmx hardware state restore restores the tss selector and base address, but
not its length.  Usually, this does not matter since most of the tss contents
is within the default length of 0x67.  However, if a process is using ioperm()
to grant itself I/O port permissions, an additional bitmap within the tss,
but outside the default length is consulted.  The effect is that the process
will receive a SIGSEGV instead of transparently accessing the port.

Fix by restoring the tss length.  Note that i386 had this working already.

Closes bugzilla 10246.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Linus Torvalds
b9e76a0074 x86-32: Pass the full resource data to ioremap()
It appears that 64-bit PCI resources cannot possibly ever have worked on
x86-32 even when the RESOURCES_64BIT config option was set, because any
driver that tried to [pci_]ioremap() the resource would have been unable
to do so because the high 32 bits would have been silently dropped on
the floor by the ioremap() routines that only used "unsigned long".

Change them to use "resource_size_t" instead, which properly encodes the
whole 64-bit resource data if RESOURCES_64BIT is enabled.

Acked-by: H. Peter Anvin <hpa@kernel.org>
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 11:22:39 -07:00
Thomas Gleixner
9e9630481e x86: revert: reserve dma32 early for gart
Revert

commit f62f1fc9ef
Author: Yinghai Lu <yhlu.kernel@gmail.com>
Date:   Fri Mar 7 15:02:50 2008 -0800

    x86: reserve dma32 early for gart

The patch has a dependency on bootmem modifications which are not .25
material that late in the -rc cycle. The problem which is addressed by
the patch is limited to machines with 256G and more memory booted with
NUMA disabled. This is not a .25 regression and the audience which is
affected by this problem is very limited, so it's safer to do the
revert than pulling in intrusive bootmem changes right now.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-22 19:25:41 +01:00
Yinghai Lu
37bff62e98 x86_64: free_bootmem should take phys
so use nodedata_phys directly.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Yinghai Lu
5dca6a1bb0 x86: trim mtrr don't close gap for resource allocation.
fix the bug reported here:

	http://bugzilla.kernel.org/show_bug.cgi?id=10232

use update_memory_range() instead of add_memory_range() directly
to avoid closing the gap.

( the new code only affects and runs on systems where the MTRR
  workaround triggers. )

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Heinz-Ado Arnolds
fc1c8925c8 x86: fix reboot problem with Dell Optiplex 745, 0KW626 board
we have seen a little problem in rebooting Dell Optiplex 745 with the
0KW626 board. Here is a small patch enabling reboot with this board,
which forces the default reboot path it into the BIOS reboot mode.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Jiri Slaby
e215f3c2c5 x86: fix fault_msg nul termination
The fault_msg text is not explictly nul terminated now in startup
assembly. Do so by converting .ascii to .asciz.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Pavel Machek
2050d45d7c x86: fix long standing bug with usb after hibernation with 4GB ram
aperture_64.c takes a piece of memory and makes it into iommu
window... but such window may not be saved by swsusp -- that leads to
oops during hibernation.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Zbigniew Luszpinski
96bcf458cb x86: hpet clock enable quirk on nVidia nForce 430
this patch allows hpet=force on nVidia nForce 430 southbridge.
This patch was tested by me on my old Asus A8N-VM CSM (where bios does not
support hpet and does not advertise it via acpi entry). My nForce430 version:
lspci -nn | grep LPC
00:0a.0 ISA bridge [0601]: nVidia Corporation MCP51 LPC Bridge [10de:0260]
(rev a2)

Kernel 2.6.24.3 after patching and using hpet=force reports this:
dmesg | grep -i hpet
Kernel command line: root=/dev/sda8 ro vga=773 video=vesafb:mtrr:4,ywrap
vt.default_utf8=0 hpet=force
Force enabled HPET at base address 0xfed00000
hpet clockevent registered
Time: hpet clocksource has been installed.

grep -i hpet /proc/timer_list
Clock Event Device: hpet
 set_next_event: hpet_legacy_next_event
 set_mode:       hpet_legacy_set_mode

grep Clock /proc/timer_list (before patching)
Clock Event Device: pit
Clock Event Device: lapic

grep Clock /proc/timer_list (after patching)
Clock Event Device: hpet
Clock Event Device: lapic

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Yinghai Lu
f62f1fc9ef x86: reserve dma32 early for gart
a system with 256 GB of RAM, when NUMA is disabled crashes the
following way:

Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Cannot allocate aperture memory hole (ffff8101c0000000,65536K)
Kernel panic - not syncing: Not enough memory for aperture
Pid: 0, comm: swapper Not tainted 2.6.25-rc4-x86-latest.git #33

Call Trace:
 [<ffffffff84037c62>] panic+0xb2/0x190
 [<ffffffff840381fc>] ? release_console_sem+0x7c/0x250
 [<ffffffff847b1628>] ? __alloc_bootmem_nopanic+0x48/0x90
 [<ffffffff847b0ac9>] ? free_bootmem+0x29/0x50
 [<ffffffff847ac1f7>] gart_iommu_hole_init+0x5e7/0x680
 [<ffffffff847b255b>] ? alloc_large_system_hash+0x16b/0x310
 [<ffffffff84506a2f>] ? _etext+0x0/0x1
 [<ffffffff847a2e8c>] pci_iommu_alloc+0x1c/0x40
 [<ffffffff847ac795>] mem_init+0x45/0x1a0
 [<ffffffff8479ff35>] start_kernel+0x295/0x380
 [<ffffffff8479f1c2>] _sinittext+0x1c2/0x230

the root cause is : memmap PMD is too big,
[ffffe200e0600000-ffffe200e07fffff] PMD ->ffff81383c000000 on node 0
almost near 4G..., and vmemmap_alloc_block will use up the ram under 4G.

solution will be:
1. make memmap allocation get memory above 4G...
2. reserve some dma32 range early before we try to set up memmap for all.
and release that before pci_iommu_alloc, so gart or swiotlb could get some
range under 4g limit for sure.

the patch is using method 2.
because method1 may need more code to handle SPARSEMEM and SPASEMEM_VMEMMAP

will get
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 4000000
Memory: 264245736k/268959744k available (8484k kernel code, 4187464k reserved, 4004k data, 724k init)

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Coleman Kane
fc115bf19b x86: add the DFF (Desktop Form Factor) Dell Optiplex 745 to the reboot errata list
We recently got some of the "Desktop Form Factor" Optiplex 745's in.  I
noticed that there's an entry for the SFF one's, but the BIOS model number
of the DFF differs from that of the SFF.  We have been reliably
experiencing the same (as far as I can tell) reboot bug as the SFF boxes.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Randy Dunlap
3178071560 x86/visws: fix printk format warnings
Fix visws printk format warnings:

/local/linsrc/linux-2.6.24-git15/arch/x86/mach-visws/traps.c:50: warning: format '%#lx' expects type 'long unsigned int', but argument 2 has type 'u32'
/local/linsrc/linux-2.6.24-git15/arch/x86/mach-visws/traps.c:50: warning: format '%#lx' expects type 'long unsigned int', but argument 3 has type 'u32'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Yinghai Lu
7d2de13762 x86: tight online check in setup_per_cpu_areas
when numa disabled I got this compile warning:

arch/x86/kernel/setup64.c: In function setup_per_cpu_areas:
arch/x86/kernel/setup64.c:147: warning: the address of
                      contig_page_data will always evaluate as true

it seems we missed checking if the node is online before we try to refer
NODE_DATA. Fix it.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:15 +01:00
Yinghai Lu
6721fc0a0d x86: fix dma_alloc_pages
memory-less node support:

this patch uses updated dev_to_node, because dev_to_node already makes sure
it returns an online node.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-21 17:06:14 +01:00
Randy Dunlap
53471121a8 documentation: Move power-related files to Documentation/power/
Move 00-INDEX entries to power/00-INDEX (and add entry for
pm_qos_interface.txt).

Update references to moved filenames.

Fix some trailing whitespace.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-03-12 18:10:51 -04:00
Thomas Gleixner
985a34bd75 x86: remove quicklists
quicklists cause a serious memory leak on 32-bit x86,
as documented at:

  http://bugzilla.kernel.org/show_bug.cgi?id=9991

the reason is that the quicklist pool is a special-purpose
cache that grows out of proportion. It is not accounted for
anywhere and users have no way to even realize that it's
the quicklists that are causing RAM usage spikes. It was
supposed to be a relatively small pool, but as demonstrated
by KOSAKI Motohiro, they can grow as large as:

  Quicklists:    1194304 kB

given how much trouble this code has caused historically,
and given that Andrew objected to its introduction on x86
(years ago), the best option at this point is to remove them.

[ any performance benefits of caching constructed pgds should
  be implemented in a more generic way (possibly within the page
  allocator), while still allowing constructed pages to be
  allocated by other workloads. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 17:11:55 +01:00
Roland McGrath
40f0933d51 x86: ia32 syscall restart fix
The code to restart syscalls after signals depends on checking for a
negative orig_ax, and for particular negative -ERESTART* values in ax.
These fields are 64 bits and for a 32-bit task they get zero-extended.
The syscall restart behavior is lost, a regression from a native 32-bit
kernel and from 64-bit tasks' behavior.

This patch fixes the problem by doing sign-extension where it matters.

For orig_ax, the only time the value should be -1 but winds up as
0x0ffffffff is via a 32-bit ptrace call. So the patch changes ptrace to
sign-extend the 32-bit orig_eax value when it's stored; it doesn't
change the checks on orig_ax, though it uses the new current_syscall()
inline to better document the subtle importance of the used of
signedness there.

The ax value is stored a lot of ways and it seems hard to get them all
sign-extended at their origins. So for that, we use the
current_syscall_ret() to sign-extend it only for 32-bit tasks at the
time of the -ERESTART* comparisons.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 17:11:54 +01:00
Ingo Molnar
9a46d7e5b6 x86: ioremap, remove WARN_ON()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 17:11:54 +01:00
Ingo Molnar
f5dbb55b99 fix BIOS PCI config cycle buglet causing ACPI boot regression
I figured out another ACPI related regression today.

randconfig testing triggered an early boot-time hang on a laptop of mine
(32-bit x86, config attached) - the screen was scrolling ACPI AML
exceptions [with no serial port and no early debugging available].

v2.6.24 works fine on that laptop with the same .config, so after a few
hours of bisection (had to restart it 3 times - other regressions
interacted), it honed in on this commit:

| 10270d4838 is first bad commit
|
| Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
| Date:   Wed Feb 13 09:56:14 2008 -0800
|
|     acpi: fix acpi_os_read_pci_configuration() misuse of raw_pci_read()

reverting this commit ontop of -rc5 gave a correctly booting kernel.

But this commit fixes a real bug so the real question is, why did it
break the bootup?

After quite some head-scratching, the following change stood out:

-                               pci_id->bus = tu8;
+                               pci_id->bus = val;

pci_id->bus is defined as u16:

   struct acpi_pci_id {
           u16 segment;
           u16 bus;
   ...

and 'tu8' changed from u8 to u32. So previously we'd unconditionally
mask the return value of acpi_os_read_pci_configuration()
(raw_pci_read()) to 8 bits, but now we just trust whatever comes back
from the PCI access routines and only crop it to 16 bits.

But if the high 8 bits of that result contains any noise then we'll
write that into ACPI's PCI ID descriptor and confuse the heck out of the
rest of ACPI.

So lets check the PCI-BIOS code on that theory. We have this codepath
for 8-bit accesses (arch/x86/pci/pcbios.c:pci_bios_read()):

        switch (len) {
        case 1:
                __asm__("lcall *(%%esi); cld\n\t"
                        "jc 1f\n\t"
                        "xor %%ah, %%ah\n"
                        "1:"
                        : "=c" (*value),
                          "=a" (result)
                        : "1" (PCIBIOS_READ_CONFIG_BYTE),
                          "b" (bx),
                          "D" ((long)reg),
                          "S" (&pci_indirect));

Aha! The "=a" output constraint puts the full 32 bits of EAX into
*value. But if the BIOS's routines set any of the high bits to nonzero,
we'll return a value with more set in it than intended.

The other, more common PCI access methods (v1 and v2 PCI reads) clear
out the high bits already, for example pci_conf1_read() does:

        switch (len) {
        case 1:
                *value = inb(0xCFC + (reg & 3));

which explicitly converts the return byte up to 32 bits and zero-extends
it.

So zero-extending the result in the PCI-BIOS read routine fixes the
regression on my laptop. ( It might fix some other long-standing issues
we had with PCI-BIOS during the past decade ... ) Both 8-bit and 16-bit
accesses were buggy.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:09:05 -07:00
Rusty Russell
4357bd9453 lguest: Revert 1ce70c4fac, fix real problem.
Ahmed managed to crash the Host in release_pgd(), which cannot be a Guest
bug, and indeed it wasn't.

The bug was that handing a 0 as the address of the toplevel page table
being manipulated can cause the lookup code in find_pgdir() to return
an uninitialized cache entry (we shadow up to 4 top level page tables
for each Guest).

Commit 37cc8d7f96 introduced this
behaviour in the Guest, uncovering the bug.

The patch which he submitted (which removed the /4 from the index
calculation) simply ensured that these high-indexed entries hit the
early exit path of guest_set_pmd().  But you get lots of segfaults in
guest userspace as the PMDs aren't being updated.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-03-11 09:35:58 +11:00
Rusty Russell
3fabc55f34 lguest: Sanitize the lguest clock.
Now the TSC code handles a zero return from calculate_cpu_khz(),
lguest can simply pass through the value it gets from the Host: if
non-zero, all the normal TSC code applies.

Otherwise (or if the Host really doesn't support TSC), the clocksource
code will fall back to the slower but reasonable lguest clock.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-03-11 09:35:57 +11:00
Roland McGrath
84c6f6046c x86_64: make ptrace always sign-extend orig_ax to 64 bits
This makes 64-bit ptrace calls setting the 64-bit orig_ax field for a
32-bit task sign-extend the low 32 bits up to 64.  This matches what a
64-bit debugger expects when tracing a 32-bit task.

This follows on my "x86_64 ia32 syscall restart fix".  This didn't
matter until that was fixed.

The debugger ignores or zeros the high half of every register slot it
sets (including the orig_rax pseudo-register) uniformly.  It expects
that the setting of the low 32 bits always has the same meaning as a
32-bit debugger setting those same 32 bits with native 32-bit
facilities.

This never arose before because the syscall restart check never
matched any -ERESTART* values due to lack of sign extension.  Before
that fix, even 32-bit ptrace setting orig_eax to -1 failed to trigger
the restart check anyway.  So this was never noticed as a regression
of 64-bit debuggers vs 32-bit debuggers on the same 64-bit kernel.

Signed-off-by: Roland McGrath <roland@redhat.com>
[ Changed to just do the sign-extension unconditionally on x86-64,
  since orig_ax is always just a small integer and doesn't need
  the full 64-bit range ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-07 19:05:58 -08:00
Peter Korsgaard
1722770f13 x86-boot: don't request VBE2 information
The new x86 setup code (4fd06960f1) broke booting on an old P3/500MHz
with an onboard Voodoo3 of mine. After debugging it, it turned out
to be caused by the fact that the vesa probing now asks for VBE2 data.

Disassembing the video BIOS shows that it overflows the vesa_general_info
structure when VBE2 data is requested because the source addresses for the
information strings which get strcpy'ed to the buffer lie outside the 32K
BIOS code (and hence contain long sequences of 0xff's).

E.G.:

get_vbe_controller_info:
00002A9C  60                pushaw
00002A9D  1E                push ds
00002A9E  0E                push cs
00002A9F  1F                pop ds
00002AA0  2BC9              sub cx,cx
00002AA2  6626813D56424532  cmp dword [es:di],0x32454256 ; "VBE2"
00002AAA  7501              jnz .1
00002AAC  41                inc cx
.1:
00002AAD  51                push cx
00002AAE  B91400            mov cx,0x14
00002AB1  BED47F            mov si, controller_header
00002AB4  57                push di
00002AB5  F3A4              rep movsb ; copy vbe1.2 header

00002AB7  B9EC00            mov cx,0xec
00002ABA  2AC0              sub al,al
00002ABC  F3AA              rep stosb ; zero pad remainder

00002ABE  5F                pop di
00002ABF  E8EB0D            call word get_memory
00002AC2  C1E002            shl ax,0x2
00002AC5  26894512          mov [es:di+0x12],ax ; total memory
00002AC9  26C745040003      mov word [es:di+0x4],0x300 ; VBE version
00002ACF  268C4D08          mov [es:di+0x8],cs
00002AD3  268C4D10          mov [es:di+0x10],cs
00002AD7  59                pop cx
00002AD8  E361              jcxz .done ; VBE2 requested?
00002ADA  8D9D0001          lea bx,[di+0x100]
00002ADE  53                push bx
00002ADF  87DF              xchg bx,di ; di now points to 2nd half
00002AE1  26C747140001      mov word [es:bx+0x14],0x100 ; sw rev

00002AE7  26897F06          mov [es:bx+0x6],di		; oem string
00002AEB  268C4708          mov [es:bx+0x8],es
00002AEF  BE5280            mov si,0x8052 ; oem string
00002AF2  E87A1B            call word strcpy

00002AF5  26897F0E          mov [es:bx+0xe],di ; video mode list
00002AF9  268C4710          mov [es:bx+0x10],es
00002AFD  B91E00            mov cx,0x1e
00002B00  BEE87F            mov si,vidmodes
00002B03  F3A5              rep movsw

00002B05  26897F16          mov [es:bx+0x16],di ; oem vendor
00002B09  268C4718          mov [es:bx+0x18],es
00002B0D  BE2480            mov si,0x8024 ; oem vendor
00002B10  E85C1B            call word strcpy

00002B13  26897F1A          mov [es:bx+0x1a],di ; oem product
00002B17  268C471C          mov [es:bx+0x1c],es
00002B1B  BE3880            mov si,0x8038 ; oem product
00002B1E  E84E1B            call word strcpy

00002B21  26897F1E          mov [es:bx+0x1e],di ; oem product rev
00002B25  268C4720          mov [es:bx+0x20],es
00002B29  BE4580            mov si,0x8045 ; oem product rev
00002B2C  E8401B            call word strcpy

00002B2F  58                pop ax
00002B30  B90001            mov cx,0x100
00002B33  2BCF              sub cx,di
00002B35  03C8              add cx,ax
00002B37  2AC0              sub al,al
00002B39  F3AA              rep stosb ; zero pad
.done:
00002B3B  1F                pop ds
00002B3C  61                popaw
00002B3D  B84F00            mov ax,0x4f
00002B40  C3                ret

(The full BIOS can be found at http://peter.korsgaard.com/vgabios.bin
if interested).

The old setup code didn't ask for VBE2 info, and the new code doesn't
actually do anything with the extra information, so the fix is to simply
not request it. Other BIOS'es might have the same problem.

Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:39:14 +01:00
Ingo Molnar
7432d149fd x86: re-add reboot fixups
Jan Beulich noticed that the reboot fixups went missing during
reboot.c unification.

(commit 4d022e35fd)

Geode and a few other rare boards with special reboot quirks are
affected.

Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:39:14 +01:00
Jan Beulich
d032b31a3a x86: fix typo in step.c
TIF_DEBUGCTLMSR has no meaning in the actual MSR...

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:39:14 +01:00
Jan Beulich
609b5297bc x86: fix merge mistake in i387.c
convert_fxsr_to_user() in 2.6.24's i387_32.c did this, and
convert_to_fxsr() also does the inverse, so I assume it's an oversight
that it is no longer being done.

[ mingo@elte.hu:

  we encode it this way because there's no space for the 'FPU Last
  Instruction Opcode' (->fop) field in the legacy user_i387_ia32_struct
  that PTRACE_GETFPREGS/PTRACE_SETFPREGS uses.

  it's probably pure legacy - i'd be surprised if any user-space relied on
  the FPU Last Opcode in any way. But indeed we used to do it previously
  so the most conservative thing is to preserve that piece of information.
]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:39:14 +01:00
Aurelien Jarno
e40cd10ccf x86: clear DF before calling signal handler
The Linux kernel currently does not clear the direction flag before
calling a signal handler, whereas the x86/x86-64 ABI requires that.

Linux had this behavior/bug forever, but this becomes a real problem
with gcc version 4.3, which assumes that the direction flag is
correctly cleared at the entry of a function.

This patches changes the setup_frame() functions to clear the
direction before entering the signal handler.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
2008-03-07 16:39:14 +01:00
Dave Jones
0e5aa8d621 [CPUFREQ] Remove debugging message from e_powersaver
We don't need to printk a message every time we transition.
Leave the code there, but ifdef'd out, as it's useful when
adding support for new processors.

Reported-by: Petr Titěra <P.Titera@century.cz>
Signed-off-by: Dave Jones <davej@redhat.com>
2008-03-05 14:45:31 -05:00
Ananth N Mavinakayanahalli
9edddaa200 Kprobes: indicate kretprobe support in Kconfig
Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant
architectures with kprobes support.  This facilitates easy handling of
in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on
kretprobes being present in the kernel.

Thanks to Sam Ravnborg for helping make the patch more lean.

Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:11 -08:00
Hugh Dickins
fcab59a318 x86: a P4 is a P6 not an i486
P4 has been coming out as CPU_FAMILY=4 instead of 6: fix MPENTIUM4 typo.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 11:55:34 -08:00
Linus Torvalds
34f10fc988 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
  x86/xen: fix DomU boot problem
  x86: not set node to cpu_to_node if the node is not online
  x86, i387: fix ptrace leakage using init_fpu()
2008-03-04 09:22:32 -08:00
Linus Torvalds
67171a3f03 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
  x86: disable KVM for Voyager and friends
  KVM: VMX: Avoid rearranging switched guest msrs while they are loaded
  KVM: MMU: Fix race when instantiating a shadow pte
  KVM: Route irq 0 to vcpu 0 exclusively
  KVM: Avoid infinite-frequency local apic timer
  KVM: make MMU_DEBUG compile again
  KVM: move alloc_apic_access_page() outside of non-preemptable region
  KVM: SVM: fix Windows XP 64 bit installation crash
  KVM: remove the usage of the mmap_sem for the protection of the memory slots.
  KVM: emulate access to MSR_IA32_MCG_CTL
  KVM: Make the supported cpuid list a host property rather than a vm property
  KVM: Fix kvm_arch_vcpu_ioctl_set_sregs so that set_cr0 works properly
  KVM: SVM: set NM intercept when enabling CR0.TS in the guest
  KVM: SVM: Fix lazy FPU switching
2008-03-04 09:22:05 -08:00
Ian Campbell
87d034f313 x86/xen: fix DomU boot problem
Construct Xen guest e820 map with a hole between 640K-1M.

It's pure luck that Xen kernels have gotten away with it in the past.

The patch below seems like the right thing to do. It certainly boots in
a domU without the DMI problem (without any of the other related patches
such as Alexander's).

Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Tested-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-04 17:10:12 +01:00
Yinghai Lu
7c9e92b6cd x86: not set node to cpu_to_node if the node is not online
resolve boot problem reported by Mel Gorman:

   http://lkml.org/lkml/2008/2/13/404

init_cpu_to_node will use cpu->apic (from MADT or mptable) and
apic->node(from SRAT or AMD config space with k8_bus_64.c) to have
cpu->node mapping, and later identify_cpu will overwrite them
again...(with nearby_node...)

this patch checks if the node is online, otherwise it will not
update cpu_node map. so keep cpu_node map to online node before
identify_cpu..., to prevent possible error.

Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-04 17:10:12 +01:00
Suresh Siddha
18a8622101 x86, i387: fix ptrace leakage using init_fpu()
This bug got introduced by the recent i387 merge:

  commit 4421011120
  Author: Roland McGrath <roland@redhat.com>
  Date:   Wed Jan 30 13:31:50 2008 +0100

      x86: x86 i387 user_regset

Current usage of unlazy_fpu() in ptrace specific routines is wrong.
unlazy_fpu() will not init fpu if the task never used math. So the
ptrace calls can expose the parent tasks FPU data in some cases.

Replace it with the init_fpu() which will init the math state, if the
task never used math before.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-04 17:10:12 +01:00
Randy Dunlap
1a4e3f89c6 x86: disable KVM for Voyager and friends
Most classic Pentiums don't have hardware virtualization extension,
and building kvm with Voyager, Visual Workstation, or NUMAQ
generates spurious failures.

Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
2008-03-04 17:42:55 +02:00
Avi Kivity
33f9c505ed KVM: VMX: Avoid rearranging switched guest msrs while they are loaded
KVM tries to run as much as possible with the guest msrs loaded instead of
host msrs, since switching msrs is very expensive.  It also tries to minimize
the number of msrs switched according to the guest mode; for example,
MSR_LSTAR is needed only by long mode guests.  This optimization is done by
setup_msrs().

However, we must not change which msrs are switched while we are running with
guest msr state:

 - switch to guest msr state
 - call setup_msrs(), removing some msrs from the list
 - switch to host msr state, leaving a few guest msrs loaded

An easy way to trigger this is to kexec an x86_64 linux guest.  Early during
setup, the guest will switch EFER to not include SCE.  KVM will stop saving
MSR_LSTAR, and on the next msr switch it will leave the guest LSTAR loaded.
The next host syscall will end up in a random location in the kernel.

Fix by reloading the host msrs before changing the msr list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:50 +02:00
Avi Kivity
f7d9c7b7b9 KVM: MMU: Fix race when instantiating a shadow pte
For improved concurrency, the guest walk is performed concurrently with other
vcpus.  This means that we need to revalidate the guest ptes once we have
write-protected the guest page tables, at which point they can no longer be
modified.

The current code attempts to avoid this check if the shadow page table is not
new, on the assumption that if it has existed before, the guest could not have
modified the pte without the shadow lock.  However the assumption is incorrect,
as the racing vcpu could have modified the pte, then instantiated the shadow
page, before our vcpu regains control:

  vcpu0        vcpu1

  fault
  walk pte

               modify pte
               fault in same pagetable
               instantiate shadow page

  lookup shadow page
  conclude it is old
  instantiate spte based on stale guest pte

We could do something clever with generation counters, but a test run by
Marcelo suggests this is unnecessary and we can just do the revalidation
unconditionally.  The pte will be in the processor cache and the check can
be quite fast.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:49 +02:00
Avi Kivity
0b975a3c2d KVM: Avoid infinite-frequency local apic timer
If the local apic initial count is zero, don't start a an hrtimer with infinite
frequency, locking up the host.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:48 +02:00
Marcelo Tosatti
24993d5349 KVM: make MMU_DEBUG compile again
the cr3 variable is now inside the vcpu->arch structure.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:47 +02:00
Marcelo Tosatti
5e4a0b3c1b KVM: move alloc_apic_access_page() outside of non-preemptable region
alloc_apic_access_page() can sleep, while vmx_vcpu_setup is called
inside a non preemptable region. Move it after put_cpu().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:46 +02:00
Joerg Roedel
a2938c8070 KVM: SVM: fix Windows XP 64 bit installation crash
While installing Windows XP 64 bit wants to access the DEBUGCTL and the last
branch record (LBR) MSRs. Don't allowing this in KVM causes the installation to
crash. This patch allow the access to these MSRs and fixes the issue.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:45 +02:00
Izik Eidus
72dc67a696 KVM: remove the usage of the mmap_sem for the protection of the memory slots.
This patch replaces the mmap_sem lock for the memory slots with a new
kvm private lock, it is needed beacuse untill now there were cases where
kvm accesses user memory while holding the mmap semaphore.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:40 +02:00
Rafael J. Wysocki
9b5cf48b06 x86: revert "x86: CPA: avoid split of alias mappings"
Revert:

  commit 8be8f54bae
  Author: Thomas Gleixner <tglx@linutronix.de>
  Date:   Sat Feb 23 20:43:21 2008 +0100

      x86: CPA: avoid split of alias mappings

because it clearly mishandles the case when __change_page_attr(), called
from __change_page_attr_set_clr(), changes cpa->processed to 1 and
cpa_process_alias(cpa) is executed right after that.

This crashes my x86-64 test box early in the boot process
(ref. http://bugzilla.kernel.org/show_bug.cgi?id=10140#c4).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-03 14:18:27 +01:00
Joerg Roedel
c7ac679c16 KVM: emulate access to MSR_IA32_MCG_CTL
Injecting an GP when accessing this MSR lets Windows crash when running some
stress test tools in KVM.  So this patch emulates access to this MSR.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:22:37 +02:00
Avi Kivity
674eea0fc4 KVM: Make the supported cpuid list a host property rather than a vm property
One of the use cases for the supported cpuid list is to create a "greatest
common denominator" of cpu capabilities in a server farm.  As such, it is
useful to be able to get the list without creating a virtual machine first.

Since the code does not depend on the vm in any way, all that is needed is
to move it to the device ioctl handler.  The capability identifier is also
changed so that binaries made against -rc1 will fail gracefully.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:22:25 +02:00