Move kvm, uml, and lguest subdirectories under a common "virtual" directory, i.e.:

  cd Documentation
  mkdir virtual
  git mv kvm uml lguest virtual

Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
committed by Randy Dunlap
parent bfd412db9e
commit ed16648eb5
1451
Documentation/virtual/kvm/api.txt
Normal file
File diff suppressed because it is too large
45
Documentation/virtual/kvm/cpuid.txt
Normal file
@@ -0,0 +1,45 @@
KVM CPUID bits
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
=====================================================

A guest running on a kvm host can check some of its features using
cpuid. This is not always guaranteed to work, since userspace can
mask out some, or even all, KVM-related cpuid features before launching
a guest.

KVM cpuid functions are:

function: KVM_CPUID_SIGNATURE (0x40000000)
returns : eax = 0,
          ebx = 0x4b4d564b,
          ecx = 0x564b4d56,
          edx = 0x4d.
Note that the value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
This function queries the presence of KVM cpuid leaves.
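
As a rough illustration of the signature check, the sketch below unpacks
the three registers byte by byte (least significant byte first) and
compares the result with "KVMKVMKVM" padded with NULs to 12 bytes. The
helper names are hypothetical and not taken from the kernel; how the
registers are obtained (the cpuid instruction itself) is left out.

```c
#include <stdint.h>
#include <string.h>

/* Signature leaf number, as given in the text above. */
#define KVM_CPUID_SIGNATURE 0x40000000u

/*
 * Unpack ebx, ecx, edx into a 12-byte buffer plus a terminating NUL,
 * least significant byte of each register first.
 * (Hypothetical helper, for illustration only.)
 */
static void kvm_unpack_signature(uint32_t ebx, uint32_t ecx, uint32_t edx,
                                 char out[13])
{
    const uint32_t regs[3] = { ebx, ecx, edx };
    int i, j;

    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++)
            out[i * 4 + j] = (char)((regs[i] >> (8 * j)) & 0xff);
    out[12] = '\0';
}

/* Return 1 if the registers spell "KVMKVMKVM" followed by NULs, else 0. */
static int kvm_signature_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[13];

    kvm_unpack_signature(ebx, ecx, edx, sig);
    return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0;
}
```

Keeping the comparison separate from the cpuid instruction lets the
check be exercised with the constants listed above.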


function: KVM_CPUID_FEATURES (0x40000001)
returns : ebx, ecx, edx = 0
          eax = an OR'ed group of (1 << flag), where each flag is:


flag                               || value || meaning
==============================================================================
KVM_FEATURE_CLOCKSOURCE            ||     0 || kvmclock available at msrs
                                   ||       || 0x11 and 0x12.
------------------------------------------------------------------------------
KVM_FEATURE_NOP_IO_DELAY           ||     1 || not necessary to perform delays
                                   ||       || on PIO operations.
------------------------------------------------------------------------------
KVM_FEATURE_MMU_OP                 ||     2 || deprecated.
------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE2           ||     3 || kvmclock available at msrs
                                   ||       || 0x4b564d00 and 0x4b564d01
------------------------------------------------------------------------------
KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
                                   ||       || writing to msr 0x4b564d02
------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                   ||       || per-cpu warps are expected in
                                   ||       || kvmclock.
------------------------------------------------------------------------------
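
Since eax is a bitmask built from (1 << flag), testing a feature comes
down to a shift and a mask. The following is a minimal sketch using the
flag numbers from the table; the function name is made up for this
example.

```c
#include <stdint.h>

/* Flag numbers from the table above; each one is a bit position in eax. */
enum {
    KVM_FEATURE_CLOCKSOURCE            = 0,
    KVM_FEATURE_NOP_IO_DELAY           = 1,
    KVM_FEATURE_MMU_OP                 = 2,
    KVM_FEATURE_CLOCKSOURCE2           = 3,
    KVM_FEATURE_ASYNC_PF               = 4,
    KVM_FEATURE_CLOCKSOURCE_STABLE_BIT = 24
};

/*
 * Return 1 if 'flag' is set in the eax value returned by leaf 0x40000001.
 * (Hypothetical helper; flag numbers above 31 are treated as absent.)
 */
static int kvm_feature_present(uint32_t features_eax, unsigned int flag)
{
    if (flag > 31)
        return 0;
    return (int)((features_eax >> flag) & 1u);
}
```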
25
Documentation/virtual/kvm/locking.txt
Normal file
@@ -0,0 +1,25 @@
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

(to be written)

2. Reference
------------

Name:     kvm_lock
Type:     raw_spinlock
Arch:     any
Protects: - vm_list
          - hardware virtualization enable/disable
Comment:  'raw' because hardware enabling/disabling must be atomic wrt
          migration.

Name:     kvm_arch::tsc_write_lock
Type:     raw_spinlock
Arch:     x86
Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
          - tsc offset in vmcb
Comment:  'raw' because updating the tsc offsets must not be preempted.
348
Documentation/virtual/kvm/mmu.txt
Normal file
@@ -0,0 +1,348 @@
The x86 kvm shadow mmu
======================

The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
for presenting a standard x86 mmu to the guest, while translating guest
physical addresses to host physical addresses.

The mmu code attempts to satisfy the following requirements:

- correctness:    the guest should not be able to determine that it is running
                  on an emulated mmu except for timing (we attempt to comply
                  with the specification, not emulate the characteristics of
                  a particular implementation such as tlb size)
- security:       the guest must not be able to touch host memory not assigned
                  to it
- performance:    minimize the performance penalty imposed by the mmu
- scaling:        need to scale to large memory and large vcpu guests
- hardware:       support the full range of x86 virtualization hardware
- integration:    Linux memory management code must be in control of guest memory
                  so that swapping, page migration, page merging, transparent
                  hugepages, and similar features work without change
- dirty tracking: report writes to guest memory to enable live migration
                  and framebuffer-based displays
- footprint:      keep the amount of pinned kernel memory low (most memory
                  should be shrinkable)
- reliability:    avoid multipage or GFP_ATOMIC allocations

Acronyms
========

pfn   host page frame number
hpa   host physical address
hva   host virtual address
gfn   guest frame number
gpa   guest physical address
gva   guest virtual address
ngpa  nested guest physical address
ngva  nested guest virtual address
pte   page table entry (used also to refer generically to paging structure
      entries)
gpte  guest pte (referring to gfns)
spte  shadow pte (referring to pfns)
tdp   two dimensional paging (vendor neutral term for NPT and EPT)

Virtual and real hardware supported
===================================

The mmu supports first-generation mmu hardware, which allows an atomic switch
of the current paging mode and cr3 during guest entry, as well as
two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware
it exposes is the traditional 2/3/4 level x86 mmu, with support for global
pages, pae, pse, pse36, cr0.wp, and 1GB pages. Work is in progress to support
exposing NPT capable hardware on NPT capable hosts.

Translation
===========

The primary job of the mmu is to program the processor's mmu to translate
addresses for the guest. Different translations are required at different
times:

- when guest paging is disabled, we translate guest physical addresses to
  host physical addresses (gpa->hpa)
- when guest paging is enabled, we translate guest virtual addresses, to
  guest physical addresses, to host physical addresses (gva->gpa->hpa)
- when the guest launches a guest of its own, we translate nested guest
  virtual addresses, to nested guest physical addresses, to guest physical
  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)

The primary challenge is to encode between 1 and 3 translations into hardware
that supports only 1 (traditional) or 2 (tdp) translations. When the
number of required translations matches the hardware, the mmu operates in
direct mode; otherwise it operates in shadow mode (see below).

Memory
======

Guest memory (gpa) is part of the user address space of the process that is
using kvm. Userspace defines the translation between guest addresses and user
addresses (gpa->hva); note that two gpas may alias to the same hva, but not
vice versa.

These hvas may be backed using any method available to the host: anonymous
memory, file backed memory, and device memory. Memory might be paged by the
host at any time.

Events
======

The mmu is driven by events, some from the guest, some from the host.

Guest generated events:
- writes to control registers (especially cr3)
- invlpg/invlpga instruction execution
- access to missing or protected translations

Host generated events:
- changes in the gpa->hpa translation (either through gpa->hva changes or
  through hva->hpa changes)
- memory pressure (the shrinker)

Shadow pages
============

The principal data structure is the shadow page, 'struct kvm_mmu_page'. A
shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A
shadow page may contain a mix of leaf and nonleaf sptes.

A nonleaf spte allows the hardware mmu to reach the leaf pages and
is not related to a translation directly. It points to other shadow pages.

A leaf spte corresponds to either one or two translations encoded into
one paging structure entry. These are always the lowest level of the
translation stack, with optional higher level translations left to NPT/EPT.
Leaf ptes point at guest pages.

The following table shows translations encoded by leaf ptes, with higher-level
translations in parentheses:

 Non-nested guests:
  nonpaging:     gpa->hpa
  paging:        gva->gpa->hpa
  paging, tdp:   (gva->)gpa->hpa
 Nested guests:
  non-tdp:       ngva->gpa->hpa  (*)
  tdp:           (ngva->)ngpa->gpa->hpa

(*) the guest hypervisor will encode the ngva->gpa translation into its page
    tables if npt is not present

Shadow pages contain the following information:
  role.level:
    The level in the shadow paging hierarchy that this shadow page belongs to.
    1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
  role.direct:
    If set, leaf sptes reachable from this page are for a linear range.
    Examples include real mode translation, large guest pages backed by small
    host pages, and gpa->hpa translations when NPT or EPT is active.
    The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
    by role.level (2MB for first level, 1GB for second level, 0.5TB for third
    level, 256TB for fourth level).
    If clear, this page corresponds to a guest page table denoted by the gfn
    field.
  role.quadrant:
    When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
    sptes. That means a guest page table contains more ptes than the host,
    so multiple shadow pages are needed to shadow one guest page.
    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
    first or second 512-gpte block in the guest page table. For second-level
    page tables, each 32-bit gpte is converted to two 64-bit sptes
    (since each first-level guest page is shadowed by two first-level
    shadow pages) so role.quadrant takes values in the range 0..3. Each
    quadrant maps 1GB virtual address space.
  role.access:
    Inherited guest access permissions in the form uwx. Note execute
    permission is positive, not negative.
  role.invalid:
    The page is invalid and should not be used. It is a root page that is
    currently pinned (by a cpu hardware register pointing to it); once it is
    unpinned it will be destroyed.
  role.cr4_pae:
    Contains the value of cr4.pae for which the page is valid (i.e. whether
    32-bit or 64-bit gptes are in use).
  role.nxe:
    Contains the value of efer.nxe for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
  gfn:
    Either the guest page table containing the translations shadowed by this
    page, or the base page frame for linear translations. See role.direct.
  spt:
    A pageful of 64-bit sptes containing the translations for this page.
    Accessed by both kvm and hardware.
    The page pointed to by spt will have its page->private pointing back
    at the shadow page structure.
    sptes in spt point either at guest pages, or at lower-level shadow pages.
    Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
    at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte.
    The spt array forms a DAG structure with the shadow page as a node, and
    guest pages as leaves.
  gfns:
    An array of 512 guest frame numbers, one for each present pte. Used to
    perform a reverse map from a pte to a gfn. When role.direct is set, any
    element of this array can be calculated from the gfn field when needed;
    in this case, the array of gfns is not allocated. See role.direct and gfn.
  slot_bitmap:
    A bitmap containing one bit per memory slot. If the page contains a pte
    mapping a page from memory slot n, then bit n of slot_bitmap will be set
    (if a page is aliased among several slots, then it is not guaranteed that
    all slots will be marked).
    Used during dirty logging to avoid scanning a shadow page if none of its
    pages need tracking.
  root_count:
    A counter keeping track of how many hardware registers (guest cr3 or
    pdptrs) are now pointing at the page. While this counter is nonzero, the
    page cannot be destroyed. See role.invalid.
  multimapped:
    Whether there exist multiple sptes pointing at this page.
  parent_pte/parent_ptes:
    If multimapped is zero, parent_pte points at the single spte that points at
    this page's spt. Otherwise, parent_ptes points at a data structure
    with a list of parent_ptes.
  unsync:
    If true, then the translations in this page may not match the guest's
    translation. This is equivalent to the state of the tlb when a pte is
    changed but before the tlb entry is flushed. Accordingly, unsync ptes
    are synchronized when the guest executes invlpg or flushes its tlb by
    other means. Valid for leaf pages.
  unsync_children:
    How many sptes in the page point at pages that are unsync (or have
    unsynchronized children).
  unsync_child_bitmap:
    A bitmap indicating which sptes in spt point (directly or indirectly) at
    pages that may be unsynchronized. Used to quickly locate all unsynchronized
    pages reachable from a given page.
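
To make the role.direct arithmetic concrete, the sketch below computes
the size of the linear range for a given role.level, and the gfn that
corresponds to one spte of a direct page (which is why the gfns array is
not needed in that case). It assumes 4k base pages and 512 sptes per
shadow page, as stated above; the function names are illustrative and
not the kernel's.

```c
#include <stdint.h>

#define PAGE_SHIFT  12  /* 4k base pages */
#define LEVEL_BITS   9  /* 512 sptes per shadow page */

/*
 * Size in bytes of the linear range covered by a direct shadow page at
 * the given role.level (1..4): 2MB, 1GB, 0.5TB, 256TB.
 */
static uint64_t direct_range_size(unsigned int level)
{
    return (uint64_t)1 << (PAGE_SHIFT + LEVEL_BITS * level);
}

/*
 * gfn covered by spte number 'index' (0..511) of a direct shadow page
 * whose gfn field is 'base_gfn'. Each spte at level L spans
 * 512^(L-1) frames.
 */
static uint64_t direct_spte_gfn(uint64_t base_gfn, unsigned int level,
                                unsigned int index)
{
    return base_gfn + ((uint64_t)index << (LEVEL_BITS * (level - 1)));
}
```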

Reverse map
===========

The mmu maintains a reverse mapping whereby all ptes mapping a page can be
reached given its gfn. This is used, for example, when swapping out a page.

Synchronized and unsynchronized pages
=====================================

The guest uses two events to synchronize its tlb and page tables: tlb flushes
and page invalidations (invlpg).

A tlb flush means that we need to synchronize all sptes reachable from the
guest's cr3. This is expensive, so we keep all guest page tables write
protected, and synchronize sptes to gptes when a gpte is written.

A special case is when a guest page table is reachable from the current
guest cr3. In this case, the guest is obliged to issue an invlpg instruction
before using the translation. We take advantage of that by removing write
protection from the guest page, and allowing the guest to modify it freely.
We synchronize modified gptes when the guest invokes invlpg. This reduces
the amount of emulation we have to do when the guest modifies multiple gptes,
or when a guest page is no longer used as a page table and is used for
random guest data.

As a side effect we have to resynchronize all reachable unsynchronized shadow
pages on a tlb flush.


Reaction to events
==================

- guest page fault (or npt page fault, or ept violation)

This is the most complicated event. The cause of a page fault can be:

  - a true guest fault (the guest translation won't allow the access) (*)
  - access to a missing translation
  - access to a protected translation
    - when logging dirty pages, memory is write protected
    - synchronized shadow pages are write protected (*)
  - access to untranslatable memory (mmio)

  (*) not applicable in direct mode

Handling a page fault is performed as follows:

 - if needed, walk the guest page tables to determine the guest translation
   (gva->gpa or ngpa->gpa)
   - if permissions are insufficient, reflect the fault back to the guest
 - determine the host page
   - if this is an mmio request, there is no host page; call the emulator
     to emulate the instruction instead
 - walk the shadow page table to find the spte for the translation,
   instantiating missing intermediate page tables as necessary
 - try to unsynchronize the page
   - if successful, we can let the guest continue and modify the gpte
 - emulate the instruction
   - if failed, unshadow the page and let the guest continue
 - update any translations that were modified by the instruction

invlpg handling:

  - walk the shadow page hierarchy and drop affected translations
  - try to reinstantiate the indicated translation in the hope that the
    guest will use it in the near future

Guest control register updates:

- mov to cr3
  - look up new shadow roots
  - synchronize newly reachable shadow pages

- mov to cr0/cr4/efer
  - set up mmu context for new paging mode
  - look up new shadow roots
  - synchronize newly reachable shadow pages

Host translation updates:

  - mmu notifier called with updated hva
  - look up affected sptes through reverse map
  - drop (or update) translations

Emulating cr0.wp
================

If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
works for the guest kernel, not just guest userspace. When the guest
cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
semantics require allowing any guest kernel access plus user read access).

We handle this by mapping the permissions to two possible sptes, depending
on fault type:

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
  disallows user access)
- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
  write access)

(user write faults generate a #PF)
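
The choice between the two sptes can be written down as a small lookup.
The sketch below only models the two permission bits discussed here;
the types and names are invented for this example and do not correspond
to kernel structures.

```c
#include <stdbool.h>

/* The two spte permission bits relevant to cr0.wp emulation. */
struct spte_perm {
    bool u;  /* user accessible */
    bool w;  /* writable */
};

enum fault_kind {
    FAULT_KERNEL_WRITE,
    FAULT_READ
};

/*
 * Pick spte permissions for a gpte with u=1, w=0 while the guest runs
 * with cr0.wp=0 and the host keeps cr0.wp=1.
 * (Hypothetical helper illustrating the two cases in the text.)
 */
static struct spte_perm wp0_spte_perm(enum fault_kind kind)
{
    struct spte_perm p;

    if (kind == FAULT_KERNEL_WRITE) {
        p.u = false;  /* full kernel access, no user access */
        p.w = true;
    } else {
        p.u = true;   /* full read access, no kernel write access */
        p.w = false;
    }
    return p;
}
```

Neither spte grants both u and w, so whichever access is not covered
faults again and the spte is switched to the other form.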

Large pages
===========

The mmu supports all combinations of large and small guest and host pages.
Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
two separate 2M pages, on both guest and host, since the mmu always uses PAE
paging.

To instantiate a large spte, four constraints must be satisfied:

- the spte must point to a large host page
- the guest pte must be a large pte of at least equivalent size (if tdp is
  enabled, there is no guest pte and this condition is satisfied)
- if the spte will be writeable, the large page frame may not overlap any
  write-protected pages
- the guest page must be wholly contained by a single memory slot

To check the last two conditions, the mmu maintains a ->write_count set of
arrays for each memory slot and large page size. Every write protected page
causes its write_count to be incremented, thus preventing instantiation of
a large spte. The frames at the end of an unaligned memory slot have
artificially inflated ->write_counts so they can never be instantiated.

Further reading
===============

- NPT presentation from KVM Forum 2008
  http://www.linux-kvm.org/wiki/images/c/c8/KvmForum2008%24kdf2008_21.pdf
187
Documentation/virtual/kvm/msr.txt
Normal file
@@ -0,0 +1,187 @@
KVM-specific MSRs.
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
=====================================================

KVM makes use of some custom MSRs to service some requests.

Custom MSRs have a range reserved for them, that goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
but they are deprecated and their use is discouraged.

Custom MSR list
---------------

The currently supported custom MSR list is:

MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00

    data: 4-byte aligned physical address of a memory area which must be
    in guest RAM. This memory is expected to hold a copy of the following
    structure:

    struct pvclock_wall_clock {
        u32   version;
        u32   sec;
        u32   nsec;
    } __attribute__((__packed__));

    whose data will be filled in by the hypervisor. The hypervisor is only
    guaranteed to update this data at the moment of MSR write.
    Users that want to reliably query this information more than once have
    to write more than once to this MSR. Fields have the following meanings:

        version: guest has to check version before and after grabbing
        time information and check that they are both equal and even.
        An odd version indicates an in-progress update.

        sec: number of seconds for wallclock.

        nsec: number of nanoseconds for wallclock.

    Note that although MSRs are per-CPU entities, the effect of this
    particular MSR is global.

    Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
    leaf prior to usage.
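
The version protocol described above (read version, read data, read
version again, accept only if both reads are equal and even) might look
roughly like this on the guest side. This is a sketch, not the kernel's
implementation: the function name and the bounded retry count are
assumptions, and the read barriers a real guest needs between the
accesses are left out.

```c
#include <stdint.h>

/* Same layout as the structure shown above, using <stdint.h> types. */
struct pvclock_wall_clock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
} __attribute__((__packed__));

/*
 * Try up to max_tries times to take a consistent snapshot.
 * Returns 0 and fills *sec / *nsec on success, -1 otherwise.
 * (Hypothetical helper; memory barriers omitted for brevity.)
 */
static int read_wall_clock(const volatile struct pvclock_wall_clock *wc,
                           int max_tries, uint32_t *sec, uint32_t *nsec)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        uint32_t before = wc->version;
        uint32_t s = wc->sec;
        uint32_t ns = wc->nsec;
        uint32_t after = wc->version;

        /* Equal and even: no update was in progress during the reads. */
        if (before == after && (before & 1u) == 0) {
            *sec = s;
            *nsec = ns;
            return 0;
        }
    }
    return -1;
}
```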

MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01

    data: 4-byte aligned physical address of a memory area which must be in
    guest RAM, plus an enable bit in bit 0. This memory is expected to hold
    a copy of the following structure:

    struct pvclock_vcpu_time_info {
        u32   version;
        u32   pad0;
        u64   tsc_timestamp;
        u64   system_time;
        u32   tsc_to_system_mul;
        s8    tsc_shift;
        u8    flags;
        u8    pad[2];
    } __attribute__((__packed__)); /* 32 bytes */

    whose data will be filled in by the hypervisor periodically. Only one
    write, or registration, is needed for each VCPU. The interval between
    updates of this structure is arbitrary and implementation-dependent.
    The hypervisor may update this structure at any time it sees fit until
    anything with bit0 == 0 is written to it.

    Fields have the following meanings:

        version: guest has to check version before and after grabbing
        time information and check that they are both equal and even.
        An odd version indicates an in-progress update.

        tsc_timestamp: the tsc value at the current VCPU at the time
        of the update of this structure. Guests can subtract this value
        from current tsc to derive a notion of elapsed time since the
        structure update.

        system_time: a host notion of monotonic time, including sleep
        time at the time this structure was last updated. Unit is
        nanoseconds.

        tsc_to_system_mul: a function of the tsc frequency. One has
        to multiply any tsc-related quantity by this value to get
        a value in nanoseconds, besides dividing by 2^tsc_shift

        tsc_shift: cycle to nanosecond divider, as a power of two, to
        allow for shift rights. One has to shift right any tsc-related
        quantity by this value to get a value in nanoseconds, besides
        multiplying by tsc_to_system_mul.

        With this information, guests can derive per-CPU time by
        doing:

            time = (current_tsc - tsc_timestamp)
            time = (time * tsc_to_system_mul) >> tsc_shift
            time = time + system_time

        flags: bits in this field indicate extended capabilities
        coordinated between the guest and the hypervisor. Availability
        of specific flags has to be checked in 0x40000001 cpuid leaf.
        Current flags are:

         flag bit   | cpuid bit    | meaning
        -------------------------------------------------------------
                    |              | time measures taken across
             0      |      24      | multiple cpus are guaranteed to
                    |              | be monotonic
        -------------------------------------------------------------

    Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
    leaf prior to usage.
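
The three-line formula above is schematic. The sketch below shows one
way the scaling can be carried out, under the assumption (taken from the
way Linux guests scale pvclock deltas, not from this text) that
tsc_shift is a signed shift applied to the delta first, and that
tsc_to_system_mul is a 32-bit binary fraction, so the product is shifted
right by 32. Function names are made up for this example; the version
check is omitted.

```c
#include <stdint.h>

/*
 * Scale a tsc delta to nanoseconds.
 * Assumed convention: shift the delta by tsc_shift (left if positive,
 * right if negative), then multiply by mul / 2^32.
 */
static uint64_t scale_tsc_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* (delta * mul) >> 32, computed without a 128-bit intermediate. */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* time = system_time + scaled (current_tsc - tsc_timestamp) */
static uint64_t guest_time_ns(uint64_t current_tsc, uint64_t tsc_timestamp,
                              uint64_t system_time, uint32_t mul, int8_t shift)
{
    return system_time +
           scale_tsc_delta(current_tsc - tsc_timestamp, mul, shift);
}
```

With mul = 0x80000000 (one half) and shift = 0, a delta of 1000 cycles
scales to 500 ns.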


MSR_KVM_WALL_CLOCK:  0x11

    data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

    This MSR falls outside the reserved KVM range and may be removed in the
    future. Its usage is deprecated.

    Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
    leaf prior to usage.

MSR_KVM_SYSTEM_TIME: 0x12

    data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

    This MSR falls outside the reserved KVM range and may be removed in the
    future. Its usage is deprecated.

    Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
    leaf prior to usage.

    The suggested algorithm for detecting kvmclock presence is then:

        if (!kvm_para_available())    /* refer to cpuid.txt */
            return NON_PRESENT;

        flags = cpuid_eax(0x40000001);
        if (flags & (1 << 3)) {           /* KVM_FEATURE_CLOCKSOURCE2 */
            msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
            msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
            return PRESENT;
        } else if (flags & (1 << 0)) {    /* KVM_FEATURE_CLOCKSOURCE */
            msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
            msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
            return PRESENT;
        } else
            return NON_PRESENT;

MSR_KVM_ASYNC_PF_EN: 0x4b564d02
    data: Bits 63-6 hold the 64-byte aligned physical address of a
    64-byte memory area which must be in guest RAM and must be
    zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
    when asynchronous page faults are enabled on the vcpu, 0 when
    disabled. Bit 1 is 1 if asynchronous page faults can be injected
    when the vcpu is in cpl == 0.

    The first 4 bytes of the 64-byte memory area will be written to by
    the hypervisor at the time of asynchronous page fault (APF)
    injection to indicate the type of asynchronous page fault. A value
    of 1 means that the page referred to by the page fault is not
    present. A value of 2 means that the page is now available. Disabling
    interrupts inhibits APFs. The guest must not enable interrupts
    before the reason is read, or it may be overwritten by another
    APF. Since APF uses the same exception vector as a regular page
    fault, the guest must reset the reason to 0 before it does
    something that can generate a normal page fault. If the APF reason
    is 0 during a page fault, it is a regular page fault.

    During delivery of a type 1 APF, cr2 contains a token that will
    be used to notify the guest when the missing page becomes
    available. When the page becomes available, a type 2 APF is sent with
    cr2 set to the token associated with the page. There is a special
    token, 0xffffffff, which tells the vcpu that it should wake
    up all processes waiting for APFs; no individual type 2 APFs
    will be sent in that case.

    If APF is disabled while there are outstanding APFs, they will
    not be delivered.

    Currently a type 2 APF will always be delivered on the same vcpu as
    type 1 was, but the guest should not rely on that.
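
As an illustration of the MSR_KVM_ASYNC_PF_EN layout, the sketch below
builds the value to write (address in bits 63-6, enable in bit 0) and
shows the "read the reason, then reset it to 0" step a guest page fault
handler would perform. Only the address and the enable bit are modelled;
all names are hypothetical, and the actual MSR write is not shown.

```c
#include <stdint.h>

#define ASYNC_PF_ENABLED    (1ull << 0)   /* bit 0: APF enabled on this vcpu */
#define ASYNC_PF_ADDR_MASK  (~0x3full)    /* bits 63-6: 64-byte aligned address */

/* Reason values written by the hypervisor into the first 4 bytes. */
enum apf_reason {
    APF_NONE             = 0,  /* regular page fault */
    APF_PAGE_NOT_PRESENT = 1,
    APF_PAGE_READY       = 2
};

/*
 * Build the MSR value for a shared area at guest physical address
 * 'area_gpa'. Returns 0 (APF disabled) if the address is not 64-byte
 * aligned. (Hypothetical helper.)
 */
static uint64_t async_pf_msr_value(uint64_t area_gpa)
{
    if (area_gpa & ~ASYNC_PF_ADDR_MASK)
        return 0;
    return area_gpa | ASYNC_PF_ENABLED;
}

/*
 * Read the reason word and reset it to 0, to be done with interrupts
 * still disabled so another APF cannot overwrite it first.
 */
static uint32_t apf_take_reason(volatile uint32_t *reason_word)
{
    uint32_t reason = *reason_word;

    *reason_word = APF_NONE;
    return reason;
}
```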
196
Documentation/virtual/kvm/ppc-pv.txt
Normal file
@@ -0,0 +1,196 @@
The PPC KVM paravirtual interface
=================================

The basic execution principle by which KVM on PowerPC works is to run all kernel
space code in PR=1 which is user space. This way we trap all privileged
instructions and can emulate them accordingly.

Unfortunately that is also the downfall. There are quite a few privileged
instructions that needlessly return us to the hypervisor even though they
could be handled differently.

This is what the PPC PV interface helps with. It takes privileged instructions
and transforms them into unprivileged ones with some help from the hypervisor.
This cuts down virtualization costs by about 50% on some of my benchmarks.

The code for that interface can be found in arch/powerpc/kernel/kvm*

Querying for existence
======================

To find out if we're running on KVM or not, we leverage the device tree. When
Linux is running on KVM, a node /hypervisor exists. That node contains a
compatible property with the value "linux,kvm".

Once you have determined that you're running under a PV capable KVM, you can
use hypercalls as described below.

KVM hypercalls
==============

Inside the device tree's /hypervisor node there's a property called
'hypercall-instructions'. This property contains at most 4 opcodes that make
up the hypercall. To call a hypercall, just call these instructions.

The parameters are as follows:

        Register        IN                      OUT

        r0              -                       volatile
        r3              1st parameter           Return code
        r4              2nd parameter           1st output value
        r5              3rd parameter           2nd output value
        r6              4th parameter           3rd output value
        r7              5th parameter           4th output value
        r8              6th parameter           5th output value
        r9              7th parameter           6th output value
        r10             8th parameter           7th output value
        r11             hypercall number        8th output value
        r12             -                       volatile

Hypercall definitions are shared in generic code, so the same hypercall numbers
apply for x86 and powerpc alike with the exception that each KVM hypercall
also needs to be ORed with the KVM vendor code which is (42 << 16).

Return codes can be as follows:

        Code            Meaning

        0               Success
        12              Hypercall not implemented
        <0              Error
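
As a small worked example of the numbering rule and the return code
table, the sketch below ORs a generic hypercall number with the vendor
code and classifies a return value. The helper names and the
HCALL_OTHER catch-all are assumptions made for illustration; issuing the
hypercall itself (the opcodes from 'hypercall-instructions') is not
shown.

```c
#include <stdint.h>

#define KVM_VENDOR_CODE  42u   /* vendor code from the text above */

/* Value to place in r11 for a given generic hypercall number. */
static uint32_t ppc_kvm_hcall_nr(uint32_t generic_nr)
{
    return generic_nr | (KVM_VENDOR_CODE << 16);
}

enum hcall_status {
    HCALL_OK,
    HCALL_UNIMPLEMENTED,
    HCALL_ERROR,
    HCALL_OTHER          /* positive value not listed in the table */
};

/* Classify the return code found in r3 after the hypercall. */
static enum hcall_status classify_hcall_return(long rc)
{
    if (rc == 0)
        return HCALL_OK;
    if (rc == 12)
        return HCALL_UNIMPLEMENTED;
    if (rc < 0)
        return HCALL_ERROR;
    return HCALL_OTHER;
}
```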
||||
|
||||
The magic page
|
||||
==============
|
||||
|
||||
To enable communication between the hypervisor and guest there is a new shared
|
||||
page that contains parts of supervisor visible register state. The guest can
|
||||
map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
|
||||
|
||||
With this hypercall issued the guest always gets the magic page mapped at the
|
||||
desired location in effective and physical address space. For now, we always
|
||||
map the page to -4096. This way we can access it using absolute load and store
|
||||
functions. The following instruction reads the first field of the magic page:
|
||||
|
||||
ld rX, -4096(0)
|
||||
|
||||
The interface is designed to be extensible should there be need later to add
|
||||
additional registers to the magic page. If you add fields to the magic page,
|
||||
also define a new hypercall feature to indicate that the host can give you more
|
||||
registers. Only if the host supports the additional features, make use of them.
|
||||
|
||||
The magic page has the following layout as described in
|
||||
arch/powerpc/include/asm/kvm_para.h:
|
||||
|
||||
struct kvm_vcpu_arch_shared {
|
||||
__u64 scratch1;
|
||||
__u64 scratch2;
|
||||
__u64 scratch3;
|
||||
__u64 critical; /* Guest may not get interrupts if == r1 */
|
||||
__u64 sprg0;
|
||||
__u64 sprg1;
|
||||
__u64 sprg2;
|
||||
__u64 sprg3;
|
||||
__u64 srr0;
|
||||
__u64 srr1;
|
||||
__u64 dar;
|
||||
__u64 msr;
|
||||
__u32 dsisr;
|
||||
__u32 int_pending; /* Tells the guest if we have an interrupt */
|
||||
};
|
||||
|
||||
Additions to the page must only occur at the end. Struct fields are always 32
|
||||
or 64 bit aligned, depending on them being 32 or 64 bit wide respectively.
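
The layout and alignment rules can be checked with a small userspace sketch. The struct below mirrors the one above with uint64_t/uint32_t standing in for __u64/__u32; the struct and macro names are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the magic page layout shown above. */
struct magic_page_example {
    uint64_t scratch1, scratch2, scratch3;
    uint64_t critical;
    uint64_t sprg0, sprg1, sprg2, sprg3;
    uint64_t srr0, srr1;
    uint64_t dar;
    uint64_t msr;
    uint32_t dsisr;
    uint32_t int_pending;
};

/* The page is mapped at -4096, i.e. the last 4 KiB of the address space. */
#define MAGIC_PAGE_BASE ((uint64_t)-4096)

/* Effective address of a field, as used by an absolute load or store. */
#define MAGIC_FIELD_EA(field) \
    (MAGIC_PAGE_BASE + offsetof(struct magic_page_example, field))

/* 64-bit fields are 64-bit aligned, 32-bit fields are 32-bit aligned. */
_Static_assert(offsetof(struct magic_page_example, msr) % 8 == 0, "msr alignment");
_Static_assert(offsetof(struct magic_page_example, dsisr) % 4 == 0, "dsisr alignment");
```

Because additions only occur at the end, the offsets of the existing fields stay stable for guests built against an older layout.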
|
||||
|
||||
Magic page features
|
||||
===================
|
||||
|
||||
When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
|
||||
a second return value is passed to the guest. This second return value contains
|
||||
a bitmap of available features inside the magic page.
|
||||
|
||||
The following enhancements to the magic page are currently available:
|
||||
|
||||
KVM_MAGIC_FEAT_SR Maps SR registers r/w in the magic page
|
||||
|
||||
For enhanced features in the magic page, please check for the existence of the
|
||||
feature before using them!
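
A sketch of such a check on the feature bitmap returned by the hypercall. The bit position used for the SR feature here is an assumption for illustration; the real value has to be taken from arch/powerpc/include/asm/kvm_para.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define EXAMPLE_MAGIC_FEAT_SR (1ull << 0)

/* True only if every requested feature bit is set in the returned bitmap. */
static inline bool magic_page_has(uint64_t features, uint64_t wanted)
{
    assert(wanted != 0);    /* asking for no feature is a caller bug */
    return (features & wanted) == wanted;
}
```

A guest would only touch the SR fields of the magic page when magic_page_has(features, EXAMPLE_MAGIC_FEAT_SR) is true.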
|
||||
|
||||
MSR bits
|
||||
========
|
||||
|
||||
The MSR contains bits that require hypervisor intervention and bits that do
|
||||
not require direct hypervisor intervention because they only get interpreted
|
||||
when entering the guest or don't have any impact on the hypervisor's behavior.
|
||||
|
||||
The following bits are safe to be set inside the guest:
|
||||
|
||||
MSR_EE
|
||||
MSR_RI
|
||||
MSR_CR
|
||||
MSR_ME
|
||||
|
||||
If any other bit changes in the MSR, please still use mtmsr(d).
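
One way to express this rule is to compare the old and new MSR values against a mask of safe bits. The sketch below is an assumption about how a guest might structure the check, not the kernel's implementation; the two bit values shown are the Book3S positions of MSR_EE and MSR_RI, and the full mask is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Book3S bit values for two of the bits listed above. */
#define EX_MSR_EE (1ull << 15)
#define EX_MSR_RI (1ull << 1)

/*
 * True if going from old_msr to new_msr only toggles bits in safe_mask,
 * so the copy in the magic page may be updated directly. Otherwise the
 * guest has to fall back to mtmsr(d).
 */
static inline bool msr_change_is_safe(uint64_t old_msr, uint64_t new_msr,
                                      uint64_t safe_mask)
{
    return ((old_msr ^ new_msr) & ~safe_mask) == 0;
}
```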
|
||||
|
||||
Patched instructions
|
||||
====================
|
||||
|
||||
The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
|
||||
respectively on 32 bit systems with an added offset of 4 to accommodate for big
|
||||
endianness.
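
The reason for the offset of 4 can be shown in portable C: in big-endian byte order the low 32 bits of a 64-bit field sit in its last four bytes. The helper names below are made up for this sketch.

```c
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order, as "std" would. */
static inline void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 32-bit big-endian word, as "lwz" would. */
static inline uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* The low word of a 64-bit big-endian field lives 4 bytes past its start. */
static inline uint32_t load_low_word(const uint8_t *field)
{
    return load_be32(field + 4);
}
```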
|
||||
|
||||
The following is a list of mappings the Linux kernel performs when running as
|
||||
guest. Implementing any of those mappings is optional, as the instruction traps
|
||||
also act on the shared page. So calling privileged instructions still works as
|
||||
before.
|
||||
|
||||
From To
|
||||
==== ==
|
||||
|
||||
mfmsr rX ld rX, magic_page->msr
|
||||
mfsprg rX, 0 ld rX, magic_page->sprg0
|
||||
mfsprg rX, 1 ld rX, magic_page->sprg1
|
||||
mfsprg rX, 2 ld rX, magic_page->sprg2
|
||||
mfsprg rX, 3 ld rX, magic_page->sprg3
|
||||
mfsrr0 rX ld rX, magic_page->srr0
|
||||
mfsrr1 rX ld rX, magic_page->srr1
|
||||
mfdar rX ld rX, magic_page->dar
|
||||
mfdsisr rX lwz rX, magic_page->dsisr
|
||||
|
||||
mtmsr rX std rX, magic_page->msr
|
||||
mtsprg 0, rX std rX, magic_page->sprg0
|
||||
mtsprg 1, rX std rX, magic_page->sprg1
|
||||
mtsprg 2, rX std rX, magic_page->sprg2
|
||||
mtsprg 3, rX std rX, magic_page->sprg3
|
||||
mtsrr0 rX std rX, magic_page->srr0
|
||||
mtsrr1 rX std rX, magic_page->srr1
|
||||
mtdar rX std rX, magic_page->dar
|
||||
mtdsisr rX stw rX, magic_page->dsisr
|
||||
|
||||
tlbsync nop
|
||||
|
||||
mtmsrd rX, 0 b <special mtmsr section>
|
||||
mtmsr rX b <special mtmsr section>
|
||||
|
||||
mtmsrd rX, 1 b <special mtmsrd section>
|
||||
|
||||
[Book3S only]
|
||||
mtsrin rX, rY b <special mtsrin section>
|
||||
|
||||
[BookE only]
|
||||
wrteei [0|1] b <special wrteei section>
|
||||
|
||||
|
||||
Some instructions require more logic to determine what's going on than a load
|
||||
or store instruction can deliver. To enable patching of those, we keep some
|
||||
RAM around where we can live translate instructions to. What happens is the
|
||||
following:
|
||||
|
||||
1) copy emulation code to memory
|
||||
2) patch that code to fit the emulated instruction
|
||||
3) patch that code to return to the original pc + 4
|
||||
4) patch the original instruction to branch to the new code
|
||||
|
||||
That way we can inject an arbitrary amount of code as replacement for a single
|
||||
instruction. This allows us to check for pending interrupts when setting EE=1
|
||||
for example.
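
Steps 3 and 4 both come down to encoding a relative PowerPC "b" instruction. The following is a sketch of that encoding only, with hypothetical names; copying and patching the emulation code itself is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define EX_INST_B      0x48000000u  /* primary opcode 18, AA = 0, LK = 0 */
#define EX_INST_B_MASK 0x03fffffcu  /* 24-bit LI field, word aligned */

/*
 * Encode a relative "b" from address `from` to address `to`.
 * The displacement must be word aligned and within the +/- 32 MiB
 * reach of the instruction.
 */
static inline uint32_t ppc_encode_branch(uint32_t from, uint32_t to)
{
    int32_t disp = (int32_t)(to - from);

    assert((disp & 3) == 0);
    assert(disp >= -0x02000000 && disp <= 0x01fffffc);
    return EX_INST_B | ((uint32_t)disp & EX_INST_B_MASK);
}
```

Under these assumptions, step 4 would write ppc_encode_branch(orig_pc, new_code) over the original instruction, and step 3 would place ppc_encode_branch(new_code_end, orig_pc + 4) at the end of the copied code.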
|
38
Documentation/virtual/kvm/review-checklist.txt
Normal file
@@ -0,0 +1,38 @@
|
||||
Review checklist for kvm patches
|
||||
================================
|
||||
|
||||
1. The patch must follow Documentation/CodingStyle and
|
||||
Documentation/SubmittingPatches.
|
||||
|
||||
2. Patches should be against kvm.git master branch.
|
||||
|
||||
3. If the patch introduces or modifies a new userspace API:
|
||||
- the API must be documented in Documentation/virtual/kvm/api.txt
|
||||
- the API must be discoverable using KVM_CHECK_EXTENSION
|
||||
|
||||
4. New state must include support for save/restore.
|
||||
|
||||
5. New features must default to off (userspace should explicitly request them).
|
||||
Performance improvements can and should default to on.
|
||||
|
||||
6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID.
|
||||
|
||||
7. Emulator changes should be accompanied by unit tests for qemu-kvm.git
|
||||
kvm/test directory.
|
||||
|
||||
8. Changes should be vendor neutral when possible. Changes to common code
|
||||
are better than duplicating changes to vendor code.
|
||||
|
||||
9. Similarly, prefer changes to arch independent code over arch dependent
|
||||
code.
|
||||
|
||||
10. User/kernel interfaces and guest/host interfaces must be 64-bit clean
|
||||
(all variables and sizes naturally aligned on 64-bit; use specific types
|
||||
only - u64 rather than ulong).
|
||||
|
||||
11. New guest visible features must either be documented in a hardware manual
|
||||
or be accompanied by documentation.
|
||||
|
||||
12. Features must be robust against reset and kexec - for example, shared
|
||||
host/guest memory must be unshared to prevent the host from writing to
|
||||
guest memory that the guest has not reserved for this purpose.
|
612
Documentation/virtual/kvm/timekeeping.txt
Normal file
@@ -0,0 +1,612 @@
|
||||
|
||||
Timekeeping Virtualization for X86-Based Architectures
|
||||
|
||||
Zachary Amsden <zamsden@redhat.com>
|
||||
Copyright (c) 2010, Red Hat. All rights reserved.
|
||||
|
||||
1) Overview
|
||||
2) Timing Devices
|
||||
3) TSC Hardware
|
||||
4) Virtualization Problems
|
||||
|
||||
=========================================================================
|
||||
|
||||
1) Overview
|
||||
|
||||
One of the most complicated parts of the X86 platform, and specifically,
|
||||
the virtualization of this platform is the plethora of timing devices available
|
||||
and the complexity of emulating those devices. In addition, virtualization of
|
||||
time introduces a new set of challenges because it introduces a multiplexed
|
||||
division of time beyond the control of the guest CPU.
|
||||
|
||||
First, we will describe the various timekeeping hardware available, then
|
||||
present some of the problems which arise and solutions available, giving
|
||||
specific recommendations for certain classes of KVM guests.
|
||||
|
||||
The purpose of this document is to collect data and information relevant to
|
||||
timekeeping which may be difficult to find elsewhere, specifically,
|
||||
information relevant to KVM and hardware-based virtualization.
|
||||
|
||||
=========================================================================
|
||||
|
||||
2) Timing Devices
|
||||
|
||||
First we discuss the basic hardware devices available. TSC and the related
|
||||
KVM clock are special enough to warrant a full exposition and are described in
|
||||
the following section.
|
||||
|
||||
2.1) i8254 - PIT
|
||||
|
||||
One of the first timer devices available is the programmable interval timer,
|
||||
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
|
||||
channels which can be programmed to deliver periodic or one-shot interrupts.
|
||||
These three channels can be configured in different modes and have individual
|
||||
counters. Channels 1 and 2 were not available for general use in the original
|
||||
IBM PC, and historically were connected to control RAM refresh and the PC
|
||||
speaker. Now the PIT is typically integrated as part of an emulated chipset
|
||||
and a separate physical PIT is not used.
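
As a worked example of programming a periodic rate from the fixed base clock, the sketch below computes the reload value for a requested interrupt frequency. The function name is an assumption; the note that a programmed value of 0 counts as 65536 reflects the behaviour of the 16-bit counters in binary mode.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_BASE_HZ 1193182u    /* fixed base clock from the text */

/*
 * Reload value for a channel to fire at roughly `hz` interrupts per
 * second, rounded to nearest. The counters are 16 bits wide; a
 * programmed value of 0 is treated as 65536, the slowest rate
 * (about 18.2 Hz).
 */
static inline uint32_t pit_reload_value(uint32_t hz)
{
    assert(hz > 0);
    uint32_t n = (PIT_BASE_HZ + hz / 2) / hz;
    assert(n >= 1 && n <= 65536);
    return n;
}
```

For a 1000 Hz tick this gives 1193, so the actual rate is 1193182 / 1193, slightly above 1000 Hz.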
|
||||
|
||||
The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
|
||||
using single or multiple byte access to the I/O ports. There are 6 modes
|
||||
available, but not all modes are available to all timers, as only timer 2
|
||||
has a connected gate input, required for modes 1 and 5. The gate line is
|
||||
controlled by port 61h, bit 0, as illustrated in the following diagram.
|
||||
|
||||
 --------------             ----------------
|              |           |                |
|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
|    Clock     |   |       |                |
 --------------    |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----         ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----        \___/
Port 61h, bit 1 -----------------------------------/
|
||||
|
||||
The timer modes are now described.
|
||||
|
||||
Mode 0: Single Timeout. This is a one-shot software timeout that counts down
|
||||
when the gate is high (always true for timers 0 and 1). When the count
|
||||
reaches zero, the output goes high.
|
||||
|
||||
Mode 1: Triggered One-shot. The output is initially set high. When the gate
|
||||
line is set high, a countdown is initiated (which does not stop if the gate is
|
||||
lowered), during which the output is set low. When the count reaches zero,
|
||||
the output goes high.
|
||||
|
||||
Mode 2: Rate Generator. The output is initially set high. When the countdown
|
||||
reaches 1, the output goes low for one count and then returns high. The value
|
||||
is reloaded and the countdown automatically resumes. If the gate line goes
|
||||
low, the count is halted. If the output is low when the gate is lowered, the
|
||||
output automatically goes high (this only affects timer 2).
|
||||
|
||||
Mode 3: Square Wave. This generates a high / low square wave. The count
|
||||
determines the length of the pulse, which alternates between high and low
|
||||
when zero is reached. The count only proceeds when gate is high and is
|
||||
automatically reloaded on reaching zero. The count is decremented twice at
|
||||
each clock to generate a full high / low cycle at the full periodic rate.
|
||||
If the count is even, the clock remains high for N/2 counts and low for N/2
|
||||
counts; if the count is odd, the clock is high for (N+1)/2 counts and low
|
||||
for (N-1)/2 counts. Only even values are latched by the counter, so odd
|
||||
values are not observed when reading. This is the intended mode for timer 2,
|
||||
which generates sine-like tones by low-pass filtering the square wave output.
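
The even / odd rule for Mode 3 can be written as a small helper. This is only a sketch of the arithmetic stated above, with made-up names.

```c
#include <stdint.h>

struct pit_square_wave {
    uint32_t high;  /* counts the output stays high */
    uint32_t low;   /* counts the output stays low */
};

/* Split a mode 3 count N into its high and low phases. */
static inline struct pit_square_wave pit_mode3_phases(uint32_t n)
{
    struct pit_square_wave w;

    w.high = (n + 1) / 2;   /* N/2 when N is even, (N+1)/2 when odd */
    w.low = n / 2;          /* N/2 when N is even, (N-1)/2 when odd */
    return w;
}
```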
|
||||
|
||||
Mode 4: Software Strobe. After programming this mode and loading the counter,
|
||||
the output remains high until the counter reaches zero. Then the output
|
||||
goes low for 1 clock cycle and returns high. The counter is not reloaded.
|
||||
Counting only occurs when gate is high.
|
||||
|
||||
Mode 5: Hardware Strobe. After programming and loading the counter, the
|
||||
output remains high. When the gate is raised, a countdown is initiated
|
||||
(which does not stop if the gate is lowered). When the counter reaches zero,
|
||||
the output goes low for 1 clock cycle and then returns high. The counter is
|
||||
not reloaded.
|
||||
|
||||
In addition to normal binary counting, the PIT supports BCD counting. The
|
||||
command port, 0x43 is used to set the counter and mode for each of the three
|
||||
timers.
|
||||
|
||||
PIT commands, issued to port 0x43, using the following bit encoding:
|
||||
|
||||
Bit 7-4: Command (See table below)
|
||||
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
||||
Bit 0 : Binary (0) / BCD (1)
|
||||
|
||||
Command table:
|
||||
|
||||
0000 - Latch Timer 0 count for port 0x40
|
||||
sample and hold the count to be read in port 0x40;
|
||||
additional commands ignored until counter is read;
|
||||
mode bits ignored.
|
||||
|
||||
0001 - Set Timer 0 LSB mode for port 0x40
|
||||
set timer to read LSB only and force MSB to zero;
|
||||
mode bits set timer mode
|
||||
|
||||
0010 - Set Timer 0 MSB mode for port 0x40
|
||||
set timer to read MSB only and force LSB to zero;
|
||||
mode bits set timer mode
|
||||
|
||||
0011 - Set Timer 0 16-bit mode for port 0x40
|
||||
set timer to read / write LSB first, then MSB;
|
||||
mode bits set timer mode
|
||||
|
||||
0100 - Latch Timer 1 count for port 0x41 - as described above
|
||||
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
||||
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
||||
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
||||
|
||||
1000 - Latch Timer 2 count for port 0x42 - as described above
|
||||
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
||||
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
||||
1011 - Set Timer 2 16-bit mode for port 0x42 - as described above
|
||||
|
||||
1101 - General counter latch
|
||||
Latch combination of counters into corresponding ports
|
||||
Bit 3 = Counter 2
|
||||
Bit 2 = Counter 1
|
||||
Bit 1 = Counter 0
|
||||
Bit 0 = Unused
|
||||
|
||||
1110 - Latch timer status
|
||||
Latch combination of counter mode into corresponding ports
|
||||
Bit 3 = Counter 2
|
||||
Bit 2 = Counter 1
|
||||
Bit 1 = Counter 0
|
||||
|
||||
The output of ports 0x40-0x42 following this command will be:
|
||||
|
||||
Bit 7 = Output pin
|
||||
Bit 6 = Count loaded (0 if timer has expired)
|
||||
Bit 5-4 = Read / Write mode
|
||||
01 = LSB only
|
||||
10 = MSB only
|
||||
11 = LSB / MSB (16-bit)
|
||||
Bit 3-1 = Mode
|
||||
Bit 0 = Binary (0) / BCD mode (1)
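
The bit encoding of the command byte can be summarised in code. This sketch covers only the per-timer latch and set commands from the command table (0000 through 1011); the enum and function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits of the command field, per the command table. */
enum pit_access { PIT_LATCH = 0, PIT_LSB = 1, PIT_MSB = 2, PIT_LSB_MSB = 3 };

/* Build the byte written to port 0x43 for timer 0..2. */
static inline uint8_t pit_command(unsigned timer, enum pit_access access,
                                  unsigned mode, int bcd)
{
    assert(timer <= 2 && mode <= 5);
    unsigned command = (timer << 2) | (unsigned)access;     /* bits 7-4 */
    return (uint8_t)((command << 4) | (mode << 1) | (bcd ? 1u : 0u));
}
```

For example, timer 0 in 16-bit access, Mode 2, binary counting yields 0x34, and timer 2 in 16-bit access, Mode 3 yields 0xb6.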
|
||||
|
||||
2.2) RTC
|
||||
|
||||
The second device which was available in the original PC was the MC146818 real
|
||||
time clock. The original device is now obsolete, and usually emulated by the
|
||||
system chipset, sometimes by an HPET and some frankenstein IRQ routing.
|
||||
|
||||
The RTC is accessed through CMOS variables, which use an index register to
|
||||
control which bytes are read. Since there is only one index register, read
|
||||
of the CMOS and read of the RTC require lock protection (in addition, it is
|
||||
dangerous to allow userspace utilities such as hwclock to have direct RTC
|
||||
access, as they could corrupt kernel reads and writes of CMOS memory).
|
||||
|
||||
The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
|
||||
can function as a periodic timer, an additional once a day alarm, and can issue
|
||||
interrupts after an update of the CMOS registers by the MC146818 is complete.
|
||||
The type of interrupt is signalled in the RTC status registers.
|
||||
|
||||
The RTC will update the current time fields by battery power even while the
|
||||
system is off. The current time fields should not be read while an update is
|
||||
in progress, as indicated in the status register.
|
||||
|
||||
The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
|
||||
programmed to a 32kHz divider if the RTC is to count seconds.
|
||||
|
||||
This is the RAM map originally used for the RTC/CMOS:
|
||||
|
||||
Location    Size    Description
------------------------------------------
00h         byte    Current second (BCD)
01h         byte    Seconds alarm (BCD)
02h         byte    Current minute (BCD)
03h         byte    Minutes alarm (BCD)
04h         byte    Current hour (BCD)
05h         byte    Hours alarm (BCD)
06h         byte    Current day of week (BCD)
07h         byte    Current day of month (BCD)
08h         byte    Current month (BCD)
09h         byte    Current year (BCD)
0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 mS
                                  0010 = 7.8125 mS
                                  0011 = .122070 mS
                                  0100 = .244141 mS
                                     ...
                                  1101 = 125 mS
                                  1110 = 250 mS
                                  1111 = 500 mS
0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave interrupt enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
32h         byte    Current century BCD (*)
|
||||
(*) location vendor specific and now determined from ACPI global tables
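
Two small helpers illustrate how software typically interprets this map: converting a BCD time field to binary, and turning the Register A rate selection into a frequency. This is a sketch with assumed names; it takes the 32.768 kHz time base and treats selections 0001 and 0010 as 3.90625 ms and 7.8125 ms periods (256 Hz and 128 Hz), as in the MC146818 data sheet.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one BCD byte (e.g. 0x59) to binary (59). */
static inline unsigned bcd_to_bin(uint8_t bcd)
{
    return (bcd >> 4) * 10u + (bcd & 0x0f);
}

/*
 * Periodic interrupt frequency in Hz for Register A bits 3-0.
 * Returns 0 when the periodic timer is disabled.
 */
static inline unsigned rtc_periodic_hz(unsigned rate_select)
{
    assert(rate_select <= 15);
    if (rate_select == 0)
        return 0;
    if (rate_select <= 2)   /* 0001 and 0010 wrap around: 256 Hz, 128 Hz */
        return 256u >> (rate_select - 1);
    return 32768u >> (rate_select - 1);
}
```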
|
||||
|
||||
2.3) APIC
|
||||
|
||||
On Pentium and later processors, an on-board timer is available to each CPU
|
||||
as part of the Advanced Programmable Interrupt Controller. The APIC is
|
||||
accessed through memory-mapped registers and provides interrupt service to each
|
||||
CPU, used for IPIs and local timer interrupts.
|
||||
|
||||
Although in theory the APIC is a safe and stable source for local interrupts,
|
||||
in practice, many bugs and glitches have occurred due to the special nature of
|
||||
the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
|
||||
the use of the APIC and that workarounds may be required. In addition, some of
|
||||
these workarounds pose unique constraints for virtualization - requiring either
|
||||
extra overhead incurred from extra reads of memory-mapped I/O or additional
|
||||
functionality that may be more computationally expensive to implement.
|
||||
|
||||
Since the APIC is documented quite well in the Intel and AMD manuals, we will
|
||||
avoid repetition of the detail here. It should be pointed out that the APIC
|
||||
timer is programmed through the LVT (local vector table) timer register, is capable
|
||||
of one-shot or periodic operation, and is based on the bus clock divided down
|
||||
by the programmable divider register.
|
||||
|
||||
2.4) HPET
|
||||
|
||||
HPET is quite complex, and was originally intended to replace the PIT / RTC
|
||||
support of the X86 PC. It remains to be seen whether that will be the case, as
|
||||
the de facto standard of PC hardware is to emulate these older devices. Some
|
||||
systems designated as legacy free may support only the HPET as a hardware timer
|
||||
device.
|
||||
|
||||
The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
|
||||
but allowing implementation freedom to support many more. It also imposes no
|
||||
fixed rate on the timer frequency, but does impose some extremal values on
|
||||
frequency, error and slew.
|
||||
|
||||
In general, the HPET is recommended as a high precision (compared to PIT / RTC)
|
||||
time source which is independent of local variation (as there is only one HPET
|
||||
in any given system). The HPET is also memory-mapped, and its presence is
|
||||
indicated through ACPI tables by the BIOS.
|
||||
|
||||
Detailed specification of the HPET is beyond the current scope of this
|
||||
document, as it is also very well documented elsewhere.
|
||||
|
||||
2.5) Offboard Timers
|
||||
|
||||
Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
|
||||
timing chips built into the cards which may have registers which are accessible
|
||||
to kernel or user drivers. To the author's knowledge, using these to generate
|
||||
a clocksource for a Linux or other kernel has not yet been attempted and is in
|
||||
general frowned upon as not playing by the agreed rules of the game. Such a
|
||||
timer device would require additional support to be virtualized properly and is
|
||||
not considered important at this time as no known operating system does this.
|
||||
|
||||
=========================================================================
|
||||
|
||||
3) TSC Hardware
|
||||
|
||||
The TSC or time stamp counter is relatively simple in theory; it counts
|
||||
instruction cycles issued by the processor, which can be used as a measure of
|
||||
time. In practice, due to a number of problems, it is the most complicated
|
||||
timekeeping device to use.
|
||||
|
||||
The TSC is represented internally as a 64-bit MSR which can be read with the
|
||||
RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
|
||||
limitations restricted writes to the TSC: generally on old hardware it
|
||||
was only possible to write the low 32-bits of the 64-bit counter, and the upper
|
||||
32-bits of the counter were cleared. Now, however, on Intel processors family
|
||||
0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
|
||||
has been lifted and all 64-bits are writable. On AMD systems, the ability to
|
||||
write the TSC MSR is not an architectural guarantee.
|
||||
|
||||
The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
|
||||
means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
|
||||
|
||||
Some vendors have implemented an additional instruction, RDTSCP, which returns
|
||||
atomically not just the TSC, but an indicator which corresponds to the
|
||||
processor number. This can be used to index into an array of TSC variables to
|
||||
determine offset information in SMP systems where TSCs are not synchronized.
|
||||
The presence of this instruction must be determined by consulting CPUID feature
|
||||
bits.
|
||||
|
||||
Both VMX and SVM provide extension fields in the virtualization hardware which
|
||||
allow the guest visible TSC to be offset by a constant. Newer implementations
|
||||
promise to allow the TSC to additionally be scaled, but this hardware is not
|
||||
yet widely available.
|
||||
|
||||
3.1) TSC synchronization
|
||||
|
||||
The TSC is a CPU-local clock in most implementations. This means, on SMP
|
||||
platforms, the TSCs of different CPUs may start at different times depending
|
||||
on when the CPUs are powered on. Generally, CPUs on the same die will share
|
||||
the same clock, however, this is not always the case.
|
||||
|
||||
The BIOS may attempt to resynchronize the TSCs during the poweron process and
|
||||
the operating system or other system software may attempt to do this as well.
|
||||
Several hardware limitations make the problem worse - if it is not possible to
|
||||
write the full 64-bits of the TSC, it may be impossible to match the TSC in
|
||||
newly arriving CPUs to that of the rest of the system, resulting in
|
||||
unsynchronized TSCs. This may be done by BIOS or system software, but in
|
||||
practice, getting a perfectly synchronized TSC will not be possible unless all
|
||||
values are read from the same clock, which generally only is possible on single
|
||||
socket systems or those with special hardware support.
|
||||
|
||||
3.2) TSC and CPU hotplug
|
||||
|
||||
As touched on already, CPUs which arrive later than the boot time of the system
|
||||
may not have a TSC value that is synchronized with the rest of the system.
|
||||
Either system software, BIOS, or SMM code may actually try to establish the TSC
|
||||
to a value matching the rest of the system, but a perfect match is usually not
|
||||
a guarantee. This can have the effect of bringing a system from a state where
|
||||
TSC is synchronized back to a state where TSC synchronization flaws, however
|
||||
small, may be exposed to the OS and any virtualization environment.
|
||||
|
||||
3.3) TSC and multi-socket / NUMA
|
||||
|
||||
Multi-socket systems, especially large multi-socket systems are likely to have
|
||||
individual clocksources rather than a single, universally distributed clock.
|
||||
Since these clocks are driven by different crystals, they will not have
|
||||
perfectly matched frequency, and temperature and electrical variations will
|
||||
cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
|
||||
exact clock and bus design, the drift may or may not be fixed in absolute
|
||||
error, and may accumulate over time.
|
||||
|
||||
In addition, very large systems may deliberately slew the clocks of individual
|
||||
cores. This technique, known as spread-spectrum clocking, reduces EMI at the
|
||||
clock frequency and harmonics of it, which may be required to pass FCC
|
||||
standards for telecommunications and computer equipment.
|
||||
|
||||
It is recommended not to trust the TSCs to remain synchronized on NUMA or
|
||||
multiple socket systems for these reasons.
|
||||
|
||||
3.4) TSC and C-states
|
||||
|
||||
C-states, or idling states of the processor, especially C1E and deeper sleep
|
||||
states may be problematic for TSC as well. The TSC may stop advancing in such
|
||||
a state, resulting in a TSC which is behind that of other CPUs when execution
|
||||
is resumed. Such CPUs must be detected and flagged by the operating system
|
||||
based on CPU and chipset identifications.
|
||||
|
||||
The TSC in such a case may be corrected by catching it up to a known external
|
||||
clocksource.
|
||||
|
||||
3.5) TSC frequency change / P-states
|
||||
|
||||
To make things slightly more interesting, some CPUs may change frequency. They
|
||||
may or may not run the TSC at the same rate, and because the frequency change
|
||||
may be staggered or slewed, at some points in time, the TSC rate may not be
|
||||
known other than falling within a range of values. In this case, the TSC will
|
||||
not be a stable time source, and must be calibrated against a known, stable,
|
||||
external clock to be a usable source of time.
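
Calibration against an external clock amounts to dividing a TSC delta by a reference interval. The sketch below shows only that arithmetic, under the assumption that the caller has sampled the TSC at both ends of an interval timed by a stable clock; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the TSC rate in kHz from two TSC samples taken around a
 * reference interval of ref_ns nanoseconds.
 */
static inline uint64_t tsc_khz_estimate(uint64_t tsc_start, uint64_t tsc_end,
                                        uint64_t ref_ns)
{
    assert(ref_ns > 0);
    uint64_t delta = tsc_end - tsc_start;   /* unsigned, so wrap-safe */

    /* cycles per ns times 1e6 is cycles per ms, i.e. kHz. */
    return delta * 1000000u / ref_ns;
}
```

The multiplication can overflow for very long intervals, so a real implementation would keep the calibration window short or use wider arithmetic.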
|
||||
|
||||
Whether the TSC runs at a constant rate or scales with the P-state is model
|
||||
dependent and must be determined by inspecting CPUID, chipset or vendor
|
||||
specific MSR fields.
|
||||
|
||||
In addition, some vendors have known bugs where the P-state is actually
|
||||
compensated for properly during normal operation, but when the processor is
|
||||
inactive, the P-state may be raised temporarily to service cache misses from
|
||||
other processors. In such cases, the TSC on halted CPUs could advance faster
|
||||
than that of non-halted processors. AMD Turion processors are known to have
|
||||
this problem.
|
||||
|
||||
3.6) TSC and STPCLK / T-states
|
||||
|
||||
External signals given to the processor may also have the effect of stopping
|
||||
the TSC. This is typically done for thermal emergency power control to prevent
|
||||
an overheating condition, and typically, there is no way to detect that this
|
||||
condition has happened.
|
||||
|
||||
3.7) TSC virtualization - VMX
|
||||
|
||||
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
|
||||
field specified in the VMCS. Special instructions must be used to read and
|
||||
write the VMCS field.
|
||||
|
||||
3.8) TSC virtualization - SVM
|
||||
|
||||
SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||||
instructions, which is enough for full virtualization of TSC in any manner. In
|
||||
addition, SVM allows passing through the host TSC plus an additional offset
|
||||
field specified in the SVM control block.
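
For both VMX and SVM, the pass-through case reduces to adding an offset to the host TSC. The following sketch shows that relation and how a hypervisor might pick the offset; the function names are assumptions, and reading or writing the actual VMCS or control block field is not shown.

```c
#include <stdint.h>

/* Guest-visible TSC when the hardware adds an offset to the host TSC. */
static inline uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;     /* modulo 2^64 */
}

/* Offset that makes the guest read `wanted` when the host reads `host_tsc`. */
static inline int64_t tsc_offset_for(uint64_t host_tsc, uint64_t wanted)
{
    return (int64_t)(wanted - host_tsc);
}
```

A guest that should see its TSC start at zero would, for example, be given tsc_offset_for(host_tsc_at_creation, 0).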
|
||||
|
||||
3.9) TSC feature bits in Linux
|
||||
|
||||
In summary, there is no way to guarantee the TSC remains in perfect
|
||||
synchronization unless it is explicitly guaranteed by the architecture. Even
|
||||
if so, the TSCs in multi-sockets or NUMA systems may still run independently
|
||||
despite being locally consistent.
|
||||
|
||||
The following feature bits are used by Linux to signal various TSC attributes,
|
||||
but they can only be taken to be meaningful for UP or single node systems.
|
||||
|
||||
X86_FEATURE_TSC : The TSC is available in hardware
|
||||
X86_FEATURE_RDTSCP : The RDTSCP instruction is available
|
||||
X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
|
||||
X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
|
||||
X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
|
||||
|
||||
4) Virtualization Problems
|
||||
|
||||
Timekeeping is especially problematic for virtualization because a number of
|
||||
challenges arise. The most obvious problem is that time is now shared between
|
||||
the host and, potentially, a number of virtual machines. Thus the virtual
|
||||
operating system does not run with 100% usage of the CPU, despite the fact that
|
||||
it may very well make that assumption. It may expect it to remain true to very
|
||||
exacting bounds when interrupt sources are disabled, but in reality only its
|
||||
virtual interrupt sources are disabled, and the machine may still be preempted
|
||||
at any time. This causes problems as the passage of real time, the injection
|
||||
of machine interrupts and the associated clock sources are no longer completely
|
||||
synchronized with real time.
|
||||
|
||||
This same problem can occur on native hardware to a degree, as SMM mode may
|
||||
steal cycles on X86 systems when SMM mode is used by the
|
||||
BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
|
||||
cause similar problems to virtualization makes it a good justification for
|
||||
solving many of these problems on bare metal.
|
||||
|
||||
4.1) Interrupt clocking
|
||||
|
||||
One of the most immediate problems that occurs with legacy operating systems
|
||||
is that the system timekeeping routines are often designed to keep track of
|
||||
time by counting periodic interrupts. These interrupts may come from the PIT
|
||||
or the RTC, but the problem is the same: the host virtualization engine may not
|
||||
be able to deliver the proper number of interrupts per second, and so guest
|
||||
time may fall behind. This is especially problematic if a high interrupt rate
|
||||
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
|
||||
guests.
|
||||
|
||||
There are three approaches to solving this problem; first, it may be possible
|
||||
to simply ignore it. Guests which have a separate time source for tracking
|
||||
'wall clock' or 'real time' may not need any adjustment of their interrupts to
|
||||
maintain proper time. If this is not sufficient, it may be necessary to inject
|
||||
additional interrupts into the guest in order to increase the effective
|
||||
interrupt rate. This approach leads to complications in extreme conditions,
|
||||
where host load or guest lag is too much to compensate for, and thus another
|
||||
solution to the problem has arisen: the guest may need to become aware of lost
|
||||
ticks and compensate for them internally. Although promising in theory, the
|
||||
implementation of this policy in Linux has been extremely error prone, and a
|
||||
number of buggy variants of lost tick compensation are distributed across
|
||||
commonly used Linux systems.
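
The second approach, injecting additional interrupts, needs a count of how many ticks the guest is behind. A minimal sketch of that bookkeeping, assuming the host tracks elapsed time and delivered ticks per timer (the function name and parameters are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of periodic timer interrupts the guest is still owed, given
 * the host time elapsed since the timer was started and the number of
 * interrupts already delivered.
 */
static inline uint64_t ticks_owed(uint64_t elapsed_ns, uint64_t period_ns,
                                  uint64_t delivered)
{
    assert(period_ns > 0);
    uint64_t due = elapsed_ns / period_ns;

    return due > delivered ? due - delivered : 0;
}
```

How quickly the owed ticks are reinjected is a policy decision; delivering them all at once can itself confuse a guest.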
|
||||
|
||||
Windows uses periodic RTC clocking as a means of keeping time internally, and
|
||||
thus requires interrupt slewing to keep proper time. It does use a low enough
|
||||
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
|
||||
practice.
|
||||
|
||||
4.2) TSC sampling and serialization

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers. As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source. One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay. By
definition, the counter, once read, is already old. However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order. Such execution is called
non-serialized. Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap-and-emulate mechanism, this
serialization can pose a performance issue for hardware virtualization. An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.

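One way to guard against backwards reads is to remember the largest value
handed out so far and never return anything smaller. The sketch below assumes
C11 atomics; last_value and monotonic_filter() are illustrative names, and the
raw reading is passed in as a parameter rather than read from the TSC so that
the logic stays portable.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest reading returned to any caller so far (shared by all CPUs). */
static _Atomic uint64_t last_value;

/*
 * Return raw if it moves time forward, otherwise return the largest
 * value already observed.  Callers on different CPUs therefore never
 * see a result smaller than one that was returned earlier.
 */
uint64_t monotonic_filter(uint64_t raw)
{
	uint64_t last = atomic_load(&last_value);

	while (raw > last) {
		/* On failure, last is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&last_value, &last, raw))
			return raw;
	}
	return last;
}
```

Feeding the readings 100, 90, 150 through monotonic_filter() yields 100, 100,
150. The price is a shared variable that every CPU contends on, so a filter
like this trades some of the TSC's speed for ordering.
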
4.3) Timespec aliasing

Additionally, this lack of serialization from the TSC poses another challenge
when TSC results are measured against another time source. As the TSC is much
higher precision, many possible values of the TSC may be read while another
clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value. Due to non-serialized reads, you may actually end up with a range which
fluctuates - from (T-1 .. T+10). Thus, any time calculated from a TSC, but
calibrated against an external value, may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

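The backwards step caused by recalibration can be reproduced with a few lines
of arithmetic. The sketch below is illustrative only: struct calib and its
helpers are hypothetical, one TSC cycle is assumed to equal one nanosecond,
and recalibrate_monotonic() is just one possible mitigation, not something
this document prescribes.

```c
#include <stdint.h>

/* Hypothetical calibration record: a TSC value paired with a coarse clock. */
struct calib {
	uint64_t base_tsc;	/* TSC reading taken at calibration */
	uint64_t base_ns;	/* coarse clock value at calibration */
};

/* Time derived from the TSC, assuming 1 cycle == 1 ns for simplicity. */
uint64_t calib_now(const struct calib *c, uint64_t tsc)
{
	return c->base_ns + (tsc - c->base_tsc);
}

/* Naive recalibration: trusts the coarse clock outright. */
void recalibrate_naive(struct calib *c, uint64_t tsc, uint64_t coarse_ns)
{
	c->base_tsc = tsc;
	c->base_ns = coarse_ns;
}

/*
 * Careful recalibration: the new base never falls below the time that
 * the old calibration already reported for this TSC value.
 */
void recalibrate_monotonic(struct calib *c, uint64_t tsc, uint64_t coarse_ns)
{
	uint64_t already = calib_now(c, tsc);

	c->base_ns = coarse_ns > already ? coarse_ns : already;
	c->base_tsc = tsc;
}
```

Suppose the coarse clock reads 4000000 ns both at TSC = 1000 and at TSC = 1010.
Before recalibration, calib_now() at TSC = 1010 reports 4000010. After
recalibrate_naive() at TSC = 1010 it reports 4000000, so computed time has
moved back by 10 ns. recalibrate_monotonic() keeps the result at 4000010.
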
4.4) Migration

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which the guest time may need to be caught up. NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based on the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers. In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual. A slower clock is less of a problem, as it can
always be caught up to the original rate. KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.

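The multiplier-and-offset conversion can be sketched as follows. This is a
simplified illustration, not the kvmclock ABI: struct guest_clock, its field
names and guest_clock_read() are hypothetical. The multiplier is assumed to be
a 32.32 fixed-point number of nanoseconds per (pre-scaled) cycle.

```c
#include <stdint.h>

/* Hypothetical parameter block published by the host; illustrative only. */
struct guest_clock {
	uint64_t tsc_base;	/* TSC value when the parameters were written */
	uint64_t ns_base;	/* guest time in ns at tsc_base */
	uint32_t mul;		/* 32.32 fixed-point ns-per-cycle multiplier */
	int8_t shift;		/* power-of-two pre-scale applied to the delta */
};

/* Convert a raw TSC reading into guest nanoseconds. */
uint64_t guest_clock_read(const struct guest_clock *gc, uint64_t tsc)
{
	uint64_t delta = tsc - gc->tsc_base;

	if (gc->shift >= 0)
		delta <<= gc->shift;
	else
		delta >>= -gc->shift;

	/*
	 * Compute (delta * mul) >> 32 in two halves so that the
	 * intermediate products fit in 64 bits for realistic deltas.
	 */
	return gc->ns_base + (delta >> 32) * gc->mul +
	       (((delta & 0xffffffffULL) * gc->mul) >> 32);
}
```

On a 2 GHz host, mul = 0x80000000 with shift = 0 turns 2000 cycles into 1000
ns. After migration to a 1 GHz host, the hypervisor only has to publish new
parameters (for example mul = 0x80000000 with shift = 1, so that 200 cycles
become 200 ns) along with fresh base values; the guest's conversion code does
not change.
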
4.5) Scheduling

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization. In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switches), this may not always
be the case. The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

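The accounting behind such a scheduler clock can be sketched as below. The
structure and function names are hypothetical and do not correspond to any
particular hypervisor interface; the host is assumed to call the sched_in and
sched_out hooks whenever the virtual CPU gains or loses a physical CPU.

```c
#include <stdint.h>

/* Hypothetical host-side accounting for one virtual CPU. */
struct vcpu_runtime {
	uint64_t run_ns;	/* total time actually spent executing */
	uint64_t in_ns;		/* host time of the last schedule-in */
	int running;		/* nonzero while the vCPU is on a host CPU */
};

void vcpu_sched_in(struct vcpu_runtime *v, uint64_t host_now_ns)
{
	v->in_ns = host_now_ns;
	v->running = 1;
}

void vcpu_sched_out(struct vcpu_runtime *v, uint64_t host_now_ns)
{
	v->run_ns += host_now_ns - v->in_ns;
	v->running = 0;
}

/* Value a paravirtualized scheduler clock would report to the guest. */
uint64_t vcpu_sched_clock(const struct vcpu_runtime *v, uint64_t host_now_ns)
{
	uint64_t ns = v->run_ns;

	if (v->running)
		ns += host_now_ns - v->in_ns;
	return ns;
}
```

A vCPU that runs from host time 100 to 200, is descheduled until 500, and then
runs again reports 130 at host time 530, although 430 ns of real time have
passed. A guest scheduler using this clock does not charge its tasks for time
in which the virtual CPU was not running at all.
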
4.6) Watchdogs

Watchdog timers, such as the lock detector in Linux, may fire accidentally when
running under hardware virtualization, either because timer interrupts are
delayed or because the passage of real time is misinterpreted. Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7) Delays and precision timing

Precise timing and delays may not be possible in a virtualized system. This
matters if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices. The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue. In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8) Covert channels and leaks

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time. This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel. Preventing such
problems would require completely isolated virtual time which may not track
real time any longer. This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.