Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
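
Illustrative userspace usage (not part of this merge): a minimal sketch of
driving the new hva-based shared_info attribute, assuming only the uapi
additions shown in the diff below; the helper name and error handling are
hypothetical.

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Map the Xen shared_info page by host virtual address.  KVM tracks
     * the hva, so a guest remap of the overlay page needs no VMM action. */
    static int set_shared_info_hva(int vm_fd, void *shinfo)
    {
            struct kvm_xen_hvm_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.type = KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA;
            attr.u.shared_info.hva = (__u64)(unsigned long)shinfo;

            return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr);
    }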
bonzini committed Mar 11, 2024
2 parents e9025cd + 7a36d68 commit e9a2bba
Showing 16 changed files with 601 additions and 268 deletions.
53 changes: 41 additions & 12 deletions Documentation/virt/kvm/api.rst
@@ -372,7 +372,7 @@ The bits in the dirty bitmap are cleared before the ioctl returns, unless
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information,
see the description of the capability.

-Note that the Xen shared info page, if configured, shall always be assumed
+Note that the Xen shared_info page, if configured, shall always be assumed
to be dirty. KVM will not explicitly mark it such.


@@ -5487,8 +5487,9 @@ KVM_PV_ASYNC_CLEANUP_PERFORM
__u8 long_mode;
__u8 vector;
__u8 runstate_update_flag;
-struct {
+union {
__u64 gfn;
+__u64 hva;
} shared_info;
struct {
__u32 send_port;
@@ -5516,19 +5517,20 @@ type values:

KVM_XEN_ATTR_TYPE_LONG_MODE
Sets the ABI mode of the VM to 32-bit or 64-bit (long mode). This
-determines the layout of the shared info pages exposed to the VM.
+determines the layout of the shared_info page exposed to the VM.

KVM_XEN_ATTR_TYPE_SHARED_INFO
-Sets the guest physical frame number at which the Xen "shared info"
+Sets the guest physical frame number at which the Xen shared_info
page resides. Note that although Xen places vcpu_info for the first
32 vCPUs in the shared_info page, KVM does not automatically do so
-and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
-explicitly even when the vcpu_info for a given vCPU resides at the
-"default" location in the shared_info page. This is because KVM may
-not be aware of the Xen CPU id which is used as the index into the
-vcpu_info[] array, so may know the correct default location.
-
-Note that the shared info page may be constantly written to by KVM;
+and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA be used explicitly even when
+the vcpu_info for a given vCPU resides at the "default" location
+in the shared_info page. This is because KVM may not be aware of
+the Xen CPU id which is used as the index into the vcpu_info[]
+array, so may not know the correct default location.
+
+Note that the shared_info page may be constantly written to by KVM;
it contains the event channel bitmap used to deliver interrupts to
a Xen guest, amongst other things. It is exempt from dirty tracking
mechanisms — KVM will not explicitly mark the page as dirty each
@@ -5537,9 +5539,21 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO
any vCPU has been running or any event channel interrupts can be
routed to the guest.

-Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared info
+Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared_info
page.

+KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA
+If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+Xen capabilities, then this attribute may be used to set the
+userspace address at which the shared_info page resides, which
+will always be fixed in the VMM regardless of where it is mapped
+in guest physical address space. This attribute should be used in
+preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids
+unnecessary invalidation of an internal cache when the page is
+re-mapped in guest physical address space.
+
+Setting the hva to zero will disable the shared_info page.

KVM_XEN_ATTR_TYPE_UPCALL_VECTOR
Sets the exception vector used to deliver Xen event channel upcalls.
This is the HVM-wide vector injected directly by the hypervisor
@@ -5636,6 +5650,21 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
on dirty logging. Setting the gpa to KVM_XEN_INVALID_GPA will disable
the vcpu_info.

+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA
+If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+Xen capabilities, then this attribute may be used to set the
+userspace address of the vcpu_info for a given vCPU. It should
+only be used when the vcpu_info resides at the "default" location
+in the shared_info page. In this case it is safe to assume the
+userspace address will not change, because the shared_info page is
+an overlay on guest memory and remains at a fixed host address
+regardless of where it is mapped in guest physical address space,
+and hence unnecessary invalidation of an internal cache may be
+avoided if the guest memory layout is modified.
+If the vcpu_info does not reside at the "default" location then
+it is not guaranteed to remain at the same host address and
+hence the aforementioned cache invalidation is required.

KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
Sets the guest physical address of an additional pvclock structure
for a given vCPU. This is typically used for guest vsyscall support.
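Illustrative userspace usage (not part of this merge): a sketch of the
per-vCPU counterpart described above, for a vcpu_info at the "default"
location in the shared_info page. The 64-byte vcpu_info stride matches the
64-bit Xen ABI layout; the helper name and offset arithmetic are assumptions.

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int set_vcpu_info_hva(int vcpu_fd, void *shinfo, unsigned int xen_vcpu_id)
    {
            struct kvm_xen_vcpu_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA;
            /* vcpu_info[xen_vcpu_id] at the head of shared_info (64-bit ABI) */
            attr.u.hva = (__u64)(unsigned long)shinfo + xen_vcpu_id * 64;

            return ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &attr);
    }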
2 changes: 1 addition & 1 deletion arch/s390/kvm/diag.c
@@ -102,7 +102,7 @@ static int __diag_page_ref_service(struct kvm_vcpu *vcpu)
parm.token_addr & 7 || parm.zarch != 0x8000000000000000ULL)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);

-if (kvm_is_error_gpa(vcpu->kvm, parm.token_addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, parm.token_addr))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);

vcpu->arch.pfault_token = parm.token_addr;
14 changes: 7 additions & 7 deletions arch/s390/kvm/gaccess.c
@@ -664,7 +664,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION1: {
union region1_table_entry rfte;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rfte.val))
return -EFAULT;
@@ -682,7 +682,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION2: {
union region2_table_entry rste;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rste.val))
return -EFAULT;
@@ -700,7 +700,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION3: {
union region3_table_entry rtte;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rtte.val))
return -EFAULT;
@@ -728,7 +728,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_SEGMENT: {
union segment_table_entry ste;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &ste.val))
return -EFAULT;
@@ -748,7 +748,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
ptr = ste.fc0.pto * (PAGE_SIZE / 2) + vaddr.px * 8;
}
}
-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &pte.val))
return -EFAULT;
@@ -770,7 +770,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
*prot = PROT_TYPE_IEP;
return PGM_PROTECTION;
}
-if (kvm_is_error_gpa(vcpu->kvm, raddr.addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, raddr.addr))
return PGM_ADDRESSING;
*gpa = raddr.addr;
return 0;
@@ -957,7 +957,7 @@ static int guest_range_to_gpas(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
return rc;
} else {
gpa = kvm_s390_real_to_abs(vcpu, ga);
-if (kvm_is_error_gpa(vcpu->kvm, gpa)) {
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, gpa)) {
rc = PGM_ADDRESSING;
prot = PROT_NONE;
}
4 changes: 2 additions & 2 deletions arch/s390/kvm/kvm-s390.c
@@ -2878,7 +2878,7 @@ static int kvm_s390_vm_mem_op_abs(struct kvm *kvm, struct kvm_s390_mem_op *mop)

srcu_idx = srcu_read_lock(&kvm->srcu);

-if (kvm_is_error_gpa(kvm, mop->gaddr)) {
+if (!kvm_is_gpa_in_memslot(kvm, mop->gaddr)) {
r = PGM_ADDRESSING;
goto out_unlock;
}
@@ -2940,7 +2940,7 @@ static int kvm_s390_vm_mem_op_cmpxchg(struct kvm *kvm, struct kvm_s390_mem_op *m

srcu_idx = srcu_read_lock(&kvm->srcu);

-if (kvm_is_error_gpa(kvm, mop->gaddr)) {
+if (!kvm_is_gpa_in_memslot(kvm, mop->gaddr)) {
r = PGM_ADDRESSING;
goto out_unlock;
}
4 changes: 2 additions & 2 deletions arch/s390/kvm/priv.c
@@ -149,7 +149,7 @@ static int handle_set_prefix(struct kvm_vcpu *vcpu)
* first page, since address is 8k aligned and memory pieces are always
* at least 1MB aligned and have at least a size of 1MB.
*/
-if (kvm_is_error_gpa(vcpu->kvm, address))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, address))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);

kvm_s390_set_prefix(vcpu, address);
@@ -464,7 +464,7 @@ static int handle_test_block(struct kvm_vcpu *vcpu)
return kvm_s390_inject_prog_irq(vcpu, &vcpu->arch.pgm);
addr = kvm_s390_real_to_abs(vcpu, addr);

-if (kvm_is_error_gpa(vcpu->kvm, addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, addr))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
/*
* We don't expect errors on modern systems, and do not care
2 changes: 1 addition & 1 deletion arch/s390/kvm/sigp.c
@@ -172,7 +172,7 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, struct kvm_vcpu *dst_vcpu,
* first page, since address is 8k aligned and memory pieces are always
* at least 1MB aligned and have at least a size of 1MB.
*/
-if (kvm_is_error_gpa(vcpu->kvm, irq.u.prefix.address)) {
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, irq.u.prefix.address)) {
*reg &= 0xffffffff00000000UL;
*reg |= SIGP_STATUS_INVALID_PARAMETER;
return SIGP_CC_STATUS_STORED;
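The s390 hunks above are a mechanical conversion: the negatively named
kvm_is_error_gpa() is replaced by kvm_is_gpa_in_memslot(), with the sense
inverted at every call site. A sketch of the intended semantics, assuming
the helper remains a thin wrapper around the existing gfn-to-hva lookup
(the exact upstream implementation may differ):

    /* A gpa is backed by a memslot iff it translates to a valid hva. */
    static inline bool kvm_is_gpa_in_memslot(struct kvm *kvm, gpa_t gpa)
    {
            unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));

            return !kvm_is_error_hva(hva);
    }

    /* i.e. old kvm_is_error_gpa(kvm, gpa) == !kvm_is_gpa_in_memslot(kvm, gpa) */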
9 changes: 8 additions & 1 deletion arch/x86/include/uapi/asm/kvm.h
@@ -549,6 +549,7 @@ struct kvm_x86_mce {
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7)
+#define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA (1 << 8)

struct kvm_xen_hvm_config {
__u32 flags;
@@ -567,9 +568,10 @@ struct kvm_xen_hvm_attr {
__u8 long_mode;
__u8 vector;
__u8 runstate_update_flag;
-struct {
+union {
__u64 gfn;
#define KVM_XEN_INVALID_GFN ((__u64)-1)
+__u64 hva;
} shared_info;
struct {
__u32 send_port;
@@ -611,13 +613,16 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4
/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */
#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
+#define KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA 0x6

struct kvm_xen_vcpu_attr {
__u16 type;
__u16 pad[3];
union {
__u64 gpa;
#define KVM_XEN_INVALID_GPA ((__u64)-1)
+__u64 hva;
__u64 pad[8];
struct {
__u64 state;
@@ -648,6 +653,8 @@ struct kvm_xen_vcpu_attr {
#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6
#define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7
#define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
+#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA 0x9

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
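Before using either new attribute, userspace is expected to gate on the new
capability bit. A short sketch (the helper name is illustrative; the
KVM_CHECK_EXTENSION behavior of returning a flag mask for KVM_CAP_XEN_HVM
is the existing, documented one):

    #include <linux/kvm.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>

    static bool xen_shared_info_hva_supported(int kvm_fd)
    {
            int caps = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM);

            return caps > 0 && (caps & KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA);
    }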
5 changes: 4 additions & 1 deletion arch/x86/kvm/lapic.c
@@ -41,6 +41,7 @@
#include "ioapic.h"
#include "trace.h"
#include "x86.h"
#include "xen.h"
#include "cpuid.h"
#include "hyperv.h"
#include "smm.h"
@@ -502,8 +503,10 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
}

/* Check if there are APF page ready requests pending */
-if (enabled)
+if (enabled) {
kvm_make_request(KVM_REQ_APF_READY, apic->vcpu);
+kvm_xen_sw_enable_lapic(apic->vcpu);
+}
}

static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
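The new kvm_xen_sw_enable_lapic() hook is what re-delivers an upcall that
would otherwise be lost while the APIC was software-disabled. One plausible
shape for it, using only symbols already exported via xen.h; this is an
assumption, the actual implementation lives in xen.c and may differ:

    /* Sketch: on APIC software-enable, re-deliver a pending Xen upcall. */
    void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu)
    {
            if (kvm_xen_hypercall_enabled(vcpu->kvm) &&
                vcpu->arch.xen.upcall_vector && kvm_xen_has_interrupt(vcpu))
                    kvm_make_request(KVM_REQ_EVENT, vcpu);
    }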
68 changes: 60 additions & 8 deletions arch/x86/kvm/x86.c
@@ -2854,7 +2854,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
return v * clock->mult;
}

-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtod->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -2873,6 +2877,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
return mode;
}

+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+unsigned long seq;
+int mode;
+u64 ns;
+
+do {
+seq = read_seqcount_begin(&gtod->seq);
+ns = gtod->clock.base_cycles;
+ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+ns >>= gtod->clock.shift;
+ns += ktime_to_ns(gtod->clock.offset);
+} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+*t = ns;
+
+return mode;
+}

static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2894,18 +2921,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
return mode;
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it did so. Returns true if host is
+ * using TSC based clocksource.
+ */
static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
{
/* checked again under seqlock below */
if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
return false;

-return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-tsc_timestamp));
+return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+tsc_timestamp));
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+/* checked again under seqlock below */
+if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+return false;
+
+return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+tsc_timestamp));
+}

+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
u64 *tsc_timestamp)
{
@@ -3152,7 +3203,7 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,

guest_hv_clock->version = ++vcpu->hv_clock.version;

-mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
+kvm_gpc_mark_dirty_in_slot(gpc);
read_unlock_irqrestore(&gpc->lock, flags);

trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);
@@ -4674,7 +4725,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
KVM_XEN_HVM_CONFIG_SHARED_INFO |
KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
-KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
+KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE |
+KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA;
if (sched_info_on())
r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
@@ -12027,7 +12079,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.regs_avail = ~0;
vcpu->arch.regs_dirty = ~0;

-kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm, vcpu, KVM_HOST_USES_PFN);
+kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);

if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
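kvm_get_monotonic_and_clockread() is made non-static so that the Xen timer
code can derive the host and guest view of "now" from a single TSC sample
when arming a timer. A hedged sketch of that use; get_guest_now_ns() is a
hypothetical stand-in for the pvclock scaling KVM actually applies to
host_tsc, and only the vcpu->arch.xen.timer field follows the existing
layout:

    static void xen_timer_arm_sketch(struct kvm_vcpu *vcpu, u64 guest_abs_ns)
    {
            s64 kernel_now;
            u64 host_tsc;

            if (kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
                    /* Guest and host "now" come from the same TSC sample,
                     * avoiding the skew of two separate clock reads. */
                    u64 guest_now = get_guest_now_ns(vcpu, host_tsc);
                    s64 delta = guest_abs_ns - guest_now;

                    hrtimer_start(&vcpu->arch.xen.timer,
                                  ns_to_ktime(kernel_now + delta),
                                  HRTIMER_MODE_ABS_HARD);
            }
    }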
