Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
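
Illustrative userspace usage (not part of this merge): a minimal sketch of
driving the new hva-based shared_info attribute, assuming only the uapi
additions shown in the diff below; the helper name and error handling are
hypothetical.

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Map the Xen shared_info page by host virtual address.  KVM tracks
     * the hva, so a guest remap of the overlay page needs no VMM action. */
    static int set_shared_info_hva(int vm_fd, void *shinfo)
    {
            struct kvm_xen_hvm_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.type = KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA;
            attr.u.shared_info.hva = (__u64)(unsigned long)shinfo;

            return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &attr);
    }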
bonzini committed Mar 11, 2024
2 parents e9025cd + 7a36d68 commit e9a2bba
Showing 16 changed files with 601 additions and 268 deletions.
53 changes: 41 additions & 12 deletions Documentation/virt/kvm/api.rst
@@ -372,7 +372,7 @@ The bits in the dirty bitmap are cleared before the ioctl returns, unless
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information,
see the description of the capability.

-Note that the Xen shared info page, if configured, shall always be assumed
+Note that the Xen shared_info page, if configured, shall always be assumed
to be dirty. KVM will not explicitly mark it such.


@@ -5487,8 +5487,9 @@ KVM_PV_ASYNC_CLEANUP_PERFORM
__u8 long_mode;
__u8 vector;
__u8 runstate_update_flag;
-struct {
+union {
__u64 gfn;
+__u64 hva;
} shared_info;
struct {
__u32 send_port;
@@ -5516,19 +5517,20 @@ type values:

KVM_XEN_ATTR_TYPE_LONG_MODE
Sets the ABI mode of the VM to 32-bit or 64-bit (long mode). This
-determines the layout of the shared info pages exposed to the VM.
+determines the layout of the shared_info page exposed to the VM.

KVM_XEN_ATTR_TYPE_SHARED_INFO
-Sets the guest physical frame number at which the Xen "shared info"
+Sets the guest physical frame number at which the Xen shared_info
page resides. Note that although Xen places vcpu_info for the first
32 vCPUs in the shared_info page, KVM does not automatically do so
-and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO be used
-explicitly even when the vcpu_info for a given vCPU resides at the
-"default" location in the shared_info page. This is because KVM may
-not be aware of the Xen CPU id which is used as the index into the
-vcpu_info[] array, so may know the correct default location.
-
-Note that the shared info page may be constantly written to by KVM;
+and instead requires that KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO or
+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA be used explicitly even when
+the vcpu_info for a given vCPU resides at the "default" location
+in the shared_info page. This is because KVM may not be aware of
+the Xen CPU id which is used as the index into the vcpu_info[]
+array, so may not know the correct default location.
+
+Note that the shared_info page may be constantly written to by KVM;
it contains the event channel bitmap used to deliver interrupts to
a Xen guest, amongst other things. It is exempt from dirty tracking
mechanisms — KVM will not explicitly mark the page as dirty each
@@ -5537,9 +5539,21 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO
any vCPU has been running or any event channel interrupts can be
routed to the guest.

-Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared info
+Setting the gfn to KVM_XEN_INVALID_GFN will disable the shared_info
page.

+KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA
+If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+Xen capabilities, then this attribute may be used to set the
+userspace address at which the shared_info page resides, which
+will always be fixed in the VMM regardless of where it is mapped
+in guest physical address space. This attribute should be used in
+preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids
+unnecessary invalidation of an internal cache when the page is
+re-mapped in guest physical address space.
+
+Setting the hva to zero will disable the shared_info page.

KVM_XEN_ATTR_TYPE_UPCALL_VECTOR
Sets the exception vector used to deliver Xen event channel upcalls.
This is the HVM-wide vector injected directly by the hypervisor
@@ -5636,6 +5650,21 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
on dirty logging. Setting the gpa to KVM_XEN_INVALID_GPA will disable
the vcpu_info.

+KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA
+If the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag is also set in the
+Xen capabilities, then this attribute may be used to set the
+userspace address of the vcpu_info for a given vCPU. It should
+only be used when the vcpu_info resides at the "default" location
+in the shared_info page. In this case it is safe to assume the
+userspace address will not change, because the shared_info page is
+an overlay on guest memory and remains at a fixed host address
+regardless of where it is mapped in guest physical address space,
+and hence unnecessary invalidation of an internal cache may be
+avoided if the guest memory layout is modified.
+If the vcpu_info does not reside at the "default" location then
+it is not guaranteed to remain at the same host address and
+hence the aforementioned cache invalidation is required.

KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
Sets the guest physical address of an additional pvclock structure
for a given vCPU. This is typically used for guest vsyscall support.
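Illustrative userspace usage (not part of this merge): a sketch of the
per-vCPU counterpart described above, for a vcpu_info at the "default"
location in the shared_info page. The 64-byte vcpu_info stride matches the
64-bit Xen ABI layout; the helper name and offset arithmetic are assumptions.

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int set_vcpu_info_hva(int vcpu_fd, void *shinfo, unsigned int xen_vcpu_id)
    {
            struct kvm_xen_vcpu_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA;
            /* vcpu_info[xen_vcpu_id] at the head of shared_info (64-bit ABI) */
            attr.u.hva = (__u64)(unsigned long)shinfo + xen_vcpu_id * 64;

            return ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &attr);
    }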
2 changes: 1 addition & 1 deletion arch/s390/kvm/diag.c
@@ -102,7 +102,7 @@ static int __diag_page_ref_service(struct kvm_vcpu *vcpu)
parm.token_addr & 7 || parm.zarch != 0x8000000000000000ULL)
return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);

-if (kvm_is_error_gpa(vcpu->kvm, parm.token_addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, parm.token_addr))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);

vcpu->arch.pfault_token = parm.token_addr;
14 changes: 7 additions & 7 deletions arch/s390/kvm/gaccess.c
@@ -664,7 +664,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION1: {
union region1_table_entry rfte;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rfte.val))
return -EFAULT;
@@ -682,7 +682,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION2: {
union region2_table_entry rste;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rste.val))
return -EFAULT;
@@ -700,7 +700,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_REGION3: {
union region3_table_entry rtte;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &rtte.val))
return -EFAULT;
@@ -728,7 +728,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
case ASCE_TYPE_SEGMENT: {
union segment_table_entry ste;

-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &ste.val))
return -EFAULT;
@@ -748,7 +748,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
ptr = ste.fc0.pto * (PAGE_SIZE / 2) + vaddr.px * 8;
}
}
-if (kvm_is_error_gpa(vcpu->kvm, ptr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, ptr))
return PGM_ADDRESSING;
if (deref_table(vcpu->kvm, ptr, &pte.val))
return -EFAULT;
@@ -770,7 +770,7 @@ static unsigned long guest_translate(struct kvm_vcpu *vcpu, unsigned long gva,
*prot = PROT_TYPE_IEP;
return PGM_PROTECTION;
}
-if (kvm_is_error_gpa(vcpu->kvm, raddr.addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, raddr.addr))
return PGM_ADDRESSING;
*gpa = raddr.addr;
return 0;
@@ -957,7 +957,7 @@ static int guest_range_to_gpas(struct kvm_vcpu *vcpu, unsigned long ga, u8 ar,
return rc;
} else {
gpa = kvm_s390_real_to_abs(vcpu, ga);
-if (kvm_is_error_gpa(vcpu->kvm, gpa)) {
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, gpa)) {
rc = PGM_ADDRESSING;
prot = PROT_NONE;
}
4 changes: 2 additions & 2 deletions arch/s390/kvm/kvm-s390.c
@@ -2878,7 +2878,7 @@ static int kvm_s390_vm_mem_op_abs(struct kvm *kvm, struct kvm_s390_mem_op *mop)

srcu_idx = srcu_read_lock(&kvm->srcu);

-if (kvm_is_error_gpa(kvm, mop->gaddr)) {
+if (!kvm_is_gpa_in_memslot(kvm, mop->gaddr)) {
r = PGM_ADDRESSING;
goto out_unlock;
}
@@ -2940,7 +2940,7 @@ static int kvm_s390_vm_mem_op_cmpxchg(struct kvm *kvm, struct kvm_s390_mem_op *m

srcu_idx = srcu_read_lock(&kvm->srcu);

-if (kvm_is_error_gpa(kvm, mop->gaddr)) {
+if (!kvm_is_gpa_in_memslot(kvm, mop->gaddr)) {
r = PGM_ADDRESSING;
goto out_unlock;
}
4 changes: 2 additions & 2 deletions arch/s390/kvm/priv.c
@@ -149,7 +149,7 @@ static int handle_set_prefix(struct kvm_vcpu *vcpu)
* first page, since address is 8k aligned and memory pieces are always
* at least 1MB aligned and have at least a size of 1MB.
*/
-if (kvm_is_error_gpa(vcpu->kvm, address))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, address))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);

kvm_s390_set_prefix(vcpu, address);
@@ -464,7 +464,7 @@ static int handle_test_block(struct kvm_vcpu *vcpu)
return kvm_s390_inject_prog_irq(vcpu, &vcpu->arch.pgm);
addr = kvm_s390_real_to_abs(vcpu, addr);

-if (kvm_is_error_gpa(vcpu->kvm, addr))
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, addr))
return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
/*
* We don't expect errors on modern systems, and do not care
2 changes: 1 addition & 1 deletion arch/s390/kvm/sigp.c
@@ -172,7 +172,7 @@ static int __sigp_set_prefix(struct kvm_vcpu *vcpu, struct kvm_vcpu *dst_vcpu,
* first page, since address is 8k aligned and memory pieces are always
* at least 1MB aligned and have at least a size of 1MB.
*/
-if (kvm_is_error_gpa(vcpu->kvm, irq.u.prefix.address)) {
+if (!kvm_is_gpa_in_memslot(vcpu->kvm, irq.u.prefix.address)) {
*reg &= 0xffffffff00000000UL;
*reg |= SIGP_STATUS_INVALID_PARAMETER;
return SIGP_CC_STATUS_STORED;
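The s390 hunks above are a mechanical conversion: the negatively named
kvm_is_error_gpa() is replaced by kvm_is_gpa_in_memslot(), with the sense
inverted at every call site. A sketch of the intended semantics, assuming
the helper remains a thin wrapper around the existing gfn-to-hva lookup
(the exact upstream implementation may differ):

    /* A gpa is backed by a memslot iff it translates to a valid hva. */
    static inline bool kvm_is_gpa_in_memslot(struct kvm *kvm, gpa_t gpa)
    {
            unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));

            return !kvm_is_error_hva(hva);
    }

    /* i.e. old kvm_is_error_gpa(kvm, gpa) == !kvm_is_gpa_in_memslot(kvm, gpa) */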
9 changes: 8 additions & 1 deletion arch/x86/include/uapi/asm/kvm.h
@@ -549,6 +549,7 @@ struct kvm_x86_mce {
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5)
#define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6)
#define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7)
+#define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA (1 << 8)

struct kvm_xen_hvm_config {
__u32 flags;
@@ -567,9 +568,10 @@ struct kvm_xen_hvm_attr {
__u8 long_mode;
__u8 vector;
__u8 runstate_update_flag;
-struct {
+union {
__u64 gfn;
#define KVM_XEN_INVALID_GFN ((__u64)-1)
+__u64 hva;
} shared_info;
struct {
__u32 send_port;
@@ -611,13 +613,16 @@ struct kvm_xen_hvm_attr {
#define KVM_XEN_ATTR_TYPE_XEN_VERSION 0x4
/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG */
#define KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG 0x5
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
+#define KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA 0x6

struct kvm_xen_vcpu_attr {
__u16 type;
__u16 pad[3];
union {
__u64 gpa;
#define KVM_XEN_INVALID_GPA ((__u64)-1)
+__u64 hva;
__u64 pad[8];
struct {
__u64 state;
@@ -648,6 +653,8 @@ struct kvm_xen_vcpu_attr {
#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID 0x6
#define KVM_XEN_VCPU_ATTR_TYPE_TIMER 0x7
#define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR 0x8
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
+#define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA 0x9

/* Secure Encrypted Virtualization command */
enum sev_cmd_id {
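Before using either new attribute, userspace is expected to gate on the new
capability bit. A short sketch (the helper name is illustrative; the
KVM_CHECK_EXTENSION behavior of returning a flag mask for KVM_CAP_XEN_HVM
is the existing, documented one):

    #include <linux/kvm.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>

    static bool xen_shared_info_hva_supported(int kvm_fd)
    {
            int caps = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM);

            return caps > 0 && (caps & KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA);
    }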
5 changes: 4 additions & 1 deletion arch/x86/kvm/lapic.c
@@ -41,6 +41,7 @@
#include "ioapic.h"
#include "trace.h"
#include "x86.h"
#include "xen.h"
#include "cpuid.h"
#include "hyperv.h"
#include "smm.h"
@@ -502,8 +503,10 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
}

/* Check if there are APF page ready requests pending */
-if (enabled)
+if (enabled) {
kvm_make_request(KVM_REQ_APF_READY, apic->vcpu);
+kvm_xen_sw_enable_lapic(apic->vcpu);
+}
}

static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
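The new kvm_xen_sw_enable_lapic() hook is what re-delivers an upcall that
would otherwise be lost while the APIC was software-disabled. One plausible
shape for it, using only symbols already exported via xen.h; this is an
assumption, the actual implementation lives in xen.c and may differ:

    /* Sketch: on APIC software-enable, re-deliver a pending Xen upcall. */
    void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu)
    {
            if (kvm_xen_hypercall_enabled(vcpu->kvm) &&
                vcpu->arch.xen.upcall_vector && kvm_xen_has_interrupt(vcpu))
                    kvm_make_request(KVM_REQ_EVENT, vcpu);
    }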
68 changes: 60 additions & 8 deletions arch/x86/kvm/x86.c
@@ -2854,7 +2854,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
return v * clock->mult;
}

-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtod->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -2873,6 +2877,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
return mode;
}

+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+unsigned long seq;
+int mode;
+u64 ns;
+
+do {
+seq = read_seqcount_begin(&gtod->seq);
+ns = gtod->clock.base_cycles;
+ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+ns >>= gtod->clock.shift;
+ns += ktime_to_ns(gtod->clock.offset);
+} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+*t = ns;
+
+return mode;
+}

static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2894,18 +2921,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
return mode;
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it did so. Returns true if host is
+ * using TSC based clocksource.
+ */
static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
{
/* checked again under seqlock below */
if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
return false;

-return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-tsc_timestamp));
+return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+tsc_timestamp));
}

-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+/* checked again under seqlock below */
+if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+return false;
+
+return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+tsc_timestamp));
+}

+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
u64 *tsc_timestamp)
{
@@ -3152,7 +3203,7 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,

guest_hv_clock->version = ++vcpu->hv_clock.version;

-mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
+kvm_gpc_mark_dirty_in_slot(gpc);
read_unlock_irqrestore(&gpc->lock, flags);

trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);
@@ -4674,7 +4725,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
KVM_XEN_HVM_CONFIG_SHARED_INFO |
KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
-KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE;
+KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE |
+KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA;
if (sched_info_on())
r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
@@ -12027,7 +12079,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.regs_avail = ~0;
vcpu->arch.regs_dirty = ~0;

-kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm, vcpu, KVM_HOST_USES_PFN);
+kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);

if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
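kvm_get_monotonic_and_clockread() is made non-static so that the Xen timer
code can derive the host and guest view of "now" from a single TSC sample
when arming a timer. A hedged sketch of that use; get_guest_now_ns() is a
hypothetical stand-in for the pvclock scaling KVM actually applies to
host_tsc, and only the vcpu->arch.xen.timer field follows the existing
layout:

    static void xen_timer_arm_sketch(struct kvm_vcpu *vcpu, u64 guest_abs_ns)
    {
            s64 kernel_now;
            u64 host_tsc;

            if (kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
                    /* Guest and host "now" come from the same TSC sample,
                     * avoiding the skew of two separate clock reads. */
                    u64 guest_now = get_guest_now_ns(vcpu, host_tsc);
                    s64 delta = guest_abs_ns - guest_now;

                    hrtimer_start(&vcpu->arch.xen.timer,
                                  ns_to_ktime(kernel_now + delta),
                                  HRTIMER_MODE_ABS_HARD);
            }
    }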
