1. 23 Feb, 2019 1 commit
    • Jann Horn's avatar
      kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · bc4db524
      Jann Horn authored
      commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream.
      
      kvm_ioctl_create_device() does the following:
      
      1. creates a device that holds a reference to the VM object (with a borrowed
         reference, the VM's refcount has not been bumped yet)
      2. initializes the device
      3. transfers the reference to the device to the caller's file descriptor table
      4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
         reference
      
      The ownership transfer in step 3 must not happen before the reference to the VM
      becomes a proper, non-borrowed reference, which only happens in step 4.
      After step 3, an attacker can close the file descriptor and drop the borrowed
      reference, which can cause the refcount of the kvm object to drop to zero.
      
      This means that we need to grab a reference for the device before
      anon_inode_getfd(), otherwise the VM can disappear from under us.
      
      Fixes: 852b6d57
      
       ("kvm: add device control API")
      Cc: stable@kernel.org
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc4db524
  2. 17 Dec, 2018 1 commit
  3. 11 Mar, 2018 1 commit
    • Wanpeng Li's avatar
      KVM: mmu: Fix overlap between public and private memslots · 42f587fa
      Wanpeng Li authored
      commit b28676bb
      
       upstream.
      
      Reported by syzkaller:
      
          pte_list_remove: ffff9714eb1f8078 0->BUG
          ------------[ cut here ]------------
          kernel BUG at arch/x86/kvm/mmu.c:1157!
          invalid opcode: 0000 [#1] SMP
          RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
          Call Trace:
           drop_spte+0x83/0xb0 [kvm]
           mmu_page_zap_pte+0xcc/0xe0 [kvm]
           kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
           kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
           kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
           kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
           ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
           __mmu_notifier_release+0x79/0x110
           ? __mmu_notifier_release+0x5/0x110
           exit_mmap+0x15a/0x170
           ? do_exit+0x281/0xcb0
           mmput+0x66/0x160
           do_exit+0x2c9/0xcb0
           ? __context_tracking_exit.part.5+0x4a/0x150
           do_group_exit+0x50/0xd0
           SyS_exit_group+0x14/0x20
           do_syscall_64+0x73/0x1f0
           entry_SYSCALL64_slow_path+0x25/0x25
      
      The reason is that when creates new memslot, there is no guarantee for new
      memslot not overlap with private memslots. This can be triggered by the
      following program:
      
         #include <fcntl.h>
         #include <pthread.h>
         #include <setjmp.h>
         #include <signal.h>
         #include <stddef.h>
         #include <stdint.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <sys/stat.h>
         #include <sys/syscall.h>
         #include <sys/types.h>
         #include <unistd.h>
         #include <linux/kvm.h>
      
         long r[16];
      
         int main()
         {
      	void *p = valloc(0x4000);
      
      	r[2] = open("/dev/kvm", 0);
      	r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
      
      	uint64_t addr = 0xf000;
      	ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
      	r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
      	ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
      	ioctl(r[6], KVM_RUN, 0);
      	ioctl(r[6], KVM_RUN, 0);
      
      	struct kvm_userspace_memory_region mr = {
      		.slot = 0,
      		.flags = KVM_MEM_LOG_DIRTY_PAGES,
      		.guest_phys_addr = 0xf000,
      		.memory_size = 0x4000,
      		.userspace_addr = (uintptr_t) p
      	};
      	ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
      	return 0;
         }
      
      This patch fixes the bug by not adding a new memslot even if it
      overlaps with private memslots.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      42f587fa
  4. 25 Dec, 2017 1 commit
  5. 08 Apr, 2017 2 commits
  6. 20 Aug, 2016 1 commit
    • Jim Mattson's avatar
      KVM: nVMX: Fix memory corruption when using VMCS shadowing · 144941bd
      Jim Mattson authored
      commit 2f1fe811
      
       upstream.
      
      When freeing the nested resources of a vcpu, there is an assumption that
      the vcpu's vmcs01 is the current VMCS on the CPU that executes
      nested_release_vmcs12(). If this assumption is violated, the vcpu's
      vmcs01 may be made active on multiple CPUs at the same time, in
      violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
      not VMCLEARed on every CPU on which it is active, it can linger in a
      CPU's VMCS cache after it has been freed and potentially
      repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
      miss can result in memory corruption.
      
      It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
      the vcpu in question was last loaded on a different CPU, it must be
      migrated to the current CPU before calling vmx_load_vmcs01().
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      144941bd
  7. 27 Jul, 2016 1 commit
    • Xiubo Li's avatar
      kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES · 54f87e16
      Xiubo Li authored
      commit caf1ff26
      
       upstream.
      
      These days, we experienced one guest crash with 8 cores and 3 disks,
      with qemu error logs as bellow:
      
      qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
      kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
      
      And then we found one patch(bdf026317d) in qemu tree, which said
      could fix this bug.
      
      Execute the following script will reproduce the BUG quickly:
      
      irq_affinity.sh
      ========================================================================
      
      vda_irq_num=25
      vdb_irq_num=27
      while [ 1 ]
      do
          for irq in {1,2,4,8,10,20,40,80}
              do
                  echo $irq > /proc/irq/$vda_irq_num/smp_affinity
                  echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
                  dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
                  dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
              done
      done
      ========================================================================
      
      The following qemu log is added in the qemu code and is displayed when
      this bug reproduced:
      
      kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
      irq_routes->nr: 1024, gsi_count: 1024.
      
      That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
      but in the kernel code when routes->nr >= 1024, will just return -EINVAL;
      
      The nr is the number of the routing entries which is in of
      [1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].
      
      This patch fix the BUG above.
      Signed-off-by: default avatarXiubo Li <lixiubo@cmss.chinamobile.com>
      Signed-off-by: default avatarWei Tang <tangwei@cmss.chinamobile.com>
      Signed-off-by: default avatarZhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54f87e16
  8. 12 Apr, 2016 1 commit
    • Paolo Bonzini's avatar
      KVM: fix spin_lock_init order on x86 · 4e2fa4bb
      Paolo Bonzini authored
      commit e9ad4ec8
      
       upstream.
      
      Moving the initialization earlier is needed in 4.6 because
      kvm_arch_init_vm is now using mmu_lock, causing lockdep to
      complain:
      
      [  284.440294] INFO: trying to register non-static key.
      [  284.445259] the code is fine but needs lockdep annotation.
      [  284.450736] turning off the locking correctness validator.
      ...
      [  284.528318]  [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
      [  284.533733]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.541467]  [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
      [  284.546960]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.554707]  [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.562281]  [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
      [  284.568381]  [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
      [  284.574740]  [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
      
      However, it also helps fixing a preexisting problem, which is why this
      patch is also good for stable kernels: kvm_create_vm was incrementing
      current->mm->mm_count but not decrementing it at the out_err label (in
      case kvm_init_mmu_notifier failed).  The new initialization order makes
      it possible to add the required mmdrop without adding a new error label.
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e2fa4bb
  9. 16 Mar, 2016 1 commit
  10. 22 Oct, 2015 1 commit
  11. 01 Oct, 2015 3 commits
  12. 25 Sep, 2015 1 commit
  13. 16 Sep, 2015 1 commit
    • Paolo Bonzini's avatar
      KVM: add halt_attempted_poll to VCPU stats · 62bea5bf
      Paolo Bonzini authored
      
      
      This new statistic can help diagnosing VCPUs that, for any reason,
      trigger bad behavior of halt_poll_ns autotuning.
      
      For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
      like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
      10+20+40+80+160+320+480 = 1110 microseconds out of every
      479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
      is consuming about 30% more CPU than it would use without
      polling.  This would show as an abnormally high number of
      attempted polling compared to the successful polls.
      
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      62bea5bf
  14. 15 Sep, 2015 1 commit
  15. 14 Sep, 2015 1 commit
  16. 10 Sep, 2015 1 commit
    • Vladimir Davydov's avatar
      mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov authored
      
      
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate wss of KVM VMs.  At the same time we want to avoid flushing
      tlb, because it is quite expensive and it won't really affect the final
      result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: default avatarAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d7715c6
  17. 06 Sep, 2015 3 commits
    • Wanpeng Li's avatar
      KVM: trace kvm_halt_poll_ns grow/shrink · 2cbd7824
      Wanpeng Li authored
      
      
      Tracepoint for dynamic halt_pool_ns, fired on every potential change.
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2cbd7824
    • Wanpeng Li's avatar
      KVM: dynamic halt-polling · aca6ff29
      Wanpeng Li authored
      
      
      There is a downside of always-poll since poll is still happened for idle
      vCPUs which can waste cpu usage. This patchset add the ability to adjust
      halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
      and to shrink halt_poll_ns when long halt is detected.
      
      There are two new kernel parameters for changing the halt_poll_ns:
      halt_poll_ns_grow and halt_poll_ns_shrink.
      
                              no-poll      always-poll    dynamic-poll
      -----------------------------------------------------------------------
      Idle (nohz) vCPU %c0     0.15%        0.3%            0.2%
      Idle (250HZ) vCPU %c0    1.1%         4.6%~14%        1.2%
      TCP_RR latency           34us         27us            26.7us
      
      "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
      c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
      guest was tickless. (250HZ) means the guest was ticking at 250HZ.
      
      The big win is with ticking operating systems. Running the linux guest
      with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
      to no-polling overhead levels by using the dynamic-poll. The savings
      should be even higher for higher frequency ticks.
      Suggested-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      [Simplify the patch. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      aca6ff29
    • Wanpeng Li's avatar
      KVM: make halt_poll_ns per-vCPU · 19020f8a
      Wanpeng Li authored
      
      
      Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
      to allow greater flexibility.
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      19020f8a
  18. 30 Jul, 2015 1 commit
  19. 29 Jul, 2015 1 commit
  20. 03 Jul, 2015 1 commit
  21. 05 Jun, 2015 2 commits
    • Paolo Bonzini's avatar
      KVM: implement multiple address spaces · f481b069
      Paolo Bonzini authored
      
      
      Only two ioctls have to be modified; the address space id is
      placed in the higher 16 bits of their slot id argument.
      
      As of this patch, no architecture defines more than one
      address space; x86 will be the first.
      Reviewed-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f481b069
    • Paolo Bonzini's avatar
      KVM: add vcpu-specific functions to read/write/translate GFNs · 8e73485c
      Paolo Bonzini authored
      
      
      We need to hide SMRAM from guests not running in SMM.  Therefore, all
      uses of kvm_read_guest* and kvm_write_guest* must be changed to use
      different address spaces, depending on whether the VCPU is in system
      management mode.  We need to introduce a new family of functions for
      this purpose.
      
      For now, the VCPU-based functions have the same behavior as the
      existing per-VM ones, they just accept a different type for the
      first argument.  Later however they will be changed to use one of many
      "struct kvm_memslots" stored in struct kvm, through an architecture hook.
      VM-based functions will unconditionally use the first memslots pointer.
      
      Whenever possible, this patch introduces slot-based functions with an
      __ prefix, with two wrappers for generic and vcpu-based actions.
      The exceptions are kvm_read_guest and kvm_write_guest, which are copied
      into the new functions kvm_vcpu_read_guest and kvm_vcpu_write_guest.
      Reviewed-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8e73485c
  22. 28 May, 2015 4 commits
  23. 26 May, 2015 4 commits
  24. 19 May, 2015 1 commit
    • Paolo Bonzini's avatar
      KVM: export __gfn_to_pfn_memslot, drop gfn_to_pfn_async · 3520469d
      Paolo Bonzini authored
      
      
      gfn_to_pfn_async is used in just one place, and because of x86-specific
      treatment that place will need to look at the memory slot.  Hence inline
      it into try_async_pf and export __gfn_to_pfn_memslot.
      
      The patch also switches the subsequent call to gfn_to_pfn_prot to use
      __gfn_to_pfn_memslot.  This is a small optimization.  Finally, remove
      the now-unused async argument of __gfn_to_pfn.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3520469d
  25. 08 May, 2015 2 commits
  26. 21 Apr, 2015 1 commit
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT · e23a808b
      Paul Mackerras authored
      
      
      This creates a debugfs directory for each HV guest (assuming debugfs
      is enabled in the kernel config), and within that directory, a file
      by which the contents of the guest's HPT (hashed page table) can be
      read.  The directory is named vmnnnn, where nnnn is the PID of the
      process that created the guest.  The file is named "htab".  This is
      intended to help in debugging problems in the host's management
      of guest memory.
      
      The contents of the file consist of a series of lines like this:
      
        3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
      
      The first field is the index of the entry in the HPT, the second and
      third are the HPT entry, so the third entry contains the real page
      number that is mapped by the entry if the entry's valid bit is set.
      The fourth field is the guest's view of the second doubleword of the
      entry, so it contains the guest physical address.  (The format of the
      second through fourth fields are described in the Power ISA and also
      in arch/powerpc/include/asm/mmu-hash64.h.)
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      e23a808b
  27. 10 Apr, 2015 1 commit