1. 16 Jul, 2016 2 commits
  2. 09 Jul, 2016 2 commits
  3. 02 Mar, 2016 1 commit
  4. 28 Feb, 2016 3 commits
  5. 15 Feb, 2016 1 commit
  6. 31 Jan, 2016 3 commits
    • Ulrich Weigand's avatar
      powerpc/module: Handle R_PPC64_ENTRY relocations · a33b8ff3
      Ulrich Weigand authored
      commit a61674bd
      
       upstream.
      
      GCC 6 will include changes to generated code with -mcmodel=large,
      which is used to build kernel modules on powerpc64le.  This was
      necessary because the large model is supposed to allow arbitrary
      sizes and locations of the code and data sections, but the ELFv2
      global entry point prolog still made the unconditional assumption
      that the TOC associated with any particular function can be found
      within 2 GB of the function entry point:
      
      func:
      	addis r2,r12,(.TOC.-func)@ha
      	addi  r2,r2,(.TOC.-func)@l
      	.localentry func, .-func
      
      To remove this assumption, GCC will now generate instead this global
      entry point prolog sequence when using -mcmodel=large:
      
      	.quad .TOC.-func
      func:
      	.reloc ., R_PPC64_ENTRY
      	ld    r2, -8(r12)
      	add   r2, r2, r12
      	.localentry func, .-func
      
      The new .reloc triggers an optimization in the linker that will
      replace this new prolog with the original code (see above) if the
      linker determines that the distance between .TOC. and func is in
      range after all.
      
      Since this new relocation is now present in module object files,
      the kernel module loader is required to handle them too.  This
      patch adds support for the new relocation and implements the
      same optimization done by the GNU linker.
      Signed-off-by: default avatarUlrich Weigand <ulrich.weigand@de.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a33b8ff3
    • Michael Neuling's avatar
      powerpc/tm: Check for already reclaimed tasks · a54d3a42
      Michael Neuling authored
      commit 7f821fc9 upstream.
      
      Currently we can hit a scenario where we'll tm_reclaim() twice.  This
      results in a TM bad thing exception because the second reclaim occurs
      when not in suspend mode.
      
      The scenario in which this can happen is the following.  We attempt to
      deliver a signal to userspace.  To do this we need obtain the stack
      pointer to write the signal context.  To get this stack pointer we
      must tm_reclaim() in case we need to use the checkpointed stack
      pointer (see get_tm_stackpointer()).  Normally we'd then return
      directly to userspace to deliver the signal without going through
      __switch_to().
      
      Unfortunatley, if at this point we get an error (such as a bad
      userspace stack pointer), we need to exit the process.  The exit will
      result in a __switch_to().  __switch_to() will attempt to save the
      process state which results in another tm_reclaim().  This
      tm_reclaim() now causes a TM Bad Thing exception as this state has
      already been saved and the processor is no longer in TM suspend mode.
      Whee!
      
      This patch checks the state of the MSR to ensure we are TM suspended
      before we attempt the tm_reclaim().  If we've already saved the state
      away, we should no longer be in TM suspend mode.  This has the
      additional advantage of checking for a potential TM Bad Thing
      exception.
      
      Found using syscall fuzzer.
      
      Fixes: fb09692e
      
       ("powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes")
      Signed-off-by: default avatarMichael Neuling <mikey@neuling.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a54d3a42
    • Michael Neuling's avatar
      powerpc/tm: Block signal return setting invalid MSR state · 567a215d
      Michael Neuling authored
      commit d2b9d2a5 upstream.
      
      Currently we allow both the MSR T and S bits to be set by userspace on
      a signal return.  Unfortunately this is a reserved configuration and
      will cause a TM Bad Thing exception if attempted (via rfid).
      
      This patch checks for this case in both the 32 and 64 bit signals
      code.  If both T and S are set, we mark the context as invalid.
      
      Found using a syscall fuzzer.
      
      Fixes: 2b0a576d
      
       ("powerpc: Add new transactional memory state to the signal context")
      Signed-off-by: default avatarMichael Neuling <mikey@neuling.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      567a215d
  7. 09 Nov, 2015 1 commit
    • Vasant Hegde's avatar
      powerpc/rtas: Validate rtas.entry before calling enter_rtas() · 29589707
      Vasant Hegde authored
      commit 8832317f upstream.
      
      Currently we do not validate rtas.entry before calling enter_rtas(). This
      leads to a kernel oops when user space calls rtas system call on a powernv
      platform (see below). This patch adds code to validate rtas.entry before
      making enter_rtas() call.
      
        Oops: Exception in kernel mode, sig: 4 [#1]
        SMP NR_CPUS=1024 NUMA PowerNV
        task: c000000004294b80 ti: c0000007e1a78000 task.ti: c0000007e1a78000
        NIP: 0000000000000000 LR: 0000000000009c14 CTR: c000000000423140
        REGS: c0000007e1a7b920 TRAP: 0e40   Not tainted  (3.18.17-340.el7_1.pkvm3_1_0.2400.1.ppc64le)
        MSR: 1000000000081000 <HV,ME>  CR: 00000000  XER: 00000000
        CFAR: c000000000009c0c SOFTE: 0
        NIP [0000000000000000]           (null)
        LR [0000000000009c14] 0x9c14
        Call Trace:
        [c0000007e1a7bba0] [c00000000041a7f4] avc_has_perm_noaudit+0x54/0x110 (unreliable)
        [c0000007e1a7bd80] [c00000000002ddc0] ppc_rtas+0x150/0x2d0
        [c0000007e1a7be30] [c000000000009358] syscall_exit+0x0/0x98
      
      Fixes: 55190f88
      
       ("powerpc: Add skeleton PowerNV platform")
      Reported-by: default avatarNAGESWARA R. SASTRY <nasastry@in.ibm.com>
      Signed-off-by: default avatarVasant Hegde <hegdevasant@linux.vnet.ibm.com>
      [mpe: Reword change log, trim oops, and add stable + fixes]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29589707
  8. 29 Sep, 2015 4 commits
    • Leonidas Da Silva Barbosa's avatar
      powerpc: Uncomment and make enable_kernel_vsx() routine available · b4092436
      Leonidas Da Silva Barbosa authored
      commit 72cd7b44
      
       upstream.
      
      enable_kernel_vsx() function was commented since anything was using
      it. However, vmx-crypto driver uses VSX instructions which are
      only available if VSX is enable. Otherwise it rises an exception oops.
      
      This patch uncomment enable_kernel_vsx() routine and makes it available.
      Signed-off-by: default avatarLeonidas S. Barbosa <leosilva@linux.vnet.ibm.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4092436
    • Thomas Huth's avatar
      powerpc/rtas: Introduce rtas_get_sensor_fast() for IRQ handlers · ce813f1f
      Thomas Huth authored
      commit 1c2cb594 upstream.
      
      The EPOW interrupt handler uses rtas_get_sensor(), which in turn
      uses rtas_busy_delay() to wait for RTAS becoming ready in case it
      is necessary. But rtas_busy_delay() is annotated with might_sleep()
      and thus may not be used by interrupts handlers like the EPOW handler!
      This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
      enabled:
      
       BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
       Call Trace:
       [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
       [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
       [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
       [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
       [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
       [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
       [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
       [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
       [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
       [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
       [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
       [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
       [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180
      
      Fix this issue by introducing a new rtas_get_sensor_fast() function
      that does not use rtas_busy_delay() - and thus can only be used for
      sensors that do not cause a BUSY condition - known as "fast" sensors.
      
      The EPOW sensor is defined to be "fast" in sPAPR - mpe.
      
      Fixes: 587f83e8
      
       ("powerpc/pseries: Use rtas_get_sensor in RAS code")
      Signed-off-by: default avatarThomas Huth <thuth@redhat.com>
      Reviewed-by: default avatarNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce813f1f
    • Gavin Shan's avatar
      powerpc/eeh: Fix fenced PHB caused by eeh_slot_error_detail() · f1ab3c04
      Gavin Shan authored
      commit 25980013 upstream.
      
      The config space of some PCI devices can't be accessed when their
      PEs are in frozen state. Otherwise, fenced PHB might be seen.
      Those PEs are identified with flag EEH_PE_CFG_RESTRICTED, meaing
      EEH_PE_CFG_BLOCKED is set automatically when the PE is put to
      frozen state (EEH_PE_ISOLATED). eeh_slot_error_detail() restores
      PCI device BARs with eeh_pe_restore_bars(), which then calls
      eeh_ops->restore_config() to reinitialize the PCI device in
      (OPAL) firmware. eeh_ops->restore_config() produces PCI config
      access that causes fenced PHB. The problem was reported on below
      adapter:
      
         0001:01:00.0 0200: 14e4:168e (rev 10)
         0001:01:00.0 Ethernet controller: Broadcom Corporation \
                      NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)
      
      This fixes the issue by skipping eeh_pe_restore_bars() in
      eeh_slot_error_detail() when EEH_PE_CFG_BLOCKED is set for the PE.
      
      Fixes: b6541db1
      
       ("powerpc/eeh: Block PCI config access upon frozen PE")
      Reported-by: default avatarManvanthara B. Puttashankar <mputtash@in.ibm.com>
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1ab3c04
    • Daniel Axtens's avatar
      powerpc/eeh: Probe after unbalanced kref check · 91552f87
      Daniel Axtens authored
      commit e642d11b upstream.
      
      In the complete hotplug case, EEH PEs are supposed to be released
      and set to NULL. Normally, this is done by eeh_remove_device(),
      which is called from pcibios_release_device().
      
      However, if something is holding a kref to the device, it will not
      be released, and the PE will remain. eeh_add_device_late() has
      a check for this which will explictly destroy the PE in this case.
      
      This check in eeh_add_device_late() occurs after a call to
      eeh_ops->probe(). On PowerNV, probe is a pointer to pnv_eeh_probe(),
      which will exit without probing if there is an existing PE.
      
      This means that on PowerNV, devices with outstanding krefs will not
      be rediscovered by EEH correctly after a complete hotplug. This is
      affecting CXL (CAPI) devices in the field.
      
      Put the probe after the kref check so that the PE is destroyed
      and affected devices are correctly rediscovered by EEH.
      
      Fixes: d91dafc0
      
       ("powerpc/eeh: Delay probing EEH device during hotplug")
      Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Acked-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      91552f87
  9. 17 Aug, 2015 1 commit
    • Amanieu d'Antras's avatar
      signal: fix information leak in copy_siginfo_from_user32 · 52124831
      Amanieu d'Antras authored
      commit 3c00cb5e
      
       upstream.
      
      This function can leak kernel stack data when the user siginfo_t has a
      positive si_code value.  The top 16 bits of si_code descibe which fields
      in the siginfo_t union are active, but they are treated inconsistently
      between copy_siginfo_from_user32, copy_siginfo_to_user32 and
      copy_siginfo_to_user.
      
      copy_siginfo_from_user32 is called from rt_sigqueueinfo and
      rt_tgsigqueueinfo in which the user has full control overthe top 16 bits
      of si_code.
      
      This fixes the following information leaks:
      x86:   8 bytes leaked when sending a signal from a 32-bit process to
             itself. This leak grows to 16 bytes if the process uses x32.
             (si_code = __SI_CHLD)
      x86:   100 bytes leaked when sending a signal from a 32-bit process to
             a 64-bit process. (si_code = -1)
      sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
             64-bit process. (si_code = any)
      
      parsic and s390 have similar bugs, but they are not vulnerable because
      rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
      to a different process.  These bugs are also fixed for consistency.
      Signed-off-by: default avatarAmanieu d'Antras <amanieu@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52124831
  10. 10 Aug, 2015 1 commit
    • Shreyas B. Prabhu's avatar
      powerpc/powernv: Fix race in updating core_idle_state · 5f0f854a
      Shreyas B. Prabhu authored
      commit b32aadc1
      
       upstream.
      
      core_idle_state is maintained for each core. It uses 0-7 bits to track
      whether a thread in the core has entered fastsleep or winkle. 8th bit is
      used as a lock bit.
      The lock bit is set in these 2 scenarios-
       - The thread is first in subcore to wakeup from sleep/winkle.
       - If its the last thread in the core about to enter sleep/winkle
      
      While the lock bit is set, if any other thread in the core wakes up, it
      loops until the lock bit is cleared before proceeding in the wakeup
      path. This helps prevent race conditions w.r.t fastsleep workaround and
      prevents threads from switching to process context before core/subcore
      resources are restored.
      
      But, in the path to sleep/winkle entry, we currently don't check for
      lock-bit. This exposes us to following race when running with subcore
      on-
      
      First thread in the subcorea		Another thread in the same
      waking up		   		core entering sleep/winkle
      
      lwarx   r15,0,r14
      ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
      stwcx.  r15,0,r14
      [Code to restore subcore state]
      
      						lwarx   r15,0,r14
      						[clear thread bit]
      						stwcx.  r15,0,r14
      
      andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
      stw     r15,0(r14)
      
      Here, after the thread entering sleep clears its thread bit in
      core_idle_state, the value is overwritten by the thread waking up.
      In such cases when the core enters fastsleep, code mistakes an idle
      thread as running. Because of this, the first thread waking up from
      fastsleep which is supposed to resync timebase skips it. So we can
      end up having a core with stale timebase value.
      
      This patch fixes the above race by looping on the lock bit even while
      entering the idle states.
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus'
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f0f854a
  11. 14 May, 2015 1 commit
  12. 12 May, 2015 1 commit
    • Daniel Axtens's avatar
      powerpc/mce: fix off by one errors in mce event handling · ffb2d78e
      Daniel Axtens authored
      Before 69111bac ("powerpc: Replace __get_cpu_var uses"), in
      save_mce_event, index got the value of mce_nest_count, and
      mce_nest_count was incremented *after* index was set.
      
      However, that patch changed the behaviour so that mce_nest count was
      incremented *before* setting index.
      
      This causes an off-by-one error, as get_mce_event sets index as
      mce_nest_count - 1 before reading mce_event.  Thus get_mce_event reads
      bogus data, causing warnings like
      "Machine Check Exception, Unknown event version 0 !"
      and breaking MCEs handling.
      
      Restore the old behaviour and unbreak MCE handling by subtracting one
      from the newly incremented value.
      
      The same broken change occured in machine_check_queue_event (which set
      a queue read by machine_check_process_queued_event).  Fix that too,
      unbreaking printing of MCE information.
      
      Fixes: 69111bac
      
       ("powerpc: Replace __get_cpu_var uses")
      CC: stable@vger.kernel.org
      CC: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      CC: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      ffb2d78e
  13. 01 May, 2015 3 commits
    • Sam Bobroff's avatar
      powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff authored
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f
      
       ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: default avatarSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0aab3747
    • Gavin Shan's avatar
      powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Gavin Shan authored
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices in early stage, which is reasonable to pSeries platform.
      However, it's wrong for PowerNV platform because the PE# isn't
      determined until the resources (IO and MMIO) are assigned to
      PE in hotplug case. So we have to delay probing EEH devices
      for PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454
      
       ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      d91dafc0
    • Gavin Shan's avatar
      powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state() · 1ae79b78
      Gavin Shan authored
      When asserting reset in pcibios_set_pcie_reset_state(), the PE
      is enforced to (hardware) frozen state in order to drop unexpected
      PCI transactions (except PCI config read/write) automatically by
      hardware during reset, which would cause recursive EEH error.
      However, the (software) frozen state EEH_PE_ISOLATED is missed.
      When users get 0xFF from PCI config or MMIO read, EEH_PE_ISOLATED
      is set in PE state retrival backend. Unfortunately, nobody (the
      reset handler or the EEH recovery functinality in host) will clear
      EEH_PE_ISOLATED when the PE has been passed through to guest.
      
      The patch sets and clears EEH_PE_ISOLATED properly during reset
      in function pcibios_set_pcie_reset_state() to fix the issue.
      
      Fixes: 28158cd1
      
       ("Enhance pcibios_set_pcie_reset_state()")
      Reported-by: default avatarCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Tested-by: default avatarCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      1ae79b78
  14. 30 Apr, 2015 1 commit
    • Michael Ellerman's avatar
      Revert "powerpc/tm: Abort syscalls in active transactions" · 68fc378c
      Michael Ellerman authored
      This reverts commit feba4036.
      
      Although the principle of this change is good, the implementation has a
      few issues.
      
      Firstly we can sometimes fail to abort a syscall because r12 may have
      been clobbered by C code if we went down the virtual CPU accounting
      path, or if syscall tracing was enabled.
      
      Secondly we have decided that it is safer to abort the syscall even
      earlier in the syscall entry path, so that we avoid the syscall tracing
      path when we are transactional.
      
      So that we have time to thoroughly test those changes we have decided to
      revert this for this merge window and will merge the fixed version in
      the next window.
      
      NB. Rather than reverting the selftest we just drop tm-syscall from
      TEST_PROGS so that it's not run by default.
      
      Fixes: feba4036
      
       ("powerpc/tm: Abort syscalls in active transactions")
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      68fc378c
  15. 21 Apr, 2015 5 commits
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 · 66feed61
      Paul Mackerras authored
      
      
      This uses msgsnd where possible for signalling other threads within
      the same core on POWER8 systems, rather than IPIs through the XICS
      interrupt controller.  This includes waking secondary threads to run
      the guest, the interrupts generated by the virtual XICS, and the
      interrupts to bring the other threads out of the guest when exiting.
      
      Aggregated statistics from debugfs across vcpus for a guest with 32
      vcpus, 8 threads/vcore, running on a POWER8, show this before the
      change:
      
       rm_entry:     3387.6ns (228 - 86600, 1008969 samples)
        rm_exit:     4561.5ns (12 - 3477452, 1009402 samples)
        rm_intr:     1660.0ns (12 - 553050, 3600051 samples)
      
      and this after the change:
      
       rm_entry:     3060.1ns (212 - 65138, 953873 samples)
        rm_exit:     4244.1ns (12 - 9693408, 954331 samples)
        rm_intr:     1342.3ns (12 - 1104718, 3405326 samples)
      
      for a test of booting Fedora 20 big-endian to the login prompt.
      
      The time taken for a H_PROD hcall (which is handled in the host
      kernel) went down from about 35 microseconds to about 16 microseconds
      with this change.
      
      The noinline added to kvmppc_run_core turned out to be necessary for
      good performance, at least with gcc 4.9.2 as packaged with Fedora 21
      and a little-endian POWER8 host.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      66feed61
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Paul Mackerras authored
      
      
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      7d6c40da
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken · 5d5b99cd
      Paul Mackerras authored
      
      
      We can tell when a secondary thread has finished running a guest by
      the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
      is no real need for the nap_count field in the kvmppc_vcore struct.
      This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
      pointers of the secondary threads rather than polling vc->nap_count.
      Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
      this also means that we can tell which secondary threads have got
      stuck and thus print a more informative error message.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      5d5b99cd
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Paul Mackerras authored
      
      
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      1f09c3ed
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras authored
      
      
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      b6c295df
  16. 17 Apr, 2015 1 commit
    • Aneesh Kumar K.V's avatar
      powerpc/mm/thp: Make page table walk safe against thp split/collapse · 691e95fd
      Aneesh Kumar K.V authored
      
      
      We can disable a THP split or a hugepage collapse by disabling irq.
      We do send IPI to all the cpus in the early part of split/collapse,
      and disabling local irq ensure we don't make progress with
      split/collapse. If the THP is getting split we return NULL from
      find_linux_pte_or_hugepte(). For all the current callers it should be ok.
      We need to be careful if we want to use returned pte_t pointer outside
      the irq disabled region. W.r.t to THP split, the pfn remains the same,
      but then a hugepage collapse will result in a pfn change. There are
      few steps we can take to avoid a hugepage collapse.One way is to take page
      reference inside the irq disable region. Other option is to take
      mmap_sem so that a parallel collapse will not happen. We can also
      disable collapse by taking pmd_lock. Another method used by kvm
      subsystem is to check whether we had a mmu_notifer update in between
      using mmu_notifier_retry().
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      691e95fd
  17. 14 Apr, 2015 1 commit
  18. 11 Apr, 2015 8 commits