1. 09 Aug, 2016 2 commits
  2. 01 Aug, 2016 1 commit
  3. 17 Jul, 2016 1 commit
  4. 15 Jul, 2016 7 commits
  5. 20 Jun, 2016 1 commit
    • Mahesh Salgaonkar's avatar
      KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt · fd7bacbc
      Mahesh Salgaonkar authored
      When a guest is assigned to a core it converts the host Timebase (TB)
      into guest TB by adding guest timebase offset before entering into
      guest. During guest exit it restores the guest TB to host TB. This means
      under certain conditions (Guest migration) host TB and guest TB can differ.
      When we get an HMI for TB related issues the opal HMI handler would
      try fixing errors and restore the correct host TB value. With no guest
      running, we don't have any issues. But with guest running on the core
      we run into TB corruption issues.
      If we get an HMI while in the guest, the current HMI handler invokes opal
      hmi handler before forcing guest to exit. The guest exit path subtracts
      the guest TB offset from the current TB value which may have already
      been restored with host value by opal hmi handler. This leads to incorrect
      host and guest TB values.
      With split-core, things become more complex. With split-core, TB also gets
      split and each subcore gets its own TB register. When a hmi handler fixes
      a TB error and restores the TB value, it affects all the TB values of
      sibling subcores on the same core. On TB errors all the thread in the core
      gets HMI. With existing code, the individual threads call opal hmi handle
      independently which can easily throw TB out of sync if we have guest
      running on subcores. Hence we will need to co-ordinate with all the
      threads before making opal hmi handler call followed by TB resync.
      This patch introduces a sibling subcore state structure (shared by all
      threads in the core) in paca which holds information about whether sibling
      subcores are in Guest mode or host mode. An array in_guest[] of size
      MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
      The subcore id is used as index into in_guest[] array. Only primary
      thread entering/exiting the guest is responsible to set/unset its
      designated array element.
      On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
      this patch will now force guest to vacate the core/subcore. Primary
      thread from each subcore will then turn off its respective bit
      from the above bitmap during the guest exit path just after the
      guest->host partition switch is complete.
      All other threads that have just exited the guest OR were already in host
      will wait until all other subcores clears their respective bit.
      Once all the subcores turn off their respective bit, all threads will
      will make call to opal hmi handler.
      It is not necessary that opal hmi handler would resync the TB value for
      every HMI interrupts. It would do so only for the HMI caused due to
      TB errors. For rest, it would not touch TB value. Hence to make things
      simpler, primary thread would call TB resync explicitly once for each
      core immediately after opal hmi handler instead of subtracting guest
      offset from TB. TB resync call will restore the TB with host value.
      Thus we can be sure about the TB state.
      One of the primary threads exiting the guest will take up the
      responsibility of calling TB resync. It will use one of the top bits
      (bit 63) from subcore state flags bitmap to make the decision. The first
      primary thread (among the subcores) that is able to set the bit will
      have to call the TB resync. Rest all other threads will wait until TB
      resync is complete.  Once TB resync is complete all threads will then
      Signed-off-by: default avatarMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
  6. 03 Mar, 2016 1 commit
  7. 01 Dec, 2015 1 commit
  8. 07 Jul, 2015 1 commit
    • Shreyas B. Prabhu's avatar
      powerpc/powernv: Fix race in updating core_idle_state · b32aadc1
      Shreyas B. Prabhu authored
      core_idle_state is maintained for each core. It uses 0-7 bits to track
      whether a thread in the core has entered fastsleep or winkle. 8th bit is
      used as a lock bit.
      The lock bit is set in these 2 scenarios-
       - The thread is first in subcore to wakeup from sleep/winkle.
       - If its the last thread in the core about to enter sleep/winkle
      While the lock bit is set, if any other thread in the core wakes up, it
      loops until the lock bit is cleared before proceeding in the wakeup
      path. This helps prevent race conditions w.r.t fastsleep workaround and
      prevents threads from switching to process context before core/subcore
      resources are restored.
      But, in the path to sleep/winkle entry, we currently don't check for
      lock-bit. This exposes us to following race when running with subcore
      First thread in the subcorea		Another thread in the same
      waking up		   		core entering sleep/winkle
      lwarx   r15,0,r14
      ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
      stwcx.  r15,0,r14
      [Code to restore subcore state]
      						lwarx   r15,0,r14
      						[clear thread bit]
      						stwcx.  r15,0,r14
      andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
      stw     r15,0(r14)
      Here, after the thread entering sleep clears its thread bit in
      core_idle_state, the value is overwritten by the thread waking up.
      In such cases when the core enters fastsleep, code mistakes an idle
      thread as running. Because of this, the first thread waking up from
      fastsleep which is supposed to resync timebase skips it. So we can
      end up having a core with stale timebase value.
      This patch fixes the above race by looping on the lock bit even while
      entering the idle states.
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus'
      Cc: stable@vger.kernel.org # 3.19+
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  9. 01 May, 2015 1 commit
    • Sam Bobroff's avatar
      powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff authored
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f
       ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: default avatarSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  10. 23 Mar, 2015 1 commit
    • Paul Mackerras's avatar
      powerpc/powernv: Fix return value from power7_nap() et al. · f57333a7
      Paul Mackerras authored
      The power7_nap(), power7_sleep() and power7_winkle() functions are
      called from pnv_smp_cpu_kill_self(), which expects them to return the
      SRR1 value set by the hardware on wakeup, or 0 if no nap/sleep/winkle
      occurred.  However, in the case where an interrupt needs to be
      replayed, the logic in power7_powersave_common (the common code for
      power7_nap et al.) doesn't set r3 to 0 in this case.  Instead what we
      get as the return value is the selector for the type of power-saving
      mode requested (1, 2 or 3).  In fact this should not affect the
      operation of pnv_smp_cpu_kill_self(), but it is better to get this
      correct, so this adds an instruction to set r3 to 0 in this case.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  11. 14 Dec, 2014 3 commits
    • Shreyas B. Prabhu's avatar
      powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu authored
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contests of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distingush first thread in core and subcore.
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
    • Shreyas B. Prabhu's avatar
      powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu authored
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: default avatarPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
    • Paul Mackerras's avatar
      powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Paul Mackerras authored
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  12. 08 Dec, 2014 1 commit
    • Paul Mackerras's avatar
      powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Paul Mackerras authored
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we instead cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      to avoid having the KVM code having to know what low-power mode to
      put the thread into.
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      Note that it is important a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  13. 25 Sep, 2014 1 commit
    • Paul Mackerras's avatar
      powerpc/powernv: Don't call generic code on offline cpus · d6a4f709
      Paul Mackerras authored
      On PowerNV platforms, when a CPU is offline, we put it into nap mode.
      It's possible that the CPU wakes up from nap mode while it is still
      offline due to a stray IPI.  A misdirected device interrupt could also
      potentially cause it to wake up.  In that circumstance, we need to clear
      the interrupt so that the CPU can go back to nap mode.
      In the past the clearing of the interrupt was accomplished by briefly
      enabling interrupts and allowing the normal interrupt handling code
      (do_IRQ() etc.) to handle the interrupt.  This has the problem that
      this code calls irq_enter() and irq_exit(), which call functions such
      as account_system_vtime() which use RCU internally.  Use of RCU is not
      permitted on offline CPUs and will trigger errors if RCU checking is
      To avoid calling into any generic code which might use RCU, we adopt
      a different method of clearing interrupts on offline CPUs.  Since we
      are on the PowerNV platform, we know that the system interrupt
      controller is a XICS being driven directly (i.e. not via hcalls) by
      the kernel.  Hence this adds a new icp_native_flush_interrupt()
      function to the native-mode XICS driver and arranges to call that
      when an offline CPU is woken from nap.  This new function reads the
      interrupt from the XICS.  If it is an IPI, it clears the IPI; if it
      is a device interrupt, it prints a warning and disables the source.
      Then it does the end-of-interrupt processing for the interrupt.
      The other thing that briefly enabling interrupts did was to check and
      clear the irq_happened flag in this CPU's PACA.  Therefore, after
      flushing the interrupt from the XICS, we also clear all bits except
      the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
      irq_happened flag.  The PACA_IRQ_HARD_DIS flag is set by power7_nap()
      and is left set to indicate that interrupts are hard disabled.  This
      means we then have to ignore that flag in power7_nap(), which is
      reasonable since it doesn't indicate that any interrupt event needs
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
  14. 05 Aug, 2014 2 commits
  15. 11 Jul, 2014 1 commit
    • Preeti U Murthy's avatar
      powerpc/powernv: Check for IRQHAPPENED before sleeping · c733cf83
      Preeti U Murthy authored
      Commit 8d6f7c5a
      : "powerpc/powernv: Make it possible to skip the IRQHAPPENED
      check in power7_nap()" added code that prevents cpus from checking for
      pending interrupts just before entering sleep state, which is wrong. These
      interrupts are delivered during the soft irq disabled state of the cpu.
      A cpu cannot enter any idle state with pending interrupts because they will
      never be serviced until the next time the cpu is woken up by some other
      interrupt. Its only then that the pending interrupts are replayed. This can result
      in device timeouts or warnings about this cpu being stuck.
      This patch fixes ths issue by ensuring that cpus check for pending interrupts
      just before entering any idle state as long as they are not in the path of split
      core operations.
      Signed-off-by: default avatarPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
  16. 28 May, 2014 1 commit
  17. 23 Apr, 2014 1 commit
  18. 05 Mar, 2014 2 commits
  19. 05 Dec, 2013 1 commit
  20. 17 Oct, 2013 1 commit
  21. 05 Sep, 2012 1 commit
    • Paul Mackerras's avatar
      powerpc/powernv: Always go into nap mode when CPU is offline · 375f561a
      Paul Mackerras authored
      The CPU hotplug code for the powernv platform currently only puts
      offline CPUs into nap mode if the powersave_nap variable is set.
      However, HV-style KVM on this platform requires secondary CPU threads
      to be offline and in nap mode.  Since we know nap mode works just
      fine on all POWER7 machines, and the only machines that support the
      powernv platform are POWER7 machines, this changes the code to
      always put offline CPUs into nap mode, regardless of powersave_nap.
      Powersave_nap still controls whether or not CPUs go into nap mode
      when idle, as before.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
  22. 08 Apr, 2012 1 commit
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70
      Paul Mackerras authored
      Currently on POWER7, if we are running the guest on a core and we don't
      need all the hardware threads, we do nothing to ensure that the unused
      threads aren't executing in the kernel (other than checking that they
      are offline).  We just assume they're napping and we don't do anything
      to stop them trying to enter the kernel while the guest is running.
      This means that a stray IPI can wake up the hardware thread and it will
      then try to enter the kernel, but since the core is in guest context,
      it will execute code from the guest in hypervisor mode once it turns the
      MMU on, which tends to lead to crashes or hangs in the host.
      This fixes the problem by adding two new one-byte flags in the
      kvmppc_host_state structure in the PACA which are used to interlock
      between the primary thread and the unused secondary threads when entering
      the guest.  With these flags, the primary thread can ensure that the
      unused secondaries are not already in kernel mode (i.e. handling a stray
      IPI) and then indicate that they should not try to enter the kernel
      if they do get woken for any reason.  Instead they will go into KVM code,
      find that there is no vcpu to run, acknowledge and clear the IPI and go
      back to nap mode.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
  23. 09 Mar, 2012 1 commit
    • Benjamin Herrenschmidt's avatar
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt authored
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      In addition, this adds a few refinements:
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
       - Fix resend of doorbells on return from interrupt on Book3E
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
       - Fix a bug with hard irq state tracking on native power7
  24. 08 Dec, 2011 1 commit
    • Paul Mackerras's avatar
      powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20
      Paul Mackerras authored
      This fixes a problem where a CPU thread coming out of nap mode can
      think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
      away in power7_idle, but in fact the values have been trashed because
      the thread was used for KVM in the mean time.  The result is that the
      thread crashes because code that called power7_idle (e.g.,
      pnv_smp_cpu_kill_self()) goes to use values in registers that have
      been trashed.
      The bit field in SRR1 that tells whether state was lost only reflects
      the most recent nap, which may not have been the nap instruction in
      power7_idle.  So we need an extra PACA field to indicate that state
      has been lost even if SRR1 indicates that the most recent nap didn't
      lose state.  We clear this field when saving the state in power7_idle,
      we set it to a non-zero value when we use the thread for KVM, and we
      test it in power7_wakeup_noloss.
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
  25. 12 Jul, 2011 1 commit
    • Paul Mackerras's avatar
      KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6
      Paul Mackerras authored
      This lifts the restriction that book3s_hv guests can only run one
      hardware thread per core, and allows them to use up to 4 threads
      per core on POWER7.  The host still has to run single-threaded.
      This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
      capability.  The return value of the ioctl querying this capability
      is the number of vcpus per virtual CPU core (vcore), currently 4.
      To use this, the host kernel should be booted with all threads
      active, and then all the secondary threads should be offlined.
      This will put the secondary threads into nap mode.  KVM will then
      wake them from nap mode and use them for running guest code (while
      they are still offline).  To wake the secondary threads, we send
      them an IPI using a new xics_wake_cpu() function, implemented in
      arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
      we assume that the platform has a XICS interrupt controller and
      we are using icp-native.c to drive it.  Since the woken thread will
      need to acknowledge and clear the IPI, we also export the base
      physical address of the XICS registers using kvmppc_set_xics_phys()
      for use in the low-level KVM book3s code.
      When a vcpu is created, it is assigned to a virtual CPU core.
      The vcore number is obtained by dividing the vcpu number by the
      number of threads per core in the host.  This number is exported
      to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
      to run the guest in single-threaded mode, it should make all vcpu
      numbers be multiples of the number of threads per core.
      We distinguish three states of a vcpu: runnable (i.e., ready to execute
      the guest), blocked (that is, idle), and busy in host.  We currently
      implement a policy that the vcore can run only when all its threads
      are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
      in the kernel or in qemu, it can do so without being starved of CPU
      by the other vcpus.
      When a vcore starts to run, it executes in the context of one of the
      vcpu threads.  The other vcpu threads all go to sleep and stay asleep
      until something happens requiring the vcpu thread to return to qemu,
      or to wake up to run the vcore (this can happen when another vcpu
      thread goes from busy in host state to blocked).
      It can happen that a vcpu goes from blocked to runnable state (e.g.
      because of an interrupt), and the vcore it belongs to is already
      running.  In that case it can start to run immediately as long as
      the none of the vcpus in the vcore have started to exit the guest.
      We send the next free thread in the vcore an IPI to get it to start
      to execute the guest.  It synchronizes with the other threads via
      the vcore->entry_exit_count field to make sure that it doesn't go
      into the guest if the other vcpus are exiting by the time that it
      is ready to actually enter the guest.
      Note that there is no fixed relationship between the hardware thread
      number and the vcpu number.  Hardware threads are assigned to vcpus
      as they become runnable, so we will always use the lower-numbered
      hardware threads in preference to higher-numbered threads if not all
      the vcpus in the vcore are runnable, regardless of which vcpus are
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
  26. 20 Apr, 2011 1 commit