    [IA64] kdump: Mask MCA/INIT on frozen cpus · 4295ab34
    Hidetoshi Seto authored
      An INIT asserted on the kdump kernel invokes the INIT handler not only
      on the cpu running the kdump kernel, but also on the BSP of the
      panicked kernel, because the (badly) frozen BSP can be thawed by INIT.
      kdump_cpu_freeze() is called on every cpu except the one that initiates
      the panic and/or kdump, to stop/offline the cpu (on ia64, this means we
      pass control of the cpu to SAL, or put it in a spinloop).  Note that
      CPU0 (BSP) always goes to the spinloop, so if the panic happened on an
      AP, there are at least 2 cpus (= the AP and the BSP) that have not gone
      back to SAL.
      On the spinning cpus, interrupts are disabled (rsm psr.i), but INIT
      can still be delivered, because psr.mc, which masks it, is not set
      unless kdump_cpu_freeze() is called from MCA/INIT context.
      Therefore, assume that a panic happened on an AP, kdump was
      invoked, new INIT handlers for the kdump kernel were registered, and
      then an INIT is asserted.  From the viewpoint of SAL, there are 2 online
      cpus, so INIT will be delivered to both of them.  This likely means
      that not only the AP (= the cpu executing kdump) enters the newly
      registered INIT handler, but also the BSP (= the other cpu, spinning in
      the panicked kernel) enters the same INIT handler.  Of course the
      register settings on the BSP are still the old ones (for the panicked
      kernel), so the result of running the handler with the wrong settings
      is unpredictable.
      I believe this is not desirable behavior.
    How to Reproduce:
      Start kdump on one of the APs (e.g. cpu1):
        # taskset 0x2 echo c > /proc/sysrq-trigger
      Then assert INIT after the kdump kernel has booted and the new INIT
      handler for the kdump kernel has been registered.
    Expected results:
      An INIT handler is invoked only on the AP.
    Actual results:
      An INIT handler is invoked on the AP and BSP.
    Sample of results:
      I got the following console log by asserting INIT after the prompt
      "root:/>".  It seems that two monarchs appeared from one INIT, and one
      of them panicked in the end.  It also seems that the panicked one
      assumed there were 4 online cpus and that none of them did rendezvous:
        [  0 %]dropping to initramfs shell
        exiting this shell will reboot your system
        root:/> Entered OS INIT handler. PSP=fff301a0 cpu=0 monarch=0
        ia64_init_handler: Promoting cpu 0 to monarch.
        Delaying for 5 seconds...
        All OS INIT slaves have reached rendezvous
        Processes interrupted by INIT - 0 (cpu 0 task 0xa000000100af0000)
        Entered OS INIT handler. PSP=fff301a0 cpu=0 monarch=1
        Delaying for 5 seconds...
        mlogbuf_finish: printing switched to urgent mode, MCA/INIT might be dodgy or fail.
        OS INIT slave did not rendezvous on cpu 1 2 3
        INIT swapper 0[0]: bugcheck! 0 [1]
        Kernel panic - not syncing: Attempted to kill the idle task!
    Proposed fix:
      To avoid this problem, this patch inserts ia64_set_psr_mc() to mask
      INIT on cpus that are about to be frozen.  This masking has no effect
      if kdump_cpu_freeze() is called from the INIT handler when
      kdump_on_init == 1, because psr.mc is already set to 1 before entering
      OS_INIT.
      I confirmed that the weird log shown above no longer appears after
      applying this patch.
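
The scenario and the fix described above can be illustrated with a small toy model. This is only a sketch in Python, not kernel code: the names `Cpu`, `freeze_cpus`, and `cpus_entering_init_handler` are hypothetical, and the model assumes that every AP other than the crashing one is handed back to SAL (the message says frozen cpus may also spin) and that a cpu with psr.mc set simply never enters the handler.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    returned_to_sal: bool = False  # assumption: SAL no longer treats this cpu as online
    psr_mc: bool = False           # models psr.mc; when set, INIT is masked


def freeze_cpus(cpus, crashing_cpu, mask_init):
    """Toy stand-in for freezing every cpu except the one running kdump."""
    for cpu in cpus:
        if cpu.cpu_id == crashing_cpu:
            continue
        if mask_init:
            # Models the effect of the added ia64_set_psr_mc() call.
            cpu.psr_mc = True
        if cpu.cpu_id != 0:
            # Simplifying assumption: frozen APs go back to SAL.
            cpu.returned_to_sal = True
        # cpu 0 (BSP) is never returned to SAL; it spins in the old kernel.


def cpus_entering_init_handler(cpus):
    """Cpus that SAL still sees as online and that do not mask INIT."""
    return [c.cpu_id for c in cpus if not c.returned_to_sal and not c.psr_mc]


# Panic on cpu1 of a 4-cpu system, without the fix.
before = [Cpu(i) for i in range(4)]
freeze_cpus(before, crashing_cpu=1, mask_init=False)
print(cpus_entering_init_handler(before))  # [0, 1]: BSP and AP both enter

# Same scenario with INIT masked on the frozen cpus.
after = [Cpu(i) for i in range(4)]
freeze_cpus(after, crashing_cpu=1, mask_init=True)
print(cpus_entering_init_handler(after))   # [1]: only the AP enters
```

The model only captures which cpus would run the handler; it does not represent SAL rendezvous or the register state that makes the BSP's entry dangerous.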
    Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Haren Myneni <hbabu@us.ibm.com>
    Cc: kexec@lists.infradead.org
    Acked-by: Fenghua Yu <fenghua.yu@intel.com>
    Signed-off-by: Tony Luck <tony.luck@intel.com>
crash.c 5.84 KB