• Steven Rostedt's avatar
    tracing: Do not record user stack trace from NMI context · b6345879
    Steven Rostedt authored
    
    
    A bug was found with Li Zefan's ftrace_stress_test that caused applications
    to segfault during the test.
    
    Placing a tracing_off() in the segfault code, and examining several
    traces, I found that the following was always the case. The lock tracer
    was enabled (lockdep being required) and userstack was enabled. Testing
    this out, I just enabled the two, but that was not good enough. I needed
    to run something else that could trigger it. Running a load like hackbench
    did not work, but executing a new program would. The following would
    trigger the segfault within seconds:
    
      # echo 1 > /debug/tracing/options/userstacktrace
      # echo 1 > /debug/tracing/events/lock/enable
      # while :; do ls > /dev/null ; done
    
    Enabling the function graph tracer and looking at what was happening
    I finally noticed that all cashes happened just after an NMI.
    
     1)               |    copy_user_handle_tail() {
     1)               |      bad_area_nosemaphore() {
     1)               |        __bad_area_nosemaphore() {
     1)               |          no_context() {
     1)               |            fixup_exception() {
     1)   0.319 us    |              search_exception_tables();
     1)   0.873 us    |            }
    [...]
     1)   0.314 us    |  __rcu_read_unlock();
     1)   0.325 us    |    native_apic_mem_write();
     1)   0.943 us    |  }
     1)   0.304 us    |  rcu_nmi_exit();
    [...]
     1)   0.479 us    |  find_vma();
     1)               |  bad_area() {
     1)               |    __bad_area() {
    
    After capturing several traces of failures, all of them happened
    after an NMI. Curious about this, I added a trace_printk() to the NMI
    handler to read the regs->ip to see where the NMI happened. In which I
    found out it was here:
    
    ffffffff8135b660 <page_fault>:
    ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
    ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>
    
    What was happening is that the NMI would happen at the place that a page
    fault occurred. It would call rcu_read_lock() which was traced by
    the lock events, and the user_stack_trace would run. This would trigger
    a page fault inside the NMI. I do not see where the CR2 register is
    saved or restored in NMI handling. This means that it would corrupt
    the page fault handling that the NMI interrupted.
    
    The reason the while loop of ls helped trigger the bug, was that
    each execution of ls would cause lots of pages to be faulted in, and
    increase the chances of the race happening.
    
    The simple solution is to not allow user stack traces in NMI context.
    After this patch, I ran the above "ls" test for a couple of hours
    without any issues. Without this patch, the bug would trigger in less
    than a minute.
    
    Cc: stable@kernel.org
    Reported-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
    Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
    b6345879