1. 27 Apr, 2018 4 commits
    • locking/qspinlock: Kill cmpxchg() loop when claiming lock from head of queue · c61da58d
      Will Deacon authored

      When a queued locker reaches the head of the queue, it claims the lock
      by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
      must also clear the tail as part of this operation so that subsequent
      lockers can avoid taking the slowpath altogether.
      
      Currently this is expressed as a cmpxchg() loop that practically only
      runs up to two iterations. This is confusing to the reader and unhelpful
      to the compiler. Rewrite the cmpxchg() loop without the loop, so that a
      failed cmpxchg() implies that there is contention and we just need to
      write to _Q_LOCKED_VAL without considering the rest of the lockword.
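      
      For illustration, the resulting claim sequence looks roughly like this
      (a sketch, not the exact diff; _Q_TAIL_MASK, _Q_LOCKED_VAL, set_locked()
      and atomic_try_cmpxchg_relaxed() are the helpers the qspinlock code
      already uses):
      
        /*
         * If nobody is pending and the tail still points at us, try to
         * clear the tail and grab the lock in one go. A failed cmpxchg()
         * means a new waiter appeared, so only the locked byte needs
         * to be set.
         */
        if ((val & _Q_TAIL_MASK) == tail) {
                if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                        goto release;   /* No contention */
        }
      
        /* Either somebody is queued behind us or pending is set. */
        set_locked(lock);
      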
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath · 59fb586b
      Will Deacon authored

      The qspinlock locking slowpath utilises a "pending" bit as a simple form
      of an embedded test-and-set lock that can avoid the overhead of explicit
      queuing in cases where the lock is held but uncontended. This bit is
      managed using a cmpxchg() loop which tries to transition the uncontended
      lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
      
      Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
      indefinitely if the lock word is seen to oscillate between unlocked
      (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
      able to take the lock in the cmpxchg() loop without queuing and pass it
      around amongst themselves.
      
      This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
      using atomic_fetch_or, and then inspecting the old value to see whether
      we need to spin on the current lock owner, or whether we now effectively
      hold the lock. The tricky scenario is when concurrent lockers end up
      queuing on the lock and the lock becomes available, causing us to see
      a lockword of (n,0,0). With pending now set, simply queuing could lead
      to deadlock as the head of the queue may not have observed the pending
      flag being cleared. Conversely, if the head of the queue did observe
      pending being cleared, then it could transition the lock from (n,0,0) ->
      (0,0,1) meaning that any attempt to "undo" our setting of the pending
      bit could race with a concurrent locker trying to set it.
      
      We handle this race by preserving the pending bit when taking the lock
      after reaching the head of the queue and leaving the tail entry intact
      if we saw pending set, because we know that the tail is going to be
      updated shortly.
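      
      Sketched out, the new pending-bit fast path looks something like this
      (illustrative; clear_pending(), clear_pending_set_locked() and the
      _Q_*_MASK constants follow the existing qspinlock naming):
      
        /* 0,0,* -> 0,1,* : set pending unconditionally */
        val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
      
        if (!(val & ~_Q_LOCKED_MASK)) {
                /* We own pending: wait for the owner to go away... */
                if (val & _Q_LOCKED_MASK)
                        atomic_cond_read_acquire(&lock->val,
                                                 !(VAL & _Q_LOCKED_MASK));
                /* ...then take the lock: 0,1,0 -> 0,0,1 */
                clear_pending_set_locked(lock);
                return;
        }
      
        /*
         * Waiters are queued but pending was clear (n,0,0): undo our
         * pending bit before queuing ourselves.
         */
        if (!(val & _Q_PENDING_MASK))
                clear_pending(lock);
        /* fall through to the queuing path */
      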
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Bound spinning on pending->locked transition in slowpath · 6512276d
      Will Deacon authored

      If a locker taking the qspinlock slowpath reads a lock value indicating
      that only the pending bit is set, then it will spin whilst the
      concurrent pending->locked transition takes effect.
      
      Unfortunately, there is no guarantee that such a transition will ever be
      observed since concurrent lockers could continuously set pending and
      hand over the lock amongst themselves, leading to starvation. Whilst
      this would probably resolve in practice, it means that it is not
      possible to prove liveness properties about the lock and means that lock
      acquisition time is unbounded.
      
      Rather than removing the pending->locked spinning from the slowpath
      altogether (which has been shown to heavily penalise a 2-threaded
      locking stress test on x86), this patch replaces the explicit spinning
      with a call to atomic_cond_read_relaxed and allows the architecture to
      provide a bound on the number of spins. For architectures that can
      respond to changes in cacheline state in their smp_cond_load implementation,
      it should be sufficient to use the default bound of 1.
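      
      Concretely, the spin takes roughly this shape (a sketch;
      _Q_PENDING_LOOPS is the new architecture-overridable bound):
      
        #ifndef _Q_PENDING_LOOPS
        #define _Q_PENDING_LOOPS        1
        #endif
      
        /*
         * Wait for the pending->locked hand-over, but give up after
         * _Q_PENDING_LOOPS iterations and fall back to queuing.
         */
        if (val == _Q_PENDING_VAL) {
                int cnt = _Q_PENDING_LOOPS;
                val = atomic_cond_read_relaxed(&lock->val,
                                               (VAL != _Q_PENDING_VAL) || !cnt--);
        }
      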
      Suggested-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock' · 625e88be
      Will Deacon authored

      'struct __qspinlock' provides a handy union of fields so that
      subcomponents of the lockword can be accessed by name, without having to
      manage shifts and masks explicitly and take endianness into account.
      
      This is useful in qspinlock.h and also potentially in arch headers, so
      move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
      definition.
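      
      The merged definition looks roughly like this (little-endian layout
      shown; the big-endian variant reverses the byte/halfword order):
      
        typedef struct qspinlock {
                union {
                        atomic_t val;
      
                        /* Named access to the lockword subcomponents. */
                        struct {
                                u8      locked;
                                u8      pending;
                        };
                        struct {
                                u16     locked_pending;
                                u16     tail;
                        };
                };
        } arch_spinlock_t;
      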
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Boqun Feng <boqun.feng@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 25 Apr, 2018 3 commits
    • tracing: Fix missing tab for hwlat_detector print format · 9a0fd675
      Peter Xu authored
      The tab has been missing for a while, but no one has fixed it up. Fix it.
      
      Link: http://lkml.kernel.org/r/20180315060639.9578-1-peterx@redhat.com
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 7b2c8625 ("tracing: Add NMI tracing in hwlat detector")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • kprobes: Fix random address output of blacklist file · bcbd385b
      Thomas Richter authored
      File /sys/kernel/debug/kprobes/blacklist displays random addresses:
      
      [root@s8360046 linux]# cat /sys/kernel/debug/kprobes/blacklist
      0x0000000047149a90-0x00000000bfcb099a	print_type_x8
      ....
      
      This breaks 'perf probe' which uses the blacklist file to prohibit
      probes on certain functions by checking the address range.
      
      Fix this by printing the correct (unhashed) address.
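      
      The fix amounts to switching the blacklist seq_file handler from the
      hashing %p format to the raw %px (a sketch of the change):
      
        -	seq_printf(m, "0x%p-0x%p\t%ps\n", (void *)ent->start_addr,
        +	seq_printf(m, "0x%px-0x%px\t%ps\n", (void *)ent->start_addr,
        		   (void *)ent->end_addr, (void *)ent->start_addr);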
      
      The file mode is world-readable, but this is not an issue, as the file
      hierarchy shows:
       # ls -ld /sys/ /sys/kernel/ /sys/kernel/debug/ /sys/kernel/debug/kprobes/
      	/sys/kernel/debug/kprobes/blacklist
      dr-xr-xr-x 12 root root 0 Apr 19 07:56 /sys/
      drwxr-xr-x  8 root root 0 Apr 19 07:56 /sys/kernel/
      drwx------ 16 root root 0 Apr 19 06:56 /sys/kernel/debug/
      drwxr-xr-x  2 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/
      -r--r--r--  1 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/blacklist
      
      Everything in and below /sys/kernel/debug is accessible to root only;
      group and others have no access.
      
      Background:
      Directory /sys/kernel/debug/kprobes is created by debugfs_create_dir()
      which sets the mode bits to rwxr-xr-x. Maybe change that to use the
      parent's directory mode bits instead?
      
      Link: http://lkml.kernel.org/r/20180419105556.86664-1-tmricht@linux.ibm.com
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Cc: <stable@vger.kernel.org> # v4.15+
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S Miller <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: acme@kernel.org
      Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix kernel crash while using empty filter with perf · ba16293d
      Ravi Bangoria authored
      The kernel crashes when a user tries to record the 'ftrace:function'
      event with an empty filter:
      
        # perf record -e ftrace:function --filter="" ls
      
        # dmesg
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        Oops: 0000 [#1] SMP PTI
        ...
        RIP: 0010:ftrace_profile_set_filter+0x14b/0x2d0
        RSP: 0018:ffffa4a7c0da7d20 EFLAGS: 00010246
        RAX: ffffa4a7c0da7d64 RBX: 0000000000000000 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff8c48ffc968f0
        ...
        Call Trace:
         _perf_ioctl+0x54a/0x6b0
         ? rcu_all_qs+0x5/0x30
        ...
      
      After patch:
        # perf record -e ftrace:function --filter="" ls
        failed to set filter "" on event ftrace:function with 22 (Invalid argument)
      
      Also, echoing "" > filter used to throw an error, but commit 80765597
      ("tracing: Rewrite filter logic to be simpler and faster") changed that
      behavior. This patch restores it as a side effect:
      
      Before patch:
        # echo "" > filter
        #
      
      After patch:
        # echo "" > filter
        bash: echo: write error: Invalid argument
        #
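      
      The shape of the fix: an empty filter string parses to zero predicates,
      and instead of installing a NULL predicate program (which perf later
      dereferences), the parsing path rejects it up front. A sketch, with
      nr_preds standing in for the predicate count in the filter-parsing
      code:
      
        /*
         * "" contains no predicates: reject it instead of building a
         * NULL program that ftrace_profile_set_filter() would chase.
         */
        if (!nr_preds)
                return -EINVAL;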
      
      Link: http://lkml.kernel.org/r/20180420150758.19787-1-ravi.bangoria@linux.ibm.com
      
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  3. 21 Apr, 2018 1 commit
    • fork: unconditionally clear stack on fork · e01e8063
      Kees Cook authored
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
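      
      The change boils down to two spots in the stack allocator (a sketch,
      assuming the CONFIG_VMAP_STACK stack cache in kernel/fork.c):
      
        /* When reusing a cached vmap stack in alloc_thread_stack_node(): */
        /* Clear stale pointers from reused stack. */
        memset(s->addr, 0, THREAD_SIZE);
      
        /* And for page-backed stacks, allocate them pre-zeroed: */
        #define THREADINFO_GFP	(GFP_KERNEL_ACCOUNT | __GFP_ZERO)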
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more testing with perf and cycle counts on running 100,000
      execs of /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 20 Apr, 2018 1 commit
  5. 19 Apr, 2018 1 commit
  6. 17 Apr, 2018 9 commits
  7. 14 Apr, 2018 14 commits
  8. 12 Apr, 2018 1 commit
  9. 11 Apr, 2018 6 commits