1. 16 Jan, 2018 2 commits
    • Petr Mladek's avatar
      printk: Hide console waiter logic into helpers · c162d5b4
      Petr Mladek authored
      The commit ("printk: Add console owner and waiter logic to load balance
      console writes") made vprintk_emit() and console_unlock() even more
      This patch extracts the new code into 3 helper functions. They should
      help to keep it rather self-contained. It will be easier to use and
      This patch just shuffles the existing code. It does not change
      the functionality.
      Link: http://lkml.kernel.org/r/20180112160837.GD24497@linux.suse
      Cc: akpm@linux-foundation.org
      Cc: linux-mm@kvack.org
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: rostedt@home.goodmis.org
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
    • Steven Rostedt (VMware)'s avatar
      printk: Add console owner and waiter logic to load balance console writes · dbdda842
      Steven Rostedt (VMware) authored
      This patch implements what I discussed in Kernel Summit. I added
      lockdep annotation (hopefully correctly), and it hasn't had any splats
      (since I fixed some bugs in the first iterations). It did catch
      problems when I had the owner covering too much. But now that the owner
      is only set when actively calling the consoles, lockdep has stayed
      Here's the design again:
      I added a "console_owner" which is set to a task that is actively
      writing to the consoles. It is *not* the same as the owner of the
      console_lock. It is only set when doing the calls to the console
      functions. It is protected by a console_owner_lock which is a raw spin
      There is a console_waiter. This is set when there is an active console
      owner that is not current, and waiter is not set. This too is protected
      by console_owner_lock.
      In printk() when it tries to write to the consoles, we have:
      	if (console_trylock())
      Now I added an else, which will check if there is an active owner, and
      no current waiter. If that is the case, then console_waiter is set, and
      the task goes into a spin until it is no longer set.
      When the active console owner finishes writing the current message to
      the consoles, it grabs the console_owner_lock and sees if there is a
      waiter, and clears console_owner.
      If there is a waiter, then it breaks out of the loop, clears the waiter
      flag (because that will release the waiter from its spin), and exits.
      Note, it does *not* release the console semaphore. Because it is a
      semaphore, there is no owner. Another task may release it. This means
      that the waiter is guaranteed to be the new console owner! Which it
      Then the waiter calls console_unlock() and continues to write to the
      If another task comes along and does a printk() it too can become the
      new waiter, and we wash rinse and repeat!
      By Petr Mladek about possible new deadlocks:
      The thing is that we move console_sem only to printk() call
      that normally calls console_unlock() as well. It means that
      the transferred owner should not bring new type of dependencies.
      As Steven said somewhere: "If there is a deadlock, it was
      there even before."
      We could look at it from this side. The possible deadlock would
      look like:
      CPU0                            CPU1
        console_owner = current;
      				    spin = true;
      				    while (...)
      This would be a deadlock. CPU0 would wait for the lock A.
      While CPU1 would own the lockA and would wait for CPU0
      to finish calling the console drivers and pass the console_sem
      But if the above is true than the following scenario was
      already possible before:
      By other words, this deadlock was there even before. Such
      deadlocks are prevented by using printk_deferred() in
      the sections guarded by the lock A.
      By Steven Rostedt:
      To demonstrate the issue, this module has been shown to lock up a
      system with 4 CPUs and a slow console (like a serial console). It is
      also able to lock up a 8 CPU system with only a fast (VGA) console, by
      passing in "loops=100". The changes in this commit prevent this module
      from locking up the system.
       #include <linux/module.h>
       #include <linux/delay.h>
       #include <linux/sched.h>
       #include <linux/mutex.h>
       #include <linux/workqueue.h>
       #include <linux/hrtimer.h>
       static bool stop_testing;
       static unsigned int loops = 1;
       static void preempt_printk_workfn(struct work_struct *work)
       	int i;
       	while (!READ_ONCE(stop_testing)) {
       		for (i = 0; i < loops && !READ_ONCE(stop_testing); i++) {
       			pr_emerg("%5d%-75s\n", smp_processor_id(),
       				 " XXX NOPREEMPT");
       static struct work_struct __percpu *works;
       static void finish(void)
       	int cpu;
       	WRITE_ONCE(stop_testing, true);
       		flush_work(per_cpu_ptr(works, cpu));
       static int __init test_init(void)
       	int cpu;
       	works = alloc_percpu(struct work_struct);
       	if (!works)
       		return -ENOMEM;
       	 * This is just a test module. This will break if you
       	 * do any CPU hot plugging between loading and
       	 * unloading the module.
       	for_each_online_cpu(cpu) {
       		struct work_struct *work = per_cpu_ptr(works, cpu);
       		INIT_WORK(work, &preempt_printk_workfn);
       		schedule_work_on(cpu, work);
       	return 0;
       static void __exit test_exit(void)
       module_param(loops, uint, 0);
      Link: http://lkml.kernel.org/r/20180110132418.7080-2-pmladek@suse.com
      Cc: akpm@linux-foundation.org
      Cc: linux-mm@kvack.org
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      [pmladek@suse.com: Commit message about possible deadlocks]
      Acked-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
  2. 18 Nov, 2017 19 commits
  3. 16 Nov, 2017 8 commits
  4. 15 Nov, 2017 1 commit
    • Eric Dumazet's avatar
      bpf: fix lockdep splat · 89ad2fa3
      Eric Dumazet authored
      pcpu_freelist_pop() needs the same lockdep awareness than
      pcpu_freelist_populate() to avoid a false positive.
       [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
       switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
        (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300
       and this task is already holding:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0
       which would create a new lock dependency:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}
       but this new dependency connects a SOFTIRQ-irq-safe lock:
        (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
       ... which became SOFTIRQ-irq-safe at:
         [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
         [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20
         [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590
         [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0
         [<ffffffff9e19886d>] ip_output+0x7d/0x260
         [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0
         [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620
         [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0
         [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070
         [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0
         [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700
         [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200
         [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30
         [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240
         [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200
         [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560
         [<ffffffff9e191e65>] ip_rcv+0x295/0x510
         [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020
         [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70
         [<ffffffff9e1306ff>] process_backlog+0x6f/0x230
         [<ffffffff9e132129>] net_rx_action+0x229/0x420
         [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d
         [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30
         [<ffffffff9dafc2f5>] do_softirq+0x55/0x60
         [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0
         [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500
         [<ffffffff9daab333>] start_secondary+0x113/0x140
       to a SOFTIRQ-irq-unsafe lock:
       ... which became SOFTIRQ-irq-unsafe at:
       ...  [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10
         [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
         [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
         [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0
         [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0
         [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200
         [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17
       other info that might help us debug this:
       Chain exists of:
         dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock
        Possible interrupt unsafe locking scenario:
              CPU0                    CPU1
              ----                    ----
                                      lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
           lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
        *** DEADLOCK ***
      Fixes: e19494ed
       ("bpf: introduce percpu_freelist")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 14 Nov, 2017 2 commits
    • Yonghong Song's avatar
      bpf: change helper bpf_probe_read arg2 type to ARG_CONST_SIZE_OR_ZERO · 9c019e2b
      Yonghong Song authored
      The helper bpf_probe_read arg2 type is changed
      from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO to permit
      size-0 buffer. Together with newer ARG_CONST_SIZE_OR_ZERO
      semantics which allows non-NULL buffer with size 0,
      this allows simpler bpf programs with verifier acceptance.
      The previous commit which changes ARG_CONST_SIZE_OR_ZERO semantics
      has details on examples.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Yonghong Song's avatar
      bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics · 9fd29c08
      Yonghong Song authored
      For helpers, the argument type ARG_CONST_SIZE_OR_ZERO permits the
      access size to be 0 when accessing the previous argument (arg).
      Right now, it requires the arg needs to be NULL when size passed
      is 0 or could be 0. It also requires a non-NULL arg when the size
      is proved to be non-0.
      This patch changes verifier ARG_CONST_SIZE_OR_ZERO behavior
      such that for size-0 or possible size-0, it is not required
      the arg equal to NULL.
      There are a couple of reasons for this semantics change, and
      all of them intends to simplify user bpf programs which
      may improve user experience and/or increase chances of
      verifier acceptance. Together with the next patch which
      changes bpf_probe_read arg2 type from ARG_CONST_SIZE to
      ARG_CONST_SIZE_OR_ZERO, the following two examples, which
      fail the verifier currently, are able to get verifier acceptance.
      Example 1:
         unsigned long len = pend - pstart;
         len = len > MAX_PAYLOAD_LEN ? MAX_PAYLOAD_LEN : len;
         len &= MAX_PAYLOAD_LEN;
         bpf_probe_read(data->payload, len, pstart);
      It does not have test for "len > 0" and it failed the verifier.
      Users may not be aware that they have to add this test.
      Converting the bpf_probe_read helper to have
      ARG_CONST_SIZE_OR_ZERO helps the above code get
      verifier acceptance.
      Example 2:
        Here is one example where llvm "messed up" the code and
        the verifier fails.
         unsigned long len = pend - pstart;
         if (len > 0 && len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      The compiler generates the following code and verifier fails:
      39: (79) r2 = *(u64 *)(r10 -16)
      40: (1f) r2 -= r8
      41: (bf) r1 = r2
      42: (07) r1 += -1
      43: (25) if r1 > 0xffe goto pc+3
        R0=inv(id=0) R1=inv(id=0,umax_value=4094,var_off=(0x0; 0xfff))
        R2=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4095,imm=0) R7=inv(id=0)
        R8=inv(id=0) R9=inv0 R10=fp0
      44: (bf) r1 = r6
      45: (bf) r3 = r8
      46: (85) call bpf_probe_read#45
      R2 min value is negative, either use unsigned or 'var &= const'
      The compiler optimization is correct. If r1 = 0,
      r1 - 1 = 0xffffffffffffffff > 0xffe.  If r1 != 0, r1 - 1 will not wrap.
      r1 > 0xffe at insn #43 can actually capture
      both "r1 > 0" and "len <= MAX_PAYLOAD_LEN".
      This however causes an issue in verifier as the value range of arg2
      "r2" does not properly get refined and lead to verification failure.
      Relaxing bpf_prog_read arg2 from ARG_CONST_SIZE to ARG_CONST_SIZE_OR_ZERO
      allows the following simplied code:
         unsigned long len = pend - pstart;
         if (len <= MAX_PAYLOAD_LEN)
           bpf_probe_read(data->payload, len, pstart);
      The llvm compiler will generate less complex code and the
      verifier is able to verify that the program is okay.
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 13 Nov, 2017 3 commits
  7. 12 Nov, 2017 5 commits
    • Rasmus Villemoes's avatar
      genirq: Fix type of shifting literal 1 in __setup_irq() · ffc661c9
      Rasmus Villemoes authored
      If ffz() ever returns a value >= 31 then the following shift is undefined
      behaviour because the literal 1 which gets shifted is treated as signed
      In practice, the bug is probably harmless, since the first undefined shift
      count is 31 which results - ignoring UB - in (int)(0x80000000). This gets
      sign extended so bit 32-63 will be set as well and all subsequent
      __setup_irq() calls would just end up hitting the -EBUSY branch.
      However, a sufficiently aggressive optimizer may use the UB of 1<<31
      to decide that doesn't happen, and hence elide the sign-extension
      code, so that subsequent calls can indeed get ffz > 31.
      In any case, the right thing to do is to make the literal 1UL.
      [ tglx: For this to happen a single interrupt would have to be shared by 32
        	devices. Hardware like that does not exist and would have way more
        	problems than that. ]
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20171030213548.16831-1-linux@rasmusvillemoes.dk
    • Rasmus Villemoes's avatar
      irqdomain: Drop pointless NULL check in virq_debug_show_one · 306eb5a3
      Rasmus Villemoes authored
      data has been already derefenced unconditionally, so it's pointless to do a
      NULL pointer check on it afterwards. Drop it.
      [ tglx: Depersonify changelog. ]
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20171112212904.28574-1-linux@rasmusvillemoes.dk
    • Wen Yaxng's avatar
      genirq/proc: Return proper error code when irq_set_affinity() fails · 6714796e
      Wen Yaxng authored
      write_irq_affinity() returns the number of written bytes, which means
      success, unconditionally whether the actual irq_set_affinity() call
      succeeded or not.
      Add proper error handling and pass the error code returned from
      irq_set_affinity() back to user space in case of failure.
      [ tglx: Fixed coding style and massaged changelog ]
      Signed-off-by: default avatarWen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJiang Biao <jiang.biao2@zte.com.cn>
      Cc: zhong.weidong@zte.com.cn
      Link: https://lkml.kernel.org/r/1510106103-184761-1-git-send-email-wen.yang99@zte.com.cn
    • David Howells's avatar
      timers: Add a function to start/reduce a timer · b24591e2
      David Howells authored
      Add a function, similar to mod_timer(), that will start a timer if it isn't
      running and will modify it if it is running and has an expiry time longer
      than the new time.  If the timer is running with an expiry time that's the
      same or sooner, no change is made.
      The function looks like:
      	int timer_reduce(struct timer_list *timer, unsigned long expires);
      This can be used by code such as networking code to make it easier to share
      a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
      the rxrpc_call struct will maintain a number of timeouts:
      	unsigned long	ack_at;
      	unsigned long	resend_at;
      	unsigned long	ping_at;
      	unsigned long	expect_rx_by;
      	unsigned long	expect_req_by;
      	unsigned long	expect_term_by;
      each of which is set independently of the others.  With timer reduction
      available, when the code needs to set one of the timeouts, it only needs to
      look at that timeout and then call timer_reduce() to modify the timer,
      starting it or bringing it forward if necessary.  There is no need to refer
      to the other timeouts to see which is earliest and no need to take any lock
      other than, potentially, the timer lock inside timer_reduce().
      Note, that this does not protect against concurrent invocations of any of
      the timer functions.
      As an example, the expect_rx_by timeout above, which terminates a call if
      we don't get a packet from the server within a certain time window, would
      be set something like this:
      	unsigned long now = jiffies;
      	unsigned long expect_rx_by = now + packet_receive_timeout;
      	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
      	timer_reduce(&call->timer, expect_rx_by);
      The timer service code (which might, say, be in a work function) would then
      check all the timeouts to see which, if any, had triggered, deal with
      	t = READ_ONCE(call->ack_at);
      	if (time_after_eq(now, t)) {
      		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
      		set_bit(RXRPC_CALL_EV_ACK, &call->events);
      and then restart the timer if necessary by finding the soonest timeout that
      hasn't yet passed and then calling timer_reduce().
      The disadvantage of doing things this way rather than comparing the timers
      each time and calling mod_timer() is that you *will* take timer events
      unless you can finish what you're doing and delete the timer in time.
      The advantage of doing things this way is that you don't need to use a lock
      to work out when the next timer should be set, other than the timer's own
      lock - which you might not have to take.
      [ tglx: Fixed weird formatting and adopted it to pending changes ]
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: keyrings@vger.kernel.org
      Cc: linux-afs@lists.infradead.org
      Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
    • Arnd Bergmann's avatar
      pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday() · df27067e
      Arnd Bergmann authored
      __getnstimeofday() is a rather odd interface, with a number of quirks:
      - The caller may come from NMI context, but the implementation is not NMI safe,
        one way to get there from NMI is
            NMI handler:
              something bad
      - The calling conventions are different from any other timekeeping functions,
        to deal with returning an error code during suspended timekeeping.
      Address the above issues by using a completely different method to get the
      time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
      when timekeeping is suspended: it returns the time at which it got
      suspended. As Thomas Gleixner explained, this is safe, as
      ktime_get_real_fast_ns() does not call into the clocksource driver that
      might be suspended.
      The result can easily be transformed into a timespec structure. Since
      ktime_get_real_fast_ns() was not exported to modules, add the export.
      The pstore behavior for the suspended case changes slightly, as it now
      stores the timestamp at which timekeeping was suspended instead of storing
      a zero timestamp.
      This change is not addressing y2038-safety, that's subject to a more
      complex follow up patch.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Colin Cross <ccross@android.com>
      Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de