1. 07 Mar, 2021 18 commits
    • clockevents: ipipe: connect clock chips to abstract tick device · bf6640d8
      Philippe Gerum authored and Jan Kiszka committed
      Announce all clock event chips as they are registered to the
      out-of-band tick device infrastructure, so that we can interpose on
      key handlers in their descriptors.
    • ipipe: add kernel event notifiers · 48858891
      Philippe Gerum authored and Jan Kiszka committed
      Add the core API for enabling (regular) kernel event notifications to
      a co-kernel running over the head domain. For instance, such a
      co-kernel may need to know when a task is about to be resumed upon
      signal receipt, or when it gets an access fault trap.
      
      This commit adds the client-side API for enabling such notifications
      for classes of events, but does not provide the notification points
      per se, which come later.
    • printk: ipipe: add raw console channel · b61bde48
      Philippe Gerum authored and Jan Kiszka committed
      A raw output handler (.write_raw) is added to the console descriptor
      for writing (short) text output unmodified, without any logging,
      header or preparation whatsoever, usable from any pipeline domain.
      
      The dedicated raw_printk() variant formats the output message then
      passes it on to the handler holding a hard spinlock, irqs off.
      
      This is a very basic debug channel for situations when resorting to
      the fairly complex printk() handling is not an option. Unlike early
      consoles, regular consoles can provide a raw output service past the
      boot sequence. Raw output handlers are typically provided by serial
      console devices.
    • dump_stack: ipipe: make dump_stack() domain-aware · 7a0e3ce8
      Philippe Gerum authored and Jan Kiszka committed
      When dumping a stack backtrace, we neither need nor want to disable
      root stage IRQs over the head stage, where CPU migration can't
      happen.
      
      Conversely, we neither need nor want to disable hard IRQs from the
      head stage, so that latency won't skyrocket either.
    • driver core: ipipe: defer dev_printk() from head domain · 374cff39
      Philippe Gerum authored and Jan Kiszka committed
      Just like printk(), dev_printk() cannot run from the head domain
      and/or with hard IRQs disabled. In such a case, log the output
      directly into the staging buffer we use for printk().
      
      NOTE: when redirected to the buffer, the output does not include the
      dev_printk() header text but is merely sent as-is to the log.
    • printk: ipipe: defer printk() from head domain · 799c0d98
      Philippe Gerum authored and Jan Kiszka committed
      The printk() machinery cannot immediately invoke the console driver(s)
      when called from the head domain, since such driver code belongs to
      the root domain and cannot be shared between domains.
      
      Output issued from the head domain is formatted then logged into a
      staging buffer, and a dedicated virtual IRQ is posted to the root
      domain for notification. When the virtual IRQ handler runs, the
      contents of the staging buffer are flushed to the printk() interface
      anew, which may eventually pass the output on to the console drivers
      from such a context.
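
      The deferral scheme described above can be sketched in user space as
      follows. This is only an illustrative model: sketch_printk(),
      sketch_virq_handler(), the buffers and the on_head_domain flag are
      hypothetical names, not part of the I-pipe API, and dropping output
      when the staging buffer is full is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STAGING_SIZE 256

/* Hypothetical stand-ins for pipeline state; not the real I-pipe API. */
static char staging_buf[STAGING_SIZE];
static size_t staging_len;
static bool virq_pending;              /* models the virtual IRQ posted to root */
static char console_out[STAGING_SIZE]; /* models what the console drivers printed */
static size_t console_len;

/* Console drivers belong to the root domain: only called from there. */
static void console_write(const char *s, size_t n)
{
	assert(console_len + n < sizeof(console_out));
	memcpy(console_out + console_len, s, n);
	console_len += n;
	console_out[console_len] = '\0';
}

/* Log a message; stage it instead of printing when over the head domain. */
void sketch_printk(bool on_head_domain, const char *msg)
{
	size_t n = strlen(msg);

	if (!on_head_domain) {
		console_write(msg, n);
		return;
	}
	if (staging_len + n > sizeof(staging_buf))
		return; /* staging buffer full: this sketch drops the message */
	memcpy(staging_buf + staging_len, msg, n);
	staging_len += n;
	virq_pending = true; /* notify the root domain */
}

/* Virtual IRQ handler, run by the root domain: flush the staged output. */
void sketch_virq_handler(void)
{
	if (!virq_pending)
		return;
	virq_pending = false;
	console_write(staging_buf, staging_len);
	staging_len = 0;
}

/* Accessor for inspecting what reached the console so far. */
const char *sketch_console(void)
{
	return console_out;
}
```

      Output staged from the head domain only becomes visible once the
      root domain has run the handler, which is the ordering the commit
      message describes.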
    • PM / hibernate: ipipe: protect against out-of-band interrupts · d1b692b1
      Philippe Gerum authored and Jan Kiszka committed
      We must not allow out-of-band activity to resume while we are busy
      suspending the devices in the system, until the PM sleep state has
      been fully entered.
      
      Pair existing virtual IRQ disabling calls which only apply to the root
      domain with hard ones.
    • module: ipipe: enable try_module_get() from hard atomic context · 60fcca2f
      Philippe Gerum authored and Jan Kiszka committed
      We might have out-of-band code calling try_module_get() from the head
      domain, or from a section covered by a hard spinlock where the root
      domain must not reschedule. This requires the preemption management
      calls in try_module_get() (and the converse module_put()) to be
      converted to their hard variant.
      
      REVISIT: using try_module_get() from such contexts is questionable,
      client domains should be fixed.
    • KGDB: ipipe: enable debugging over the head domain · 6ede0ff0
      Philippe Gerum authored and Jan Kiszka committed
      Make the KGDB stub runnable over the head domain since we may take
      traps and interrupts from that context too, by converting the locks to
      hard spinlocks.
    • context_tracking: ipipe: do not track over the head domain · 3f0532dd
      Philippe Gerum authored and Jan Kiszka committed
      Context tracking is a prerequisite for FULL_NOHZ, so that the RCU
      subsystem can detect CPU idleness without relying on the (regular)
      timer tick.
      
      Out-of-band activity running over the head domain should by definition
      not be involved in such detection logic, as the root domain has no
      knowledge of what happens - and when - on the head domain whatsoever.
    • lib/smp_processor_id: ipipe: exclude head domain from preemption check · cf2f0f40
      Philippe Gerum authored and Jan Kiszka committed
      There can be no CPU migration from the head stage, however the
      out-of-band code currently running smp_processor_id() might have
      preempted the regular kernel code from within a preemptible section,
      which might cause false positives in the end.
      
      These are the two reasons why we certainly neither need nor want to do
      the preemption check in that case.
    • atomic: ipipe: keep atomic when pipelining IRQs · b61ed2ed
      Philippe Gerum authored and Jan Kiszka committed
      Because of the virtualization of interrupt masking for the regular
      kernel code when the pipeline is enabled, atomic helpers relying on
      common interrupt disabling helpers such as local_irq_save/restore
      pairs would not be atomic anymore, leading to data corruption.
      
      This commit restores true atomicity for the atomic helpers that would
      be otherwise affected by interrupt virtualization.
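
      Why virtual masking is not enough for atomicity can be shown with a
      toy model of the pipeline. Everything here (the toy_* names and the
      counters) is hypothetical and only mimics the behaviour described in
      these commit messages: the head stage sees every IRQ first, the root
      stage's "interrupts off" state is merely a flag, and only the hard
      variant actually masks the CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; none of these names exist in the kernel. */
static bool root_stalled;  /* virtual interrupt flag of the root stage */
static bool hard_irqs_off; /* models the CPU interrupt disable bit */
static int oob_handled;    /* IRQs seen by the head (out-of-band) stage */
static int root_pending;   /* IRQs logged for later in-band delivery */
static int root_handled;   /* IRQs delivered to the root stage */

/* Raise one IRQ. Returns false if the CPU itself is masking it. */
bool toy_raise_irq(void)
{
	if (hard_irqs_off)
		return false;   /* held back by the hardware mask */
	oob_handled++;          /* the head stage always runs first */
	if (root_stalled)
		root_pending++; /* deferred: root is only virtually masked */
	else
		root_handled++;
	return true;
}

/* Virtualized masking: only sets or clears the root stage flag. */
void toy_local_irq_disable(void)
{
	root_stalled = true;
}

void toy_local_irq_enable(void)
{
	assert(root_pending >= 0);
	root_stalled = false;
	root_handled += root_pending; /* replay what was logged meanwhile */
	root_pending = 0;
}

/* Hard masking: nothing runs, not even out-of-band handlers. */
void toy_hard_irq_disable(void)
{
	hard_irqs_off = true;
}

void toy_hard_irq_enable(void)
{
	hard_irqs_off = false;
}
```

      In this model, an IRQ raised after toy_local_irq_disable() still
      reaches the out-of-band handler, so a read-modify-write sequence
      protected that way could be interrupted by out-of-band code; only
      the hard variant keeps it atomic.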
    • preempt: ipipe: add preemption-safe hard_preempt_{enable, disable}() ops · 15c424da
      Philippe Gerum authored and Jan Kiszka committed
      Some inner code of the interrupt pipeline may have to traverse regular
      kernel code which manipulates the preemption count, expecting full
      serialization including with out-of-band contexts.
      
      The hard_preempt_*() variants are substituted for the original
      preempt_{enable, disable}() calls in these cases.
    • genirq: ipipe: protect generic chip against domain preemption · aadb1279
      Philippe Gerum authored and Jan Kiszka committed
      As described in Documentation/ipipe.rst, irq_chip drivers need to be
      specifically adapted for dealing with interrupt pipelining safely.
      
      The basic issue to address is proper serialization between some
      irq_chip handlers which may be called from out-of-band context
      immediately upon receipt of an IRQ, and the rest of the driver which
      may access the same data / IO registers from the regular - in-band -
      context on the same CPU.
      
      This commit converts the generic irq_chip lock to a hard spinlock,
      which ensures such serialization.
    • ipipe: add latency tracer · 9fb9e230
      Philippe Gerum authored and Jan Kiszka committed
      The latency tracer is a variant of ftrace's 'function' tracer
      providing detailed information about the current interrupt state at
      each function entry (i.e. virtual interrupt flag and CPU interrupt
      disable bit). This commit introduces the generic tracer code, which
      builds upon the regular ftrace API.
      
      The arch-specific code should provide ipipe_read_tsc(), a helper
      routine returning a 64-bit monotonic time value for timestamping
      purposes. HAVE_IPIPE_TRACER_SUPPORT should be selected by the
      arch-specific code for enabling the tracer, which in turn makes
      CONFIG_IPIPE_TRACE available from the Kconfig interface.
    • ipipe: add out-of-band tick device · e0cecb8d
      Philippe Gerum authored and Jan Kiszka committed
      The out-of-band tick device manages the timer hardware by interposing
      on selected clockevent handlers transparently, so that a client domain
      (e.g. a co-kernel) eventually controls such hardware for scheduling
      the high-precision timer events it needs to. Those events are
      delivered to out-of-band activities running on the head stage,
      unimpeded by (only virtually) interrupt-free sections of the regular
      kernel code.
      
      This commit introduces the generic API for controlling the out-of-band
      tick device from a co-kernel. It also provides the internal API
      clock event chip drivers should use for enabling high-precision
      timing for their hardware.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
    • locking: ipipe: add hard lock alternative to regular spinlocks · 491dd705
      Philippe Gerum authored and Jan Kiszka committed
      Hard spinlocks manipulate the CPU interrupt mask, without affecting
      the kernel preemption state in locking/unlocking operations.
      
      This type of spinlock is useful for implementing a critical section to
      serialize concurrent accesses from both in-band and out-of-band
      contexts, i.e. from root and head stages.
      
      Hard spinlocks exclusively depend on the pre-existing arch-specific
      bits which implement regular spinlocks. They can be seen as basic
      spinlocks still affecting the CPU's interrupt state when all other
      spinlock types only deal with the virtual interrupt flag managed by
      the pipeline core - i.e. only disable interrupts for the regular
      in-band kernel activity.
    • genirq: add generic I-pipe core · fd239330
      Philippe Gerum authored and Jan Kiszka committed
      This commit provides the arch-independent bits for implementing the
      interrupt pipeline core, a lightweight layer introducing a separate,
      high-priority execution stage for handling all IRQs in pseudo-NMI
      mode, which cannot be delayed by the regular kernel code. See
      Documentation/ipipe.rst for details about interrupt pipelining.
      
      Architectures which support interrupt pipelining should select
      HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific
      code. In such a case, CONFIG_IPIPE becomes available to the user via
      the Kconfig interface for enabling the feature.
  2. 04 Mar, 2021 22 commits
    • Greg Kroah-Hartman's avatar
    • ARM: dts: aspeed: Add LCLK to lpc-snoop · 07c4c2e2
      John Wang authored
      commit d050d049f8b8077025292c1ecf456c4ee7f96861 upstream.
      Signed-off-by: John Wang <wangzhiqiang.bj@bytedance.com>
      Reviewed-by: Joel Stanley <joel@jms.id.au>
      Link: https://lore.kernel.org/r/20201202051634.490-2-wangzhiqiang.bj@bytedance.com
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: qrtr: Fix memory leak in qrtr_tun_open · 39be7b97
      Takeshi Misawa authored
      commit fc0494ead6398609c49afa37bc949b61c5c16b91 upstream.
      
      If qrtr_endpoint_register() failed, tun is leaked.
      Fix this, by freeing tun in error path.
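
      The shape of such an error-path fix can be sketched in plain C. The
      types and functions below (tun_sketch, endpoint_register_sketch(),
      tun_open_sketch()) are hypothetical stand-ins rather than the qrtr
      code, and the allocation counter exists only so the sketch can be
      checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-open tunnel state. */
struct tun_sketch {
	int registered;
};

static int live_allocations; /* bookkeeping for the sketch only */

/* Stand-in for endpoint registration; fails on request. */
static int endpoint_register_sketch(struct tun_sketch *tun, int fail)
{
	if (fail)
		return -EINVAL;
	tun->registered = 1;
	return 0;
}

/* Open path: allocate, register, and free again if registration fails. */
int tun_open_sketch(struct tun_sketch **out, int fail_register)
{
	struct tun_sketch *tun;
	int ret;

	assert(out != NULL);
	tun = calloc(1, sizeof(*tun));
	if (!tun)
		return -ENOMEM;
	live_allocations++;

	ret = endpoint_register_sketch(tun, fail_register);
	if (ret) {
		/* Error path: release tun instead of leaking it. */
		free(tun);
		live_allocations--;
		return ret;
	}

	*out = tun;
	return 0;
}

/* Release path for a successfully opened tunnel. */
void tun_release_sketch(struct tun_sketch *tun)
{
	free(tun);
	live_allocations--;
}
```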
      
      syzbot report:
      BUG: memory leak
      unreferenced object 0xffff88811848d680 (size 64):
        comm "syz-executor684", pid 10171, jiffies 4294951561 (age 26.070s)
        hex dump (first 32 bytes):
          80 dd 0a 84 ff ff ff ff 00 00 00 00 00 00 00 00  ................
          90 d6 48 18 81 88 ff ff 90 d6 48 18 81 88 ff ff  ..H.......H.....
        backtrace:
          [<0000000018992a50>] kmalloc include/linux/slab.h:552 [inline]
          [<0000000018992a50>] kzalloc include/linux/slab.h:682 [inline]
          [<0000000018992a50>] qrtr_tun_open+0x22/0x90 net/qrtr/tun.c:35
          [<0000000003a453ef>] misc_open+0x19c/0x1e0 drivers/char/misc.c:141
          [<00000000dec38ac8>] chrdev_open+0x10d/0x340 fs/char_dev.c:414
          [<0000000079094996>] do_dentry_open+0x1e6/0x620 fs/open.c:817
          [<000000004096d290>] do_open fs/namei.c:3252 [inline]
          [<000000004096d290>] path_openat+0x74a/0x1b00 fs/namei.c:3369
          [<00000000b8e64241>] do_filp_open+0xa0/0x190 fs/namei.c:3396
          [<00000000a3299422>] do_sys_openat2+0xed/0x230 fs/open.c:1172
          [<000000002c1bdcef>] do_sys_open fs/open.c:1188 [inline]
          [<000000002c1bdcef>] __do_sys_openat fs/open.c:1204 [inline]
          [<000000002c1bdcef>] __se_sys_openat fs/open.c:1199 [inline]
          [<000000002c1bdcef>] __x64_sys_openat+0x7f/0xe0 fs/open.c:1199
          [<00000000f3a5728f>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
          [<000000004b38b7ec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 28fb4e59 ("net: qrtr: Expose tunneling endpoint to user space")
      Reported-by: syzbot+5d6e4af21385f5cfc56a@syzkaller.appspotmail.com
      Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
      Link: https://lore.kernel.org/r/20210221234427.GA2140@DESKTOP
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dm era: Update in-core bitset after committing the metadata · 7b518508
      Nikos Tsironis authored
      commit 2099b145d77c1d53f5711f029c37cc537897cee6 upstream.
      
      In case of a system crash, dm-era might fail to mark blocks as written
      in its metadata, although the corresponding writes to these blocks were
      passed down to the origin device and completed successfully.
      
      Consider the following sequence of events:
      
      1. We write to a block that has not yet been written in the current era.
      2. era_map() checks the in-core bitmap for the current era and sees
         that the block is not marked as written.
      3. The write is deferred for submission after the metadata have been
         updated and committed.
      4. The worker thread processes the deferred write
         (process_deferred_bios()) and marks the block as written in the
         in-core bitmap, **before** committing the metadata.
      5. The worker thread starts committing the metadata.
      6. We do more writes that map to the same block as the write of step (1)
      7. era_map() checks the in-core bitmap and sees that the block is marked
         as written, **although the metadata have not been committed yet**.
      8. These writes are passed down to the origin device immediately and the
         device reports them as completed.
      9. The system crashes, e.g., power failure, before the commit from step
         (5) finishes.
      
      When the system recovers and we query the dm-era target for the list of
      written blocks it doesn't report the aforementioned block as written,
      although the writes of step (6) completed successfully.
      
      The issue is that era_map() decides whether or not to defer a write
      based on uncommitted information. The root cause of the bug is that we
      update the in-core bitmap **before** committing the metadata.
      
      Fix this by updating the in-core bitmap **after** successfully
      committing the metadata.
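
      The corrected ordering can be modelled with two bitmaps. This is a
      simplified sketch, not the dm-era implementation: era_sketch and
      both functions are hypothetical, and the commit_ok flag stands in
      for whether the metadata commit completed before a crash or error.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_BLOCKS 8

/* Hypothetical model; not the dm-era data structures. */
struct era_sketch {
	bool in_core[NR_BLOCKS]; /* bitmap consulted by the map path */
	bool on_disk[NR_BLOCKS]; /* committed metadata */
};

/* Map path: true means the write may go straight to the origin device. */
bool era_map_sketch(const struct era_sketch *e, unsigned int block)
{
	assert(block < NR_BLOCKS);
	return e->in_core[block];
}

/*
 * Worker with the fixed ordering: commit the metadata first, and only
 * then mark the block in the in-core bitmap. Returns false when the
 * commit did not complete, leaving the in-core bitmap untouched.
 */
bool era_process_deferred_sketch(struct era_sketch *e, unsigned int block,
				 bool commit_ok)
{
	assert(block < NR_BLOCKS);
	if (!commit_ok)
		return false;
	e->on_disk[block] = true;
	e->in_core[block] = true;
	return true;
}
```

      With this ordering, the map path can never observe a block as
      written unless the committed metadata already records it, so later
      writes to that block keep being deferred until the commit succeeds.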
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: sched: fix police ext initialization · 976ee31e
      Vlad Buslov authored
      commit 396d7f23adf9e8c436dd81a69488b5b6a865acf8 upstream.
      
      When a police action is created by the first conditional in the cls API's
      tcf_exts_validate(), which calls tcf_action_init_1() directly, the action
      idr is not updated according to the latest changes in the action API, which
      require the caller to commit a newly created action to the idr with
      tcf_idr_insert_many(). This results in such an action not being accessible
      through the act API and causes the crash reported by syzbot:
      
      ==================================================================
      BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
      BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
      BUG: KASAN: null-ptr-deref in __tcf_idr_release net/sched/act_api.c:178 [inline]
      BUG: KASAN: null-ptr-deref in tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598
      Read of size 4 at addr 0000000000000010 by task kworker/u4:5/204
      
      CPU: 0 PID: 204 Comm: kworker/u4:5 Not tainted 5.11.0-rc7-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: netns cleanup_net
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:120
       __kasan_report mm/kasan/report.c:400 [inline]
       kasan_report.cold+0x5f/0xd5 mm/kasan/report.c:413
       check_memory_region_inline mm/kasan/generic.c:179 [inline]
       check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
       __tcf_idr_release net/sched/act_api.c:178 [inline]
       tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598
       tc_action_net_exit include/net/act_api.h:151 [inline]
       police_exit_net+0x168/0x360 net/sched/act_police.c:390
       ops_exit_list+0x10d/0x160 net/core/net_namespace.c:190
       cleanup_net+0x4ea/0xb10 net/core/net_namespace.c:604
       process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275
       worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
       kthread+0x3b1/0x4a0 kernel/kthread.c:292
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
      ==================================================================
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 204 Comm: kworker/u4:5 Tainted: G    B             5.11.0-rc7-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: netns cleanup_net
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x107/0x163 lib/dump_stack.c:120
       panic+0x306/0x73d kernel/panic.c:231
       end_report+0x58/0x5e mm/kasan/report.c:100
       __kasan_report mm/kasan/report.c:403 [inline]
       kasan_report.cold+0x67/0xd5 mm/kasan/report.c:413
       check_memory_region_inline mm/kasan/generic.c:179 [inline]
       check_memory_region+0x13d/0x180 mm/kasan/generic.c:185
       instrument_atomic_read include/linux/instrumented.h:71 [inline]
       atomic_read include/asm-generic/atomic-instrumented.h:27 [inline]
       __tcf_idr_release net/sched/act_api.c:178 [inline]
       tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598
       tc_action_net_exit include/net/act_api.h:151 [inline]
       police_exit_net+0x168/0x360 net/sched/act_police.c:390
       ops_exit_list+0x10d/0x160 net/core/net_namespace.c:190
       cleanup_net+0x4ea/0xb10 net/core/net_namespace.c:604
       process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275
       worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
       kthread+0x3b1/0x4a0 kernel/kthread.c:292
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
      Kernel Offset: disabled
      
      Fix the issue by calling tcf_idr_insert_many() after successful action
      initialization.
      
      Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together")
      Reported-by: syzbot+151e3e714d34ae4ce7e8@syzkaller.appspotmail.com
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending · 9875cb3c
      Jason A. Donenfeld authored
      commit ee576c47db60432c37e54b1e2b43a8ca6d3a8dca upstream.
      
      The icmp{,v6}_send functions make all sorts of use of skb->cb, casting
      it with IPCB or IP6CB, assuming the skb to have come directly from the
      inet layer. But when the packet comes from the ndo layer, especially
      when forwarded, there's no telling what might be in skb->cb at that
      point. As a result, the icmp sending code risks reading bogus memory
      contents, which can result in nasty stack overflows such as this one
      reported by a user:
      
          panic+0x108/0x2ea
          __stack_chk_fail+0x14/0x20
          __icmp_send+0x5bd/0x5c0
          icmp_ndo_send+0x148/0x160
      
      In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read
      from it. The optlen parameter there is of particular note, as it can
      induce writes beyond bounds. There are quite a few ways that can happen
      in __ip_options_echo. For example:
      
          // sptr/skb are attacker-controlled skb bytes
          sptr = skb_network_header(skb);
          // dptr/dopt points to stack memory allocated by __icmp_send
          dptr = dopt->__data;
          // sopt is the corrupt skb->cb in question
          if (sopt->rr) {
              optlen  = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data
              soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data
              // this now writes potentially attacker-controlled data,
              // overflowing the stack:
              memcpy(dptr, sptr+sopt->rr, optlen);
          }
      
      In the icmpv6_send case, the story is similar, but not as dire, as only
      IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is
      worse than the iif case, but it is passed to ipv6_find_tlv, which does
      a bit of bounds checking on the value.
      
      This is easy to simulate by doing a `memset(skb->cb, 0x41,
      sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by
      good fortune and the rarity of icmp sending from that context that we've
      avoided reports like this until now. For example, in KASAN:
      
          BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0
          Write of size 38 at addr ffff888006f1f80e by task ping/89
          CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5
          Call Trace:
           dump_stack+0x9a/0xcc
           print_address_description.constprop.0+0x1a/0x160
           __kasan_report.cold+0x20/0x38
           kasan_report+0x32/0x40
           check_memory_region+0x145/0x1a0
           memcpy+0x39/0x60
           __ip_options_echo+0xa0e/0x12b0
           __icmp_send+0x744/0x1700
      
      Actually, out of the 4 drivers that do this, only gtp zeroed the cb for
      the v4 case, while the rest did not. So this commit actually removes the
      gtp-specific zeroing, while putting the code where it belongs in the
      shared infrastructure of icmp{,v6}_ndo_send.
      
      This commit fixes the issue by passing an empty IPCB or IP6CB along to
      the functions that actually do the work. For the icmp_send, this was
      already trivial, thanks to __icmp_send providing the plumbing function.
      For icmpv6_send, this required a tiny bit of refactoring to make it
      behave like the v4 case, after which it was straightforward.
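
      The idea of handing the worker a zeroed options block instead of
      whatever sits in the control buffer can be sketched as follows. The
      structures and function names are hypothetical simplifications and
      do not match the kernel's struct sk_buff, IPCB() or __icmp_send().

```c
#include <assert.h>
#include <string.h>

#define CB_SIZE 48

/* Hypothetical, heavily simplified options and packet types. */
struct opts_sketch {
	unsigned char optlen;
	unsigned char rr;
};

struct skb_sketch {
	unsigned char cb[CB_SIZE]; /* scratch area; contents depend on the layer */
};

/* Worker: trusts the options it is handed and reports the length it would use. */
static unsigned char send_with_opts_sketch(const struct skb_sketch *skb,
					   const struct opts_sketch *opt)
{
	(void)skb;
	return opt->optlen;
}

/* Inet-layer entry point: the options are read from the control block. */
unsigned char icmp_send_sketch(const struct skb_sketch *skb)
{
	struct opts_sketch opt;

	assert(sizeof(opt) <= CB_SIZE);
	memcpy(&opt, skb->cb, sizeof(opt));
	return send_with_opts_sketch(skb, &opt);
}

/* Driver-context entry point: ignore the control block, pass zeroed options. */
unsigned char icmp_ndo_send_sketch(const struct skb_sketch *skb)
{
	struct opts_sketch opt;

	memset(&opt, 0, sizeof(opt));
	return send_with_opts_sketch(skb, &opt);
}
```

      Filling the control block with 0x41, as the commit message suggests
      for reproducing the bug, makes the first entry point act on a bogus
      option length while the second one always sees a length of zero.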
      
      Fixes: a2b78e9b ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs")
      Reported-by: SinYu <liuxyon@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: silence compilation warning for non-IPV6 builds · 354fb724
      Leon Romanovsky authored
      commit 1faba27f11c8da244e793546a1b35a9b1da8208e upstream.
      
      The W=1 compilation of allmodconfig generates the following warning:
      
      net/ipv6/icmp.c:448:6: warning: no previous prototype for 'icmp6_send' [-Wmissing-prototypes]
        448 | void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
            |      ^~~~~~~~~~
      
      Fix it by providing function declaration for builds with ipv6 as a module.
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: icmp6: avoid indirect call for icmpv6_send() · e528edf1
      Eric Dumazet authored
      commit cc7a21b6fbd945f8d8f61422ccd27203c1fafeb7 upstream.
      
      If IPv6 is builtin, we do not need an expensive indirect call
      to reach icmp6_send().
      
      v2: put inline keyword before the type to avoid sparse warnings.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • xfrm: interface: use icmp_ndo_send helper · c30e93ee
      Jason A. Donenfeld authored
      commit 45942ba890e6f35232727a5fa33d732681f4eb9f upstream.
      
      Because xfrmi is calling icmp from network device context, it should use
      the ndo helper so that the rate limiting applies correctly.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sunvnet: use icmp_ndo_send helper · e1ec06b8
      Jason A. Donenfeld authored
      commit 67c9a7e1e3ac491b5df018803639addc36f154ba upstream.
      
      Because sunvnet is calling icmp from network device context, it should use
      the ndo helper so that the rate limiting applies correctly. While we're
      at it, doing the additional route lookup before calling icmp_ndo_send is
      superfluous, since this is the job of the icmp code in the first place.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Shannon Nelson <shannon.nelson@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • gtp: use icmp_ndo_send helper · d8d268ce
      Jason A. Donenfeld authored
      commit e0fce6f945a26d4e953a147fe7ca11410322c9fe upstream.
      
      Because gtp is calling icmp from network device context, it should use
      the ndo helper so that the rate limiting applies correctly.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Harald Welte <laforge@gnumonks.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • icmp: allow icmpv6_ndo_send to work with CONFIG_IPV6=n · dd28e735
      Jason A. Donenfeld authored
      commit a8e41f6033a0c5633d55d6e35993c9e2005d872f upstream.
      
      The icmpv6_send function has long had a static inline implementation
      with an empty body for CONFIG_IPV6=n, so that code calling it doesn't
      need to be ifdef'd. The new icmpv6_ndo_send function, which is intended
      for drivers as a drop-in replacement with an identical function
      signature, should follow the same pattern. Without this patch, drivers
      that used to work with CONFIG_IPV6=n now result in a linker error.
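
      The stub pattern referred to here looks roughly like the following
      sketch. SKETCH_CONFIG_IPV6, icmpv6_ndo_send_sketch() and
      driver_xmit_sketch() are made-up names with a simplified signature;
      leaving the macro undefined models a CONFIG_IPV6=n build.

```c
#include <assert.h>

/* Counts messages actually sent; exists only to make the sketch observable. */
static int sketch_sent;

#ifdef SKETCH_CONFIG_IPV6
/* Feature enabled: a real implementation would build and send the message. */
static inline void icmpv6_ndo_send_sketch(int type)
{
	assert(type >= 0);
	sketch_sent++;
}
#else
/* Feature disabled: empty stub, so callers compile and link without #ifdef. */
static inline void icmpv6_ndo_send_sketch(int type)
{
	(void)type;
}
#endif

/* Driver code calls the helper unconditionally in both configurations. */
int driver_xmit_sketch(void)
{
	icmpv6_ndo_send_sketch(1);
	return sketch_sent;
}
```

      Because the stub has the same signature as the real function, the
      caller stays identical in both configurations and no undefined
      symbol is left for the linker.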
      
      Cc: Chen Zhou <chenzhou10@huawei.com>
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: 0b41713b6066 ("icmp: introduce helper for nat'd source address in network device context")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • icmp: introduce helper for nat'd source address in network device context · 2019554f
      Jason A. Donenfeld authored
      commit 0b41713b606694257b90d61ba7e2712d8457648b upstream.
      
      This introduces a helper function to be called only by network drivers
      that wraps calls to icmp[v6]_send in a conntrack transformation, in case
      NAT has been used. We don't want to pollute the non-driver path, though,
      so we introduce this as a helper to be called by places that actually
      make use of this, as suggested by Florian.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/i915: Reject 446-480MHz HDMI clock on GLK · 0a35ff98
      Ville Syrjälä authored
      commit 7a6c6243b44a439bda4bf099032be35ebcf53406 upstream.
      
      The BXT/GLK DPLL can't generate certain frequencies. We already
      reject the 233-240MHz range on both. But on GLK the DPLL max
      frequency was bumped from 300MHz to 594MHz, so now we get to
      also worry about the 446-480MHz range (double the original
      problem range). Reject any frequency within the higher
      problematic range as well.
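
      The check described above can be sketched as follows. This is a
      hypothetical illustration rather than the driver's code: the names
      demo_range, in_range and demo_hdmi_clock_ok are invented, the limits are
      the round figures quoted in this message (in kHz), and treating the
      range bounds as inclusive is an assumption. The driver's exact
      thresholds may differ.

```c
#include <stdbool.h>

/* A frequency range in kHz; inclusive bounds are an assumption of this sketch. */
struct demo_range { int lo_khz, hi_khz; };

static bool in_range(int clock_khz, struct demo_range r)
{
    return clock_khz >= r.lo_khz && clock_khz <= r.hi_khz;
}

/* Returns true when the DPLL is assumed able to generate the requested clock. */
static bool demo_hdmi_clock_ok(int clock_khz, bool is_glk)
{
    const struct demo_range low_gap  = { 233000, 240000 }; /* BXT and GLK */
    const struct demo_range high_gap = { 446000, 480000 }; /* GLK only */
    int max_khz = is_glk ? 594000 : 300000;

    if (clock_khz > max_khz)
        return false;
    if (in_range(clock_khz, low_gap))
        return false;
    /* The higher gap is only reachable once the maximum exceeds 300 MHz. */
    if (is_glk && in_range(clock_khz, high_gap))
        return false;
    return true;
}
```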
      
      Cc: stable@vger.kernel.org
      Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3000
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210203093044.30532-1-ville.syrjala@linux.intel.com
      Reviewed-by: Mika Kahola <mika.kahola@intel.com>
      (cherry picked from commit 41751b3e5c1ac656a86f8d45a8891115281b729e)
      Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a35ff98
    • dm era: only resize metadata in preresume · 467214dd
      Nikos Tsironis authored
      commit cca2c6aebe86f68103a8615074b3578e854b5016 upstream.
      
      Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
      (inactive) table that will only become active upon resume. That is why
      resize should always be done in terms of resume. Otherwise a load (ctr)
      whose inactive table never becomes active will incorrectly resize the
      metadata.
      
      Also, perform the resize directly in preresume, instead of using the
      worker to do it.
      
      The worker might run other metadata operations, e.g., it could start
      digestion, before resizing the metadata. These operations will end up
      using the old size.
      
      This could lead to errors, like:
      
        device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
        device-mapper: era: process_old_eras: digest step failed, stopping digestion
      
      The reason for the above error is that the worker started the digestion
      of the archived writeset using the old, larger size.
      
      As a result, metadata_digest_transcribe_writeset tried to write beyond
      the end of the era array.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      467214dd
    • dm era: Reinitialize bitset cache before digesting a new writeset · fb898636
      Nikos Tsironis authored
      commit 2524933307fd0036d5c32357c693c021ab09a0b0 upstream.
      
      In case of devices with at most 64 blocks, the digestion of consecutive
      eras uses the writeset of the first era as the writeset of all eras to
      digest, leading to lost writes. That is, we lose the information about
      what blocks were written during the affected eras.
      
      The digestion code uses a dm_disk_bitset object to access the archived
      writesets. This structure includes a one word (64-bit) cache to reduce
      the number of array lookups.
      
      This structure is initialized only once, in metadata_digest_start(),
      when we kick off digestion.
      
      But, when we insert a new writeset into the writeset tree, before the
      digestion of the previous writeset is done, or equivalently when there
      are multiple writesets in the writeset tree to digest, then all these
      writesets are digested using the same cache and the cache is not
      re-initialized when moving from one writeset to the next.
      
      For devices with more than 64 blocks, i.e., the size of the cache, the
      cache is indirectly invalidated when we move to a next set of blocks, so
      we avoid the bug.
      
      But for devices with at most 64 blocks we end up using the same cached
      data for digesting all archived writesets, i.e., the cache is loaded
      when digesting the first writeset and it never gets reloaded, until the
      digestion is done.
      
      As a result, the writeset of the first era to digest is used as the
      writeset of all the following archived eras, leading to lost writes.
      
      Fix this by reinitializing the dm_disk_bitset structure, and thus
      invalidating the cache, every time the digestion code starts digesting a
      new writeset.
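
      The stale-cache behaviour and the fix can be modelled with a small
      stand-alone sketch. It does not use the real dm_disk_bitset API: struct
      demo_bitset_reader, demo_reader_init, demo_test_bit and demo_digest are
      hypothetical names, and each writeset is assumed to fit in a single
      64-bit word (a device with at most 64 blocks).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reader with a one-word cache, keyed only by word index. */
struct demo_bitset_reader {
    bool     cache_valid;
    uint32_t cached_index;  /* index of the cached 64-bit word */
    uint64_t cached_word;
};

static void demo_reader_init(struct demo_bitset_reader *r)
{
    r->cache_valid = false;
    r->cached_index = 0;
    r->cached_word = 0;
}

/* Test one bit; the cache is reloaded only when the word index changes. */
static bool demo_test_bit(struct demo_bitset_reader *r,
                          const uint64_t *words, uint32_t bit)
{
    uint32_t index = bit / 64;

    if (!r->cache_valid || r->cached_index != index) {
        r->cached_word = words[index];
        r->cached_index = index;
        r->cache_valid = true;
    }
    return (r->cached_word >> (bit % 64)) & 1;
}

/* Count the blocks marked as written across all single-word writesets. */
static unsigned demo_digest(const uint64_t *writesets, unsigned nr_writesets,
                            uint32_t nr_blocks, bool reinit_per_writeset)
{
    struct demo_bitset_reader r;
    unsigned marked = 0;

    demo_reader_init(&r);               /* done once, when digestion starts */
    for (unsigned w = 0; w < nr_writesets; w++) {
        if (reinit_per_writeset)
            demo_reader_init(&r);       /* the fix: invalidate per writeset */
        for (uint32_t b = 0; b < nr_blocks; b++)
            marked += demo_test_bit(&r, &writesets[w], b);
    }
    return marked;
}
```

      With nr_blocks at most 64 the word index is always 0, so without the
      per-writeset reinitialization every writeset after the first is read
      from the word cached for the first one.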
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb898636
    • dm era: Use correct value size in equality function of writeset tree · e59b9a84
      Nikos Tsironis authored
      commit 64f2d15afe7b336aafebdcd14cc835ecf856df4b upstream.
      
      Fix the writeset tree equality test function to use the right value size
      when comparing two btree values.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e59b9a84
    • dm era: Fix bitset memory leaks · fead0c8e
      Nikos Tsironis authored
      commit 904e6b266619c2da5c58b5dce14ae30629e39645 upstream.
      
      Deallocate the memory allocated for the in-core bitsets when destroying
      the target and in error paths.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fead0c8e
    • dm era: Verify the data block size hasn't changed · 8ca89085
      Nikos Tsironis authored
      commit c8e846ff93d5eaa5384f6f325a1687ac5921aade upstream.
      
      dm-era doesn't support changing the data block size of existing devices,
      so check explicitly that the requested block size for a new target
      matches the one stored in the metadata.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8ca89085
    • dm era: Recover committed writeset after crash · e8a146ef
      Nikos Tsironis authored
      commit de89afc1e40fdfa5f8b666e5d07c43d21a1d3be0 upstream.
      
      Following a system crash, dm-era fails to recover the committed writeset
      for the current era, leading to lost writes. That is, we lose the
      information about what blocks were written during the affected era.
      
      dm-era assumes that the writeset of the current era is archived when the
      device is suspended. So, when resuming the device, it just moves on to
      the next era, ignoring the committed writeset.
      
      This assumption holds when the device is properly shut down. But, when
      the system crashes, the code that suspends the target never runs, so the
      writeset for the current era is not archived.
      
      There are three issues that cause the committed writeset to get lost:
      
      1. dm-era doesn't load the committed writeset when opening the metadata
      2. The code that resizes the metadata wipes the information about the
         committed writeset (assuming it was loaded at step 1)
      3. era_preresume() starts a new era, without taking into account that
         the current era might not have been archived, due to a system crash.
      
      To fix this:
      
      1. Load the committed writeset when opening the metadata
      2. Fix the code that resizes the metadata to make sure it doesn't wipe
         the loaded writeset
      3. Fix era_preresume() to check for a loaded writeset and archive it,
         before starting a new era.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8a146ef
    • dm writecache: fix writing beyond end of underlying device when shrinking · d8738847
      Mikulas Patocka authored
      commit 4134455f2aafdfeab50cabb4cccb35e916034b93 upstream.
      
      Do not attempt to write any data beyond the end of the underlying data
      device while shrinking it.
      
      The DM writecache device must be suspended when the underlying data
      device is shrunk.

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8738847
    • dm: fix deadlock when swapping to encrypted device · 5233c47c
      Mikulas Patocka authored
      commit a666e5c05e7c4aaabb2c5d58117b0946803d03d2 upstream.
      
      The system would deadlock when swapping to a dm-crypt device. The reason
      is that for each incoming write bio, dm-crypt allocates memory that holds
      encrypted data. These excessive allocations exhaust all the memory and the
      result is either deadlock or OOM trigger.
      
      This patch limits the number of in-flight swap bios, so that the memory
      consumed by dm-crypt is limited. The limit is enforced if the target sets
      the "limit_swap_bios" variable and if the bio has REQ_SWAP set.

      Non-swap bios are not affected because taking the semaphore would cause
      performance degradation.
      
      This is similar to request-based drivers - they will also block when the
      number of requests is over the limit.
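
      The limiting rule can be sketched with a plain counter. This is a
      simplified, single-threaded model under stated assumptions, not the
      patch itself: struct demo_target, struct demo_bio, demo_submit and
      demo_complete are hypothetical names, and where the patch blocks on a
      semaphore this sketch just returns false.

```c
#include <stdbool.h>

struct demo_target {
    bool limit_swap_bios;      /* target opted in to the limit */
    int  swap_bios_available;  /* remaining in-flight swap slots */
};

struct demo_bio {
    bool req_swap;    /* models the REQ_SWAP flag */
    bool holds_slot;  /* set while the bio occupies a swap slot */
};

/* Returns false where the real code would wait for a free slot. */
static bool demo_submit(struct demo_target *t, struct demo_bio *bio)
{
    bio->holds_slot = false;
    if (t->limit_swap_bios && bio->req_swap) {
        if (t->swap_bios_available == 0)
            return false;
        t->swap_bios_available--;
        bio->holds_slot = true;
    }
    /* Non-swap bios never touch the counter, so they pay no extra cost. */
    return true;
}

/* Release the slot when a limited bio completes. */
static void demo_complete(struct demo_target *t, struct demo_bio *bio)
{
    if (bio->holds_slot) {
        t->swap_bios_available++;
        bio->holds_slot = false;
    }
}
```

      Bounding the number of in-flight swap bios bounds the memory held for
      their encrypted copies, which is what breaks the allocation cycle
      described above.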

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5233c47c