1. 17 Dec, 2018 3 commits
  2. 24 Aug, 2018 1 commit
  3. 28 Jul, 2018 2 commits
  4. 25 May, 2018 1 commit
    • Eric Dumazet's avatar
      tcp: purge write queue in tcp_connect_init() · c89d5343
      Eric Dumazet authored
      [ Upstream commit 7f582b24 ]
      
      syzkaller found a reliable way to crash the host, hitting a BUG()
      in __tcp_retransmit_skb()
      
      Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
      in tcp_connect_init() at the point we init snd_una/write_seq.
      
      This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()
      
      kernel BUG at net/ipv4/tcp_output.c:2837!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
      RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
      RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
      RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
      RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
      R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
      R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
      FS:  0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
      CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
       tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
       tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
       tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
       call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
       expire_timers kernel/time/timer.c:1363 [inline]
       __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
       run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
       __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
       invoke_softirq kernel/softirq.c:365 [inline]
       irq_exit+0x1d1/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:525 [inline]
       smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
      
      Fixes: cf60af03
      
       ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c89d5343
  5. 09 Mar, 2018 2 commits
  6. 17 Dec, 2017 1 commit
    • Neal Cardwell's avatar
      tcp: when scheduling TLP, time of RTO should account for current ACK · 241eb29c
      Neal Cardwell authored
      
      [ Upstream commit ed66dfaf ]
      
      Fix the TLP scheduling logic so that when scheduling a TLP probe, we
      ensure that the estimated time at which an RTO would fire accounts for
      the fact that ACKs indicating forward progress should push back RTO
      times.
      
      After the following fix:
      
      df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      
      we had an unintentional behavior change in the following kind of
      scenario: suppose the RTT variance has been very low recently. Then
      suppose we send out a flight of N packets and our RTT is 100ms:
      
      t=0: send a flight of N packets
      t=100ms: receive an ACK for N-1 packets
      
      The response before df92c839 that was:
        -> schedule a TLP for now + RTO_interval
      
      The response after df92c839 is:
        -> schedule a TLP for t=0 + RTO_interval
      
      Since RTO_interval = srtt + RTT_variance, this means that we have
      scheduled a TLP timer at a point in the future that only accounts for
      RTT_variance. If the RTT_variance term is small, this means that the
      timer fires soon.
      
      Before df92c839 this would not happen, because in that code, when
      we receive an ACK for a prefix of flight, we did:
      
          1) Near the top of tcp_ack(), switch from TLP timer to RTO
             at write_queue_head->paket_tx_time + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                         tcp_rearm_rto(sk);
      
          2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
                  if (flag & FLAG_ACKED) {
                         tcp_rearm_rto(sk);
      
          3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
             to TLP at now + RTO_interval:
                  if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                         tcp_schedule_loss_probe(sk);
      
      In df92c839 we removed that 3-phase dance, and instead directly
      set the TLP timer once: we set the TLP timer in cases like this to
      write_queue_head->packet_tx_time + RTO_interval. So if the RTT
      variance is small, then this means that this is setting the TLP timer
      to fire quite soon. This means if the ACK for the tail of the flight
      takes longer than an RTT to arrive (often due to delayed ACKs), then
      the TLP timer fires too quickly.
      
      Fixes: df92c839
      
       ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      241eb29c
  7. 03 Nov, 2017 1 commit
  8. 01 Nov, 2017 1 commit
  9. 28 Oct, 2017 1 commit
  10. 26 Oct, 2017 1 commit
  11. 23 Oct, 2017 1 commit
    • Koichiro Den's avatar
      tcp: do tcp_mstamp_refresh before retransmits on TSQ handler · 3a91d29f
      Koichiro Den authored
      When retransmission on TSQ handler was introduced in the commit
      f9616c35 ("tcp: implement TSQ for retransmits"), the retransmitted
      skbs' timestamps were updated on the actual transmission. In the later
      commit 385e2070 ("tcp: use tp->tcp_mstamp in output path"), it stops
      being done so. In the commit, the comment says "We try to refresh
      tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
      tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
      it's rare enough.
      
      About the former, even though possible retransmissions on the tasklet
      comes just after the destructor run in NET_RX softirq handling, the time
      between them could be nonnegligibly large to the extent that
      tcp_rack_advance or rto rearming be affected if other (remaining) RX,
      BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.
      
      So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
      ensures the accuracy of algorithms relying on it.
      
      Fixes: 385e2070
      
       ("tcp: use tp->tcp_mstamp in output path")
      Signed-off-by: default avatarKoichiro Den <den@klaipeden.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3a91d29f
  12. 19 Sep, 2017 2 commits
    • Eric Dumazet's avatar
      tcp: fastopen: fix on syn-data transmit failure · b5b7db8d
      Eric Dumazet authored
      Our recent change exposed a bug in TCP Fastopen Client that syzkaller
      found right away [1]
      
      When we prepare skb with SYN+DATA, we attempt to transmit it,
      and we update socket state as if the transmit was a success.
      
      In socket RTX queue we have two skbs, one with the SYN alone,
      and a second one containing the DATA.
      
      When (malicious) ACK comes in, we now complain that second one had no
      skb_mstamp.
      
      The proper fix is to make sure that if the transmit failed, we do not
      pretend we sent the DATA skb, and make it our send_head.
      
      When 3WHS completes, we can now send the DATA right away, without having
      to wait for a timeout.
      
      [1]
      WARNING: CPU: 0 PID: 100189 at net/ipv4/tcp_input.c:3117 tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117()
      
       WARN_ON_ONCE(last_ackt == 0);
      
      Modules linked in:
      CPU: 0 PID: 100189 Comm: syz-executor1 Not tainted
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       0000000000000000 ffff8800b35cb1d8 ffffffff81cad00d 0000000000000000
       ffffffff828a4347 ffff88009f86c080 ffffffff8316eb20 0000000000000d7f
       ffff8800b35cb220 ffffffff812c33c2 ffff8800baad2440 00000009d46575c0
      Call Trace:
       [<ffffffff81cad00d>] __dump_stack
       [<ffffffff81cad00d>] dump_stack+0xc1/0x124
       [<ffffffff812c33c2>] warn_slowpath_common+0xe2/0x150
       [<ffffffff812c361e>] warn_slowpath_null+0x2e/0x40
       [<ffffffff828a4347>] tcp_clean_rtx_queue+0x2057/0x2ab0 n
       [<ffffffff828ae6fd>] tcp_ack+0x151d/0x3930
       [<ffffffff828baa09>] tcp_rcv_state_process+0x1c69/0x4fd0
       [<ffffffff828efb7f>] tcp_v4_do_rcv+0x54f/0x7c0
       [<ffffffff8258aacb>] sk_backlog_rcv
       [<ffffffff8258aacb>] __release_sock+0x12b/0x3a0
       [<ffffffff8258ad9e>] release_sock+0x5e/0x1c0
       [<ffffffff8294a785>] inet_wait_for_connect
       [<ffffffff8294a785>] __inet_stream_connect+0x545/0xc50
       [<ffffffff82886f08>] tcp_sendmsg_fastopen
       [<ffffffff82886f08>] tcp_sendmsg+0x2298/0x35a0
       [<ffffffff82952515>] inet_sendmsg+0xe5/0x520
       [<ffffffff8257152f>] sock_sendmsg_nosec
       [<ffffffff8257152f>] sock_sendmsg+0xcf/0x110
      
      Fixes: 8c72c65b ("tcp: update skb->skb_mstamp more carefully")
      Fixes: 783237e8
      
       ("net-tcp: Fast Open client - sending SYN-data")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b5b7db8d
    • Yuchung Cheng's avatar
      tcp: remove two unused functions · 4c712441
      Yuchung Cheng authored
      remove tcp_may_send_now and tcp_snd_test that are no longer used
      
      Fixes: 840a3cbe
      
       ("tcp: remove forward retransmit feature")
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4c712441
  13. 16 Sep, 2017 1 commit
  14. 15 Sep, 2017 1 commit
    • Eric Dumazet's avatar
      tcp: update skb->skb_mstamp more carefully · 8c72c65b
      Eric Dumazet authored
      liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
      in tcp_probe_timer() :
            https://www.spinics.net/lists/netdev/msg454496.html
      
      
      
      After investigations, the root cause of the problem is that we update
      skb->skb_mstamp of skbs in write queue, even if the attempt to send a
      clone or copy of it failed. One reason being a routing problem.
      
      This patch prevents this, solving liujian issue.
      
      It also removes a potential RTT miscalculation, since
      __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
      TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
      been changed.
      
      A future ACK would then lead to a very small RTT sample and min_rtt
      would then be lowered to this too small value.
      
      Tested:
      
      # cat user_timeout.pkt
      --local_ip=192.168.102.64
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`
      
         +0 < S 0:0(0) win 0 <mss 1460>
         +0 > S. 0:0(0) ack 1 <mss 1460>
      
        +.1 < . 1:1(0) ack 1 win 65530
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
         +0 write(4, ..., 24) = 24
         +0 > P. 1:25(24) ack 1 win 29200
         +.1 < . 1:1(0) ack 25 win 65530
      
      //change the ipaddress
         +1 `ifconfig tun0 192.168.0.10/16`
      
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
      
         +0 `ifconfig tun0 192.168.102.64/16`
         +0 < . 1:2(1) ack 25 win 65530
         +0 `ifconfig tun0 192.168.0.10/16`
      
         +3 write(4, ..., 24) = -1
      
      # ./packetdrill user_timeout.pkt
      Signed-off-by: default avatarEric Dumazet <edumazet@googl.com>
      Reported-by: default avatarliujian <liujian56@huawei.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8c72c65b
  15. 30 Aug, 2017 1 commit
  16. 09 Aug, 2017 1 commit
  17. 03 Aug, 2017 2 commits
    • Neal Cardwell's avatar
      tcp: fix xmit timer to only be reset if data ACKed/SACKed · df92c839
      Neal Cardwell authored
      Fix a TCP loss recovery performance bug raised recently on the netdev
      list, in two threads:
      
      (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
      (ii) July 26, 2017: netdev thread:
           "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
           outstanding TLP retransmission"
      
      The basic problem is that incoming TCP packets that did not indicate
      forward progress could cause the xmit timer (TLP or RTO) to be rearmed
      and pushed back in time. In certain corner cases this could result in
      the following problems noted in these threads:
      
       - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
         could cause TCP to repeatedly schedule TLPs forever. We kept
         sending TLPs after every ~200ms, which elicited bogus SACKs, which
         caused more TLPs, ad infinitum; we never fired an RTO to fill in
         the holes.
      
       - Incoming data segments could, in some cases, cause us to reschedule
         our RTO or TLP timer further out in time, for no good reason. This
         could cause repeated inbound data to result in stalls in outbound
         data, in the presence of packet loss.
      
      This commit fixes these bugs by changing the TLP and RTO ACK
      processing to:
      
       (a) Only reschedule the xmit timer once per ACK.
      
       (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
           ACK indicates sufficient forward progress (a packet was
           cumulatively ACKed, or we got a SACK for a packet that was sent
           before the most recent retransmit of the write queue head).
      
      This brings us back into closer compliance with the RFCs, since, as
      the comment for tcp_rearm_rto() notes, we should only restart the RTO
      timer after forward progress on the connection. Previously we were
      restarting the xmit timer even in these cases where there was no
      forward progress.
      
      As a side benefit, this commit simplifies and speeds up the TCP timer
      arming logic. We had been calling inet_csk_reset_xmit_timer() three
      times on normal ACKs that cumulatively acknowledged some data:
      
      1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
              if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                     tcp_rearm_rto(sk);
      
      2) Once in tcp_clean_rtx_queue(), to update the RTO:
              if (flag & FLAG_ACKED) {
                     tcp_rearm_rto(sk);
      
      3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
         to TLP:
              if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                     tcp_schedule_loss_probe(sk);
      
      This commit, by only rescheduling the xmit timer once per ACK,
      simplifies the code and reduces CPU overhead.
      
      This commit was tested in an A/B test with Google web server
      traffic. SNMP stats and request latency metrics were within noise
      levels, substantiating that for normal web traffic patterns this is a
      rare issue. This commit was also tested with packetdrill tests to
      verify that it fixes the timer behavior in the corner cases discussed
      in the netdev threads mentioned above.
      
      This patch is a bug fix patch intended to be queued for -stable
      relases.
      
      Fixes: 6ba8a3b1
      
       ("tcp: Tail loss probe (TLP)")
      Reported-by: default avatarKlavs Klavsen <kl@vsen.dk>
      Reported-by: default avatarMao Wenan <maowenan@huawei.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      df92c839
    • Neal Cardwell's avatar
      tcp: enable xmit timer fix by having TLP use time when RTO should fire · a2815817
      Neal Cardwell authored
      Have tcp_schedule_loss_probe() base the TLP scheduling decision based
      on when the RTO *should* fire. This is to enable the upcoming xmit
      timer fix in this series, where tcp_schedule_loss_probe() cannot
      assume that the last timer installed was an RTO timer (because we are
      no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
      ACK). So tcp_schedule_loss_probe() must independently figure out when
      an RTO would want to fire.
      
      In the new TLP implementation following in this series, we cannot
      assume that icsk_timeout was set based on an RTO; after processing a
      cumulative ACK the icsk_timeout we see can be from a previous TLP or
      RTO. So we need to independently recalculate the RTO time (instead of
      reading it out of icsk_timeout). Removing this dependency on the
      nature of icsk_timeout makes things a little easier to reason about
      anyway.
      
      Note that the old and new code should be equivalent, since they are
      both saying: "if the RTO is in the future, but at an earlier time than
      the normal TLP time, then set the TLP timer to fire when the RTO would
      have fired".
      
      Fixes: 6ba8a3b1
      
       ("tcp: Tail loss probe (TLP)")
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNandita Dukkipati <nanditad@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a2815817
  18. 31 Jul, 2017 1 commit
  19. 30 Jul, 2017 1 commit
    • Arnd Bergmann's avatar
      tcp: avoid bogus gcc-7 array-bounds warning · efe967cd
      Arnd Bergmann authored
      When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
      false-positive warning:
      
      net/ipv4/tcp_output.c: In function 'tcp_connect':
      net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
         tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
                                              ^~
      net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
         tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      
      I have opened a gcc bug for this, but distros have already shipped
      compilers with this problem, and it's not clear yet whether there is
      a way for gcc to avoid the warning. As the problem is related to the
      bitfield access, this introduces a temporary variable to store the old
      enum value.
      
      I did not notice this warning earlier, since UBSAN is disabled when
      building with COMPILE_TEST, and that was always turned on in both
      allmodconfig and randconfig tests.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      efe967cd
  20. 19 Jul, 2017 1 commit
  21. 04 Jul, 2017 1 commit
  22. 01 Jul, 2017 5 commits
  23. 08 Jun, 2017 3 commits
  24. 02 Jun, 2017 1 commit
  25. 17 May, 2017 4 commits