1. 15 Aug, 2013 1 commit
  2. 30 Jul, 2013 1 commit
  3. 25 Jul, 2013 1 commit
    • Eric Dumazet's avatar
      tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet authored
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or being awaken in a blocking sendmsg())
      This patch adds two ways to set the limit :
      1) Per socket option TCP_NOTSENT_LOWAT
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      This changes poll()/select()/epoll() to report POLLOUT
      only if number of unsent bytes is below tp->nosent_lowat
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H -t omni -l 20 -c -i10,3
      OMNI Send TEST from ( port 0 AF_INET to () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
       Performance counter stats for './netperf -H -t omni -l 20 -c -i10,3':
                 412,514 context-switches
           200.034645535 seconds time elapsed
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H -t omni -l 20 -c -i10,3
      OMNI Send TEST from ( port 0 AF_INET to () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
       Performance counter stats for './netperf -H -t omni -l 20 -c -i10,3':
               2,675,818 context-switches
           200.029651391 seconds time elapsed
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 23 Jul, 2013 1 commit
  5. 11 Jul, 2013 2 commits
  6. 11 Jun, 2013 1 commit
  7. 11 May, 2013 1 commit
    • Eric Dumazet's avatar
      ipv6: do not clear pinet6 field · f77d6021
      Eric Dumazet authored
      We have seen multiple NULL dereferences in __inet6_lookup_established()
      After analysis, I found that inet6_sk() could be NULL while the
      check for sk_family == AF_INET6 was true.
      Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
      and TCP stacks.
      Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
      table, we no longer can clear pinet6 field.
      This patch extends logic used in commit fcbdf09d
      ("net: fix nulls list corruptions in sk_prot_alloc")
      TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
      to make sure we do not clear pinet6 field.
      At socket clone phase, we do not really care, as cloning the parent (non
      NULL) pinet6 is not adding a fatal race.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 29 Apr, 2013 1 commit
  9. 07 Apr, 2013 1 commit
  10. 18 Mar, 2013 1 commit
    • Eric Dumazet's avatar
      tcp: dont handle MTU reduction on LISTEN socket · 0d4f0608
      Eric Dumazet authored
      When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
      LISTEN socket, and this socket is currently owned by the user, we
      set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.
      This is bad because if we clone the parent before it had a chance to
      clear the flag, the child inherits the tsq_flags value, and next
      tcp_release_cb() on the child will decrement sk_refcnt.
      Result is that we might free a live TCP socket, as reported by
      IPv4: Attempt to release TCP socket in state 1
      Fix this issue by testing sk_state against TCP_LISTEN early, so that we
      set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)
      This bug was introduced in commit 563d34d0
      (tcp: dont drop MTU reduction indications)
      Reported-by: default avatardormando <dormando@rydia.net>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 17 Mar, 2013 1 commit
    • Christoph Paasch's avatar
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch authored
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 13 Feb, 2013 1 commit
    • Andrey Vagin's avatar
      tcp: send packets with a socket timestamp · ee684b6f
      Andrey Vagin authored
      A socket timestamp is a sum of the global tcp_time_stamp and
      a per-socket offset.
      A socket offset is added in places where externally visible
      tcp timestamp option is parsed/initialized.
      Connections in the SYN_RECV state are not supported, global
      tcp_time_stamp is used for them, because repair mode doesn't support
      this state. In a future it can be implemented by the similar way
      as for TIME_WAIT sockets.
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 04 Feb, 2013 1 commit
  14. 23 Jan, 2013 1 commit
    • Tom Herbert's avatar
      soreuseport: TCP/IPv6 implementation · 5ba24953
      Tom Herbert authored
      Motivation for soreuseport would be something like a web server
      binding to port 80 running with multiple threads, where each thread
      might have it's own listener socket.  This could be done as an
      alternative to other models: 1) have one listener thread which
      dispatches completed connections to workers. 2) accept on a single
      listener socket from multiple threads.  In case #1 the listener thread
      can easily become the bottleneck with high connection turn-over rate.
      In case #2, the proportion of connections accepted per thread tends
      to be uneven under high connection load (assuming simple event loop:
      while (1) { accept(); process() }, wakeup does not promote fairness
      among the sockets.  We have seen the  disproportion to be as high
      as 3:1 ratio between thread accepting most connections and the one
      accepting the fewest.  With so_reusport the distribution is
      Signed-off-by: default avatarTom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 14 Jan, 2013 1 commit
  16. 07 Jan, 2013 1 commit
  17. 14 Dec, 2012 1 commit
    • Christoph Paasch's avatar
      inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock · e337e24d
      Christoph Paasch authored
      If in either of the above functions inet_csk_route_child_sock() or
      __inet_inherit_port() fails, the newsk will not be freed:
      unreferenced object 0xffff88022e8a92c0 (size 1592):
        comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
        hex dump (first 32 bytes):
          0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
          02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
          [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
          [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
          [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
          [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
          [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
          [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
          [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
          [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
          [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
          [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
          [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
          [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
          [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
          [<ffffffff814cee68>] ip_rcv+0x217/0x267
          [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
          [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
      This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
      a single sock_put() is not enough to free the memory. Additionally, things
      like xfrm, memcg, cookie_values,... may have been initialized.
      We have to free them properly.
      This is fixed by forcing a call to tcp_done(), ending up in
      inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
      because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
      Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
      force it entering inet_csk_destroy_sock. To avoid the warning in
      inet_csk_destroy_sock, inet_num has to be set to 0.
      As inet_csk_destroy_sock does a dec on orphan_count, we first have to
      increase it.
      Calling tcp_done() allows us to remove the calls to
      tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
      A similar approach is taken for dccp by calling dccp_done().
      This is in the kernel since 093d2823 (tproxy: fix hash locking issue
      when using port redirection in __inet_inherit_port()), thus since
      version >= 2.6.37.
      Signed-off-by: default avatarChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 22 Nov, 2012 1 commit
  19. 15 Nov, 2012 4 commits
  20. 03 Nov, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet authored
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      TCP_INFO later gives wrong information to user, claiming imaginary
      Decouple req->retrans field into two independent fields :
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  21. 23 Oct, 2012 1 commit
  22. 12 Oct, 2012 1 commit
    • Alexey Kuznetsov's avatar
      tcp: resets are misrouted · 4c675258
      Alexey Kuznetsov authored
      After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
      sock case").. tcp resets are always lost, when routing is asymmetric.
      Yes, backing out that patch will result in misrouting of resets for
      dead connections which used interface binding when were alive, but we
      actually cannot do anything here.  What's died that's died and correct
      handling normal unbound connections is obviously a priority.
      Comment to comment:
      > This has few benefits:
      >   1. tcp_v6_send_reset already did that.
      It was done to route resets for IPv6 link local addresses. It was a
      mistake to do so for global addresses. The patch fixes this as well.
      Actually, the problem appears to be even more serious than guaranteed
      loss of resets.  As reported by Sergey Soloviev <sol@eqv.ru>, those
      misrouted resets create a lot of arp traffic and huge amount of
      unresolved arp entires putting down to knees NAT firewalls which use
      asymmetric routing.
      Signed-off-by: default avatarAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
  23. 01 Oct, 2012 1 commit
  24. 22 Sep, 2012 2 commits
    • Neal Cardwell's avatar
      tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS · 016818d0
      Neal Cardwell authored
      When taking SYNACK RTT samples for servers using TCP Fast Open, fix
      the code to ensure that we only call tcp_valid_rtt_meas() after we
      receive the ACK that completes the 3-way handshake.
      Previously we were always taking an RTT sample in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
      to take the RTT sample.
      To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
      before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
      already ensures that we only take a SYNACK RTT sample if snt_synack is
      non-zero. To be careful, we only take a snt_synack timestamp when
      a SYNACK transmit or retransmit succeeds.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Neal Cardwell's avatar
      tcp: extract code to compute SYNACK RTT · 623df484
      Neal Cardwell authored
      In preparation for adding another spot where we compute the SYNACK
      RTT, extract this code so that it can be shared.
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 05 Sep, 2012 1 commit
  26. 01 Sep, 2012 1 commit
    • Jerry Chu's avatar
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu authored
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      5. supporting TCP_FASTOPEN socket option
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      7. supporting TCP's TFO cookie option
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: default avatarH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  27. 20 Aug, 2012 1 commit
  28. 15 Aug, 2012 1 commit
  29. 10 Aug, 2012 1 commit
    • Eric Dumazet's avatar
      net: tcp: ipv6_mapped needs sk_rx_dst_set method · 63d02d15
      Eric Dumazet authored
      commit 5d299f3d (net: ipv6: fix TCP early demux) added a
      regression for ipv6_mapped case.
      [   67.422369] SELinux: initialized (dev autofs, type autofs), uses
      [   67.449678] SELinux: initialized (dev autofs, type autofs), uses
      [   92.631060] BUG: unable to handle kernel NULL pointer dereference at
      [   92.631435] IP: [<          (null)>]           (null)
      [   92.631645] PGD 0
      [   92.631846] Oops: 0010 [#1] SMP
      [   92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror
      dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp
      parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event
      snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm
      snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore
      snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd
      [   92.634294] CPU 0
      [   92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3
      [   92.634294] RIP: 0010:[<0000000000000000>]  [<          (null)>]
      [   92.634294] RSP: 0018:ffff880245fc7cb0  EFLAGS: 00010282
      [   92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX:
      [   92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI:
      [   92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09:
      [   92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12:
      [   92.634294] R13: ffff880254735380 R14: 0000000000000000 R15:
      [   92.634294] FS:  00007f4516ccd6f0(0000) GS:ffff880256600000(0000)
      [   92.634294] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4:
      [   92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      [   92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      [   92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000,
      task ffff880254b8cac0)
      [   92.634294] Stack:
      [   92.634294]  ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8
      [   92.634294]  ffffffff81385083 00000000001d2680 ffff8802547353a8
      [   92.634294]  ffffffff8105903a ffff88024827ad60 0000000000000002
      [   92.634294] Call Trace:
      [   92.634294]  [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa
      [   92.634294]  [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6
      [   92.634294]  [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1
      [   92.634294]  [<ffffffff81059073>] ? local_clock+0x2b/0x3c
      [   92.634294]  [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670
      [   92.634294]  [<ffffffff8133278e>] release_sock+0x128/0x1bd
      [   92.634294]  [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25
      [   92.634294]  [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f
      [   92.634294]  [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b
      [   92.634294]  [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b
      [   92.634294]  [<ffffffff8132e8cf>] sys_connect+0x78/0x9e
      [   92.634294]  [<ffffffff813fd407>] ? sysret_check+0x1b/0x56
      [   92.634294]  [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8
      [   92.634294]  [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [   92.634294]  [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b
      [   92.634294] Code:  Bad RIP value.
      [   92.634294] RIP  [<          (null)>]           (null)
      [   92.634294]  RSP <ffff880245fc7cb0>
      [   92.634294] CR2: 0000000000000000
      [   92.648982] ---[ end trace 24e2bed94314c8d9 ]---
      [   92.649146] Kernel panic - not syncing: Fatal exception in interrupt
      Fix this using inet_sk_rx_dst_set(), and export this function in case
      IPv6 is modular.
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  30. 09 Aug, 2012 1 commit
    • Eric Dumazet's avatar
      time: jiffies_delta_to_clock_t() helper to the rescue · a399a805
      Eric Dumazet authored
      Various /proc/net files sometimes report crazy timer values, expressed
      in clock_t units.
      This happens when an expired timer delta (expires - jiffies) is passed
      to jiffies_to_clock_t().
      This function has an overflow in :
      return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
      commit cbbc719f (time: Change jiffies_to_clock_t() argument type
      to unsigned long) only got around the problem.
      As we cant output negative values in /proc/net/tcp without breaking
      various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
      that caps the negative delta to a 0 value.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: hank <pyu@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  31. 06 Aug, 2012 1 commit
  32. 01 Aug, 2012 2 commits
  33. 26 Jul, 2012 1 commit
  34. 23 Jul, 2012 1 commit
    • Eric Dumazet's avatar
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet authored
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>