1. 17 Nov, 2016 1 commit
    • Xin Long's avatar
      sctp: use new rhlist interface on sctp transport rhashtable · 7fda702f
      Xin Long authored
      
      
      Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
      to hash a node to one chain. If in one host thousands of assocs connect
      to one server with the same lport and different laddrs (although it's
      not a normal case), all the transports would be hashed into the same
      chain.
      
      It may cause to keep returning -EBUSY when inserting a new node, as the
      chain is too long and sctp inserts a transport node in a loop, which
      could even lead to system hangs there.
      
      The new rhlist interface works for this case that there are many nodes
      with the same key in one chain. It puts them into a list then makes this
      list be as a node of the chain.
      
      This patch is to replace rhashtable_ interface with rhltable_ interface.
      Since a chain would not be too long and it would not return -EBUSY with
      this fix when inserting a node, the reinsert loop is also removed here.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7fda702f
  2. 13 Oct, 2016 2 commits
  3. 30 Sep, 2016 2 commits
    • Xin Long's avatar
      sctp: remove prsctp_param from sctp_chunk · 0605483f
      Xin Long authored
      Now sctp uses chunk->prsctp_param to save the prsctp param for all the
      prsctp polices, we didn't need to introduce prsctp_param to sctp_chunk.
      We can just use chunk->sinfo.sinfo_timetolive for RTX and BUF polices,
      and reuse msg->expires_at for TTL policy, as the prsctp polices and old
      expires policy are mutual exclusive.
      
      This patch is to remove prsctp_param from sctp_chunk, and reuse msg's
      expires_at for TTL and chunk's sinfo.sinfo_timetolive for RTX and BUF
      polices.
      
      Note that sctp can't use chunk's sinfo.sinfo_timetolive for TTL policy,
      as it needs a u64 variables to save the expires_at time.
      
      This one also fixes the "netperf-Throughput_Mbps -37.2% regression"
      issue.
      
      Fixes: a6c2f792
      
       ("sctp: implement prsctp TTL policy")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0605483f
    • Xin Long's avatar
      sctp: move sent_count to the memory hole in sctp_chunk · 73dca124
      Xin Long authored
      Now pahole sctp_chunk, it has 2 memory holes:
         struct sctp_chunk {
      	struct list_head           list;
      	atomic_t                   refcnt;
      	/* XXX 4 bytes hole, try to pack */
      	...
      	long unsigned int          prsctp_param;
      	int                        sent_count;
      	/* XXX 4 bytes hole, try to pack */
      
      This patch is to move up sent_count to fill the 1st one and eliminate
      the 2nd one.
      
      It's not just another struct compaction, it also fixes the "netperf-
      Throughput_Mbps -37.2% regression" issue when overloading the CPU.
      
      Fixes: a6c2f792
      
       ("sctp: implement prsctp TTL policy")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      73dca124
  4. 19 Sep, 2016 2 commits
  5. 14 Jul, 2016 3 commits
    • Marcelo Ricardo Leitner's avatar
      sctp: avoid identifying address family many times for a chunk · e7487c86
      Marcelo Ricardo Leitner authored
      
      
      Identifying address family operations during rx path is not something
      expensive but it's ugly to the eye to have it done multiple times,
      specially when we already validated it during initial rx processing.
      
      This patch takes advantage of the now shared sctp_input_cb and make the
      pointer to the operations readily available.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7487c86
    • Marcelo Ricardo Leitner's avatar
      sctp: allow GSO frags to access the chunk too · 1f45f78f
      Marcelo Ricardo Leitner authored
      SCTP will try to access original IP headers on sctp_recvmsg in order to
      copy the addresses used. There are also other places that do similar access
      to IP or even SCTP headers. But after 90017acc ("sctp: Add GSO
      support") they aren't always there because they are only present in the
      header skb.
      
      SCTP handles the queueing of incoming data by cloning the incoming skb
      and limiting to only the relevant payload. This clone has its cb updated
      to something different and it's then queued on socket rx queue. Thus we
      need to fix this in two moments.
      
      For rx path, not related to socket queue yet, this patch uses a
      partially copied sctp_input_cb to such GSO frags. This restores the
      ability to access the headers for this part of the code.
      
      Regarding the socket rx queue, it removes iif member from sctp_event and
      also add a chunk pointer on it.
      
      With these changes we're always able to reach the headers again.
      
      The biggest change here is that now the sctp_chunk struct and the
      original skb are only freed after the application consumed the buffer.
      Note however that the original payload was already like this due to the
      skb cloning.
      
      For iif, SCTP's IPv4 code doesn't use it, so no change is necessary.
      IPv6 now can fetch it directly from original's IPv6 CB as the original
      skb is still accessible.
      
      In the future we probably can simplify sctp_v*_skb_iif() stuff, as
      sctp_v4_skb_iif() was called but it's return value not used, and now
      it's not even called, but such cleanup is out of scope for this change.
      
      Fixes: 90017acc
      
       ("sctp: Add GSO support")
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1f45f78f
    • Marcelo Ricardo Leitner's avatar
      sctp: allow others to use sctp_input_cb · 9e238323
      Marcelo Ricardo Leitner authored
      
      
      We process input path in other files too and having access to it is
      nice, so move it to a header where it's shared.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9e238323
  6. 11 Jul, 2016 4 commits
    • Xin Long's avatar
      sctp: implement prsctp PRIO policy · 8dbdf1f5
      Xin Long authored
      
      
      prsctp PRIO policy is a policy to abandon lower priority chunks when
      asoc doesn't have enough snd buffer, so that the current chunk with
      higher priority can be queued successfully.
      
      Similar to TTL/RTX policy, we will set the priority of the chunk to
      prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy().
      So if PRIO policy is enabled, msg->expire_at won't work.
      
      asoc->sent_cnt_removable will record how many chunks can be checked to
      remove. If priority policy is enabled, when the chunk is queued into
      the out_queue, we will increase sent_cnt_removable. When the chunk is
      moved to abandon_queue or dequeue and free, we will decrease
      sent_cnt_removable.
      
      In sctp_sendmsg, we will check if there is enough snd buffer for current
      msg and if sent_cnt_removable is not 0. Then try to abandon chunks in
      sctp_prune_prsctp when sendmsg from the retransmit/transmited queue, and
      free chunks from out_queue in right order until the abandon+free size >
      msg_len - sctp_wfree. For the abandon size, we have to wait until it
      sends FORWARD TSN, receives the sack and the chunks are really freed.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8dbdf1f5
    • Xin Long's avatar
      sctp: implement prsctp TTL policy · a6c2f792
      Xin Long authored
      
      
      prsctp TTL policy is a policy to abandon chunks when they expire
      at the specific time in local stack. It's similar with expires_at
      in struct sctp_datamsg.
      
      This patch uses sinfo->sinfo_timetolive to set the specific time for
      TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at.
      So if prsctp_enable or TTL policy is not enabled, msg->expires_at
      still works as before.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a6c2f792
    • Xin Long's avatar
      sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt · 826d253d
      Xin Long authored
      
      
      This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used
      to dump the prsctp statistics info from the asoc. The prsctp statistics
      includes abandoned_sent/unsent from the asoc. abandoned_sent is the
      count of the packets we drop packets from retransmit/transmited queue,
      and abandoned_unsent is the count of the packets we drop from out_queue
      according to the policy.
      
      Note: another option for prsctp statistics dump described in rfc is
      SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics
      info from each stream. But by now, linux doesn't yet have per stream
      statistics info, it needs rfc6525 to be implemented. As the prsctp
      statistics for each stream has to be based on per stream statistics,
      we will delay it until rfc6525 is done in linux.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      826d253d
    • Xin Long's avatar
      sctp: add SCTP_PR_SUPPORTED on sctp sockopt · 28aa4c26
      Xin Long authored
      
      
      According to section 4.5 of rfc7496, prsctp_enable should be per asoc.
      We will add prsctp_enable to both asoc and ep, and replace the places
      where it used net.sctp->prsctp_enable with asoc->prsctp_enable.
      
      ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and
      asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can
      also modify it's value through sockopt SCTP_PR_SUPPORTED.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28aa4c26
  7. 03 Jun, 2016 1 commit
    • Marcelo Ricardo Leitner's avatar
      sctp: Add GSO support · 90017acc
      Marcelo Ricardo Leitner authored
      SCTP has this pecualiarity that its packets cannot be just segmented to
      (P)MTU. Its chunks must be contained in IP segments, padding respected.
      So we can't just generate a big skb, set gso_size to the fragmentation
      point and deliver it to IP layer.
      
      This patch takes a different approach. SCTP will now build a skb as it
      would be if it was received using GRO. That is, there will be a cover
      skb with protocol headers and children ones containing the actual
      segments, already segmented to a way that respects SCTP RFCs.
      
      With that, we can tell skb_segment() to just split based on frag_list,
      trusting its sizes are already in accordance.
      
      This way SCTP can benefit from GSO and instead of passing several
      packets through the stack, it can pass a single large packet.
      
      v2:
      - Added support for receiving GSO frames, as requested by Dave Miller.
      - Clear skb->cb if packet is GSO (otherwise it's not used by SCTP)
      - Added heuristics similar to what we have in TCP for not generating
        single GSO packets that fills cwnd.
      v3:
      - consider sctphdr size in skb_gso_transport_seglen()
      - rebased due to 5c7cdf33
      
       ("gso: Remove arbitrary checks for
        unsupported GSO")
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Tested-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      90017acc
  8. 02 May, 2016 1 commit
  9. 14 Apr, 2016 2 commits
  10. 11 Apr, 2016 1 commit
    • Marcelo Ricardo Leitner's avatar
      sctp: avoid refreshing heartbeat timer too often · ba6f5e33
      Marcelo Ricardo Leitner authored
      
      
      Currently on high rate SCTP streams the heartbeat timer refresh can
      consume quite a lot of resources as timer updates are costly and it
      contains a random factor, which a) is also costly and b) invalidates
      mod_timer() optimization for not editing a timer to the same value.
      It may even cause the timer to be slightly advanced, for no good reason.
      
      As suggested by David Laight this patch now removes this timer update
      from hot path by leaving the timer on and re-evaluating upon its
      expiration if the heartbeat is still needed or not, similarly to what is
      done for TCP. If it's not needed anymore the timer is re-scheduled to
      the new timeout, considering the time already elapsed.
      
      For this, we now record the last tx timestamp per transport, updated in
      the same spots as hb timer was restarted on tx. Also split up
      sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
      sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
      heartbeat one.
      
      On loopback with MTU of 65535 and data chunks with 1636, so that we
      have a considerable amount of chunks without stressing system calls,
      netperf -t SCTP_STREAM -l 30, perf looked like this before:
      
      Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
        Overhead  Command  Shared Object      Symbol
      +    6,15%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,43%  netperf  [kernel.vmlinux]   [k] _raw_write_unlock_irqrestore
         - _raw_write_unlock_irqrestore
            - 96,54% _raw_spin_unlock_irqrestore
               - 36,14% mod_timer
                  + 97,24% sctp_transport_reset_timers
                  + 2,76% sctp_do_sm
               + 33,65% __wake_up_sync_key
               + 28,77% sctp_ulpq_tail_event
               + 1,40% del_timer
            - 1,84% mod_timer
               + 99,03% sctp_transport_reset_timers
               + 0,97% sctp_do_sm
            + 1,50% sctp_ulpq_tail_event
      
      And after this patch, now with netperf -l 60:
      
      Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
        Overhead  Command  Shared Object      Symbol
      +    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
      +    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
         - _raw_spin_unlock_irqrestore
            + 49,89% __wake_up_sync_key
            + 45,68% sctp_ulpq_tail_event
            - 2,85% mod_timer
               + 76,51% sctp_transport_reset_t3_rtx
               + 23,49% sctp_do_sm
            + 1,55% del_timer
      +    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
      +    2,26%  netperf  [sctp]             [k] sctp_sendmsg
      
      Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
      ~3.7%.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ba6f5e33
  11. 14 Mar, 2016 1 commit
    • Marcelo Ricardo Leitner's avatar
      sctp: allow sctp_transmit_packet and others to use gfp · cea8768f
      Marcelo Ricardo Leitner authored
      
      
      Currently sctp_sendmsg() triggers some calls that will allocate memory
      with GFP_ATOMIC even when not necessary. In the case of
      sctp_packet_transmit it will allocate a linear skb that will be used to
      construct the packet and this may cause sends to fail due to ENOMEM more
      often than anticipated specially with big MTUs.
      
      This patch thus allows it to inherit gfp flags from upper calls so that
      it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
      similar. All others, like retransmits or flushes started from BH, are
      still allocated using GFP_ATOMIC.
      
      In netperf tests this didn't result in any performance drawbacks when
      memory is not too fragmented and made it trigger ENOMEM way less often.
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cea8768f
  12. 08 Mar, 2016 1 commit
  13. 17 Feb, 2016 1 commit
  14. 28 Jan, 2016 2 commits
  15. 27 Jan, 2016 1 commit
  16. 05 Jan, 2016 2 commits
  17. 07 Dec, 2015 1 commit
    • lucien's avatar
      sctp: start t5 timer only when peer rwnd is 0 and local state is SHUTDOWN_PENDING · 8a0d19c5
      lucien authored
      when A sends a data to B, then A close() and enter into SHUTDOWN_PENDING
      state, if B neither claim his rwnd is 0 nor send SACK for this data, A
      will keep retransmitting this data until t5 timeout, Max.Retrans times
      can't work anymore, which is bad.
      
      if B's rwnd is not 0, it should send abort after Max.Retrans times, only
      when B's rwnd == 0 and A's retransmitting beyonds Max.Retrans times, A
      will start t5 timer, which is also commit f8d96052 ("sctp: Enforce
      retransmission limit during shutdown") means, but it lacks the condition
      peer rwnd == 0.
      
      so fix it by adding a bit (zero_window_announced) in peer to record if
      the last rwnd is 0. If it was, zero_window_announced will be set. and use
      this bit to decide if start t5 timer when local.state is SHUTDOWN_PENDING.
      
      Fixes: commit f8d96052
      
       ("sctp: Enforce retransmission limit during shutdown")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a0d19c5
  18. 03 Dec, 2015 1 commit
  19. 14 Jun, 2015 1 commit
    • Marcelo Ricardo Leitner's avatar
      sctp: fix ASCONF list handling · 2d45a02d
      Marcelo Ricardo Leitner authored
      ->auto_asconf_splist is per namespace and mangled by functions like
      sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
      
      Also, the call to inet_sk_copy_descendant() was backuping
      ->auto_asconf_list through the copy but was not honoring
      ->do_auto_asconf, which could lead to list corruption if it was
      different between both sockets.
      
      This commit thus fixes the list handling by using ->addr_wq_lock
      spinlock to protect the list. A special handling is done upon socket
      creation and destruction for that. Error handlig on sctp_init_sock()
      will never return an error after having initialized asconf, so
      sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
      will be take on sctp_close_sock(), before locking the socket, so we
      don't do it in inverse order compared to sctp_addr_wq_timeout_handler().
      
      Instead of taking the lock on sctp_sock_migrate() for copying and
      restoring the list values, it's preferred to avoid rewritting it by
      implementing sctp_copy_descendant().
      
      Issue was found with a test application that kept flipping sysctl
      default_auto_asconf on and off, but one could trigger it by issuing
      simultaneous setsockopt() calls on multiple sockets or by
      creating/destroying sockets fast enough. This is only triggerable
      locally.
      
      Fixes: 9f7d653b
      
       ("sctp: Add Auto-ASCONF support (core).")
      Reported-by: default avatarJi Jianwen <jiji@redhat.com>
      Suggested-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Suggested-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2d45a02d
  20. 24 Nov, 2014 1 commit
  21. 24 Oct, 2014 1 commit
  22. 01 Aug, 2014 1 commit
    • Jason Gunthorpe's avatar
      sctp: Fixup v4mapped behaviour to comply with Sock API · 299ee123
      Jason Gunthorpe authored
      
      
      The SCTP socket extensions API document describes the v4mapping option as
      follows:
      
      8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)
      
         This socket option is a Boolean flag which turns on or off the
         mapping of IPv4 addresses.  If this option is turned on, then IPv4
         addresses will be mapped to V6 representation.  If this option is
         turned off, then no mapping will be done of V4 addresses and a user
         will receive both PF_INET6 and PF_INET type addresses on the socket.
         See [RFC3542] for more details on mapped V6 addresses.
      
      This description isn't really in line with what the code does though.
      
      Introduce addr_to_user (renamed addr_v4map), which should be called
      before any sockaddr is passed back to user space. The new function
      places the sockaddr into the correct format depending on the
      SCTP_I_WANT_MAPPED_V4_ADDR option.
      
      Audit all places that touched v4mapped and either sanely construct
      a v4 or v6 address then call addr_to_user, or drop the
      unnecessary v4mapped check entirely.
      
      Audit all places that call addr_to_user and verify they are on a sycall
      return path.
      
      Add a custom getname that formats the address properly.
      
      Several bugs are addressed:
       - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
         addresses to user space
       - The addr_len returned from recvmsg was not correct when
         returning AF_INET on a v6 socket
       - flowlabel and scope_id were not zerod when promoting
         a v4 to v6
       - Some syscalls like bind and connect behaved differently
         depending on v4mapped
      
      Tested bind, getpeername, getsockname, connect, and recvmsg for proper
      behaviour in v4mapped = 1 and 0 cases.
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Tested-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      299ee123
  23. 16 Jul, 2014 3 commits
    • Geir Ola Vaagland's avatar
      net: sctp: implement rfc6458, 5.3.6. SCTP_NXTINFO cmsg support · 2347c80f
      Geir Ola Vaagland authored
      
      
      This patch implements section 5.3.6. of RFC6458, that is, support
      for 'SCTP Next Receive Information Structure' (SCTP_NXTINFO) which
      is placed into ancillary data cmsghdr structure for each recvmsg()
      call, if this information is already available when delivering the
      current message.
      
      This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
      level by setting an int value with 1/0 for SCTP_RECVNXTINFO in
      user space applications as per RFC6458, section 8.1.30.
      
      The sctp_nxtinfo structure is defined as per RFC as below ...
      
        struct sctp_nxtinfo {
          uint16_t nxt_sid;
          uint16_t nxt_flags;
          uint32_t nxt_ppid;
          uint32_t nxt_length;
          sctp_assoc_t nxt_assoc_id;
        };
      
      ... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_NXTINFO, while cmsg_data[] contains struct sctp_nxtinfo.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: default avatarGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2347c80f
    • Geir Ola Vaagland's avatar
      net: sctp: implement rfc6458, 5.3.5. SCTP_RCVINFO cmsg support · 0d3a421d
      Geir Ola Vaagland authored
      
      
      This patch implements section 5.3.5. of RFC6458, that is, support
      for 'SCTP Receive Information Structure' (SCTP_RCVINFO) which is
      placed into ancillary data cmsghdr structure for each recvmsg()
      call.
      
      This option can be enabled/disabled via setsockopt(2) on SOL_SCTP
      level by setting an int value with 1/0 for SCTP_RECVRCVINFO in user
      space applications as per RFC6458, section 8.1.29.
      
      The sctp_rcvinfo structure is defined as per RFC as below ...
      
        struct sctp_rcvinfo {
          uint16_t rcv_sid;
          uint16_t rcv_ssn;
          uint16_t rcv_flags;
          <-- 2 bytes hole  -->
          uint32_t rcv_ppid;
          uint32_t rcv_tsn;
          uint32_t rcv_cumtsn;
          uint32_t rcv_context;
          sctp_assoc_t rcv_assoc_id;
        };
      
      ... and provided under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_RCVINFO, while cmsg_data[] contains struct sctp_rcvinfo.
      An sctp_rcvinfo item always corresponds to the data in msg_iov.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: default avatarGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d3a421d
    • Geir Ola Vaagland's avatar
      net: sctp: implement rfc6458, 5.3.4. SCTP_SNDINFO cmsg support · 63b94938
      Geir Ola Vaagland authored
      
      
      This patch implements section 5.3.4. of RFC6458, that is, support
      for 'SCTP Send Information Structure' (SCTP_SNDINFO) which can be
      placed into ancillary data cmsghdr structure for sendmsg() calls.
      
      The sctp_sndinfo structure is defined as per RFC as below ...
      
        struct sctp_sndinfo {
          uint16_t snd_sid;
          uint16_t snd_flags;
          uint32_t snd_ppid;
          uint32_t snd_context;
          sctp_assoc_t snd_assoc_id;
        };
      
      ... and supplied under cmsg_level IPPROTO_SCTP, cmsg_type
      SCTP_SNDINFO, while cmsg_data[] contains struct sctp_sndinfo.
      An sctp_sndinfo item always corresponds to the data in msg_iov.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: default avatarGeir Ola Vaagland <geirola@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      63b94938
  24. 11 Jun, 2014 1 commit
  25. 18 Apr, 2014 1 commit
    • Vlad Yasevich's avatar
      net: sctp: cache auth_enable per endpoint · b14878cc
      Vlad Yasevich authored
      
      
      Currently, it is possible to create an SCTP socket, then switch
      auth_enable via sysctl setting to 1 and crash the system on connect:
      
      Oops[#1]:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1
      task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000
      [...]
      Call Trace:
      [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80
      [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4
      [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c
      [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8
      [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214
      [<ffffffff8043af68>] sctp_rcv+0x588/0x630
      [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24
      [<ffffffff803acb50>] ip6_input+0x2c0/0x440
      [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564
      [<ffffffff80310650>] process_backlog+0xb4/0x18c
      [<ffffffff80313cbc>] net_rx_action+0x12c/0x210
      [<ffffffff80034254>] __do_softirq+0x17c/0x2ac
      [<ffffffff800345e0>] irq_exit+0x54/0xb0
      [<ffffffff800075a4>] ret_from_irq+0x0/0x4
      [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48
      [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148
      [<ffffffff805a88b0>] start_kernel+0x37c/0x398
      Code: dd0900b8  000330f8  0126302d <dcc60000> 50c0fff1  0047182a  a48306a0
      03e00008  00000000
      ---[ end trace b530b0551467f2fd ]---
      Kernel panic - not syncing: Fatal exception in interrupt
      
      What happens while auth_enable=0 in that case is, that
      ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs()
      when endpoint is being created.
      
      After that point, if an admin switches over to auth_enable=1,
      the machine can crash due to NULL pointer dereference during
      reception of an INIT chunk. When we enter sctp_process_init()
      via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk,
      the INIT verification succeeds and while we walk and process
      all INIT params via sctp_process_param() we find that
      net->sctp.auth_enable is set, therefore do not fall through,
      but invoke sctp_auth_asoc_set_default_hmac() instead, and thus,
      dereference what we have set to NULL during endpoint
      initialization phase.
      
      The fix is to make auth_enable immutable by caching its value
      during endpoint initialization, so that its original value is
      being carried along until destruction. The bug seems to originate
      from the very first days.
      
      Fix in joint work with Daniel Borkmann.
      Reported-by: default avatarJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Tested-by: default avatarJoshua Kinard <kumba@gentoo.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b14878cc
  26. 14 Apr, 2014 1 commit
    • Daniel Borkmann's avatar
      Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann authored
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though a not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the reverted patch applied, the SCTP/IPv4 performance is back
      to normal on latest upstream for IPv4 and IPv6 and has same throughput
      as 3.4.2 test kernel, steady and interval reports are smooth again.
      
      Fixes: ef2820a7
      
       ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: default avatarPeter Butler <pbutler@sonusnet.com>
      Reported-by: default avatarDongsheng Song <dongsheng.song@gmail.com>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Tested-by: default avatarPeter Butler <pbutler@sonusnet.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      362d5204
  27. 17 Feb, 2014 1 commit
    • Matija Glavinic Pecotic's avatar
      net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7
      Matija Glavinic Pecotic authored
      
      
      Implementation of (a)rwnd calculation might lead to severe performance issues
      and associations completely stalling. These problems are described and solution
      is proposed which improves lksctp's robustness in congestion state.
      
      1) Sudden drop of a_rwnd and incomplete window recovery afterwards
      
      Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
      but size of sk_buff, which is blamed against receiver buffer, is not accounted
      in rwnd. Theoretically, this should not be the problem as actual size of buffer
      is double the amount requested on the socket (SO_RECVBUF). Problem here is
      that this will have bad scaling for data which is less then sizeof sk_buff.
      E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
      of traffic of this size (less then 100B).
      
      An example of sudden drop and incomplete window recovery is given below. Node B
      exhibits problematic behavior. Node A initiates association and B is configured
      to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
      message in 4G (LTE) network). On B data is left in buffer by not reading socket
      in userspace.
      
      Lets examine when we will hit pressure state and declare rwnd to be 0 for
      scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
      chunk is sent in separate sctp packet)
      
      Logic is implemented in sctp_assoc_rwnd_decrease:
      
      socket_buffer (see below) is maximum size which can be held in socket buffer
      (sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)
      
      A simple expression is given for which it will be examined after how many
      packets for above stated parameters we enter pressure state:
      
      We start by condition which has to be met in order to enter pressure state:
      
      	socket_buffer < currently_alloced;
      
      currently_alloced is represented as size of sctp packets received so far and not
      yet delivered to userspace. x is the number of chunks/packets (since there is no
      bundling, and each chunk is delivered in separate packet, we can observe each
      chunk also as sctp packet, and what is important here, having its own sk_buff):
      
      	socket_buffer < x*each_sctp_packet;
      
      each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
      twice the amount of initially requested size of socket buffer, which is in case
      of sctp, twice the a_rwnd requested:
      
      	2*rwnd < x*(payload+sizeof(struc sk_buff));
      
      sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
      and each payload size is 43
      
      	20000 < x(43+190);
      
      	x > 20000/233;
      
      	x ~> 84;
      
      After ~84 messages, pressure state is entered and 0 rwnd is advertised while
      received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
      drop from 6474 to 0, as it will be now shown in example:
      
      IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
      IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
      IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
      <...>
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
      
      --> Sudden drop
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      At this point, rwnd_press stores current rwnd value so it can be later restored
      in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
      slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
      condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
      adding amount which is read from userspace. Let us observe values in above
      example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
      amount of actual sctp data currently waiting to be delivered to userspace
      is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
      only for sctp data, which is ~3500. Condition is never met, and when userspace
      reads all data, rwnd stays on 3569.
      
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      --> At this point userspace read everything, rwnd recovered only to 3569
      
      IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
      
      Reproduction is straight forward, it is enough for sender to send packets of
      size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.
      
      2) Minute size window for associations sharing the same socket buffer
      
      In case multiple associations share the same socket, and same socket buffer
      (sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
      of the associations can permanently drop rwnd of other association(s).
      
      Situation will be typically observed as one association suddenly having rwnd
      dropped to size of last packet received and never recovering beyond that point.
      Different scenarios will lead to it, but all have in common that one of the
      associations (let it be association from 1)) nearly depleted socket buffer, and
      the other association blames socket buffer just for the amount enough to start
      the pressure. This association will enter pressure state, set rwnd_press and
      announce 0 rwnd.
      When data is read by userspace, similar situation as in 1) will occur, rwnd will
      increase just for the size read by userspace but rwnd_press will be high enough
      so that association doesn't have enough credit to reach rwnd_press and restore
      to previous state. This case is special case of 1), being worse as there is, in
      the worst case, only one packet in buffer for which size rwnd will be increased.
      Consequence is association which has very low maximum rwnd ('minute size', in
      our case down to 43B - size of packet which caused pressure) and as such
      unusable.
      
      Scenario happened in the field and labs frequently after congestion state (link
      breaks, different probabilities of packet drop, packet reordering) and with
      scenario 1) preceding. Here is given a deterministic scenario for reproduction:
      
      >From node A establish two associations on the same socket, with rcvbuf_policy
      being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
      repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
      scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
      bring it down close to 0, as in just one more packet would close it. This has as
      a consequence that association number 2 is able to receive (at least) one more
      packet which will bring it in pressure state. E.g. if association 2 had rwnd of
      10000, packet received was 43, and we enter at this point into pressure,
      rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
      increase for 43, but conditions to restore rwnd to original state, just as in
      1), will never be satisfied.
      
      --> Association 1, between A.y and B.12345
      
      IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
      IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
      IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
      IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Association 2, between A.z and B.12346
      
      IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
      IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
      IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
      IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
      
      --> Deplete socket buffer by sending messages of size 43B over association 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
      
      <...>
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
      
      --> Sudden drop on 1
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
          association 1 so there is place in buffer for only one more packet
      
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      --> Socket buffer is almost depleted, but there is space for one more packet,
          send them over association 2, size 43B
      
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Immediate drop
      
      IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
      
      --> Read everything from the socket, both association recover up to maximum rwnd
          they are capable of reaching, note that association 1 recovered up to 3698,
          and association 2 recovered only to 43
      
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
      IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
      IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
      IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
      IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
      
      A careful reader might wonder why it is necessary to reproduce 1) prior
      reproduction of 2). It is simply easier to observe when to send packet over
      association 2 which will push association into the pressure state.
      
      Proposed solution:
      
      Both problems share the same root cause, and that is improper scaling of socket
      buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
      calculating rwnd is not possible due to fact that there is no linear
      relationship between amount of data blamed in increase/decrease with IP packet
      in which payload arrived. Even in case such solution would be followed,
      complexity of the code would increase. Due to nature of current rwnd handling,
      slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
      entered is rationale, but it gives false representation to the sender of current
      buffer space. Furthermore, it implements additional congestion control mechanism
      which is defined on implementation, and not on standard basis.
      
      Proposed solution simplifies whole algorithm having on mind definition from rfc:
      
      o  Receiver Window (rwnd): This gives the sender an indication of the space
         available in the receiver's inbound buffer.
      
      Core of the proposed solution is given with these lines:
      
      sctp_assoc_rwnd_update:
      	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
      		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
      	else
      		asoc->rwnd = 0;
      
      We advertise to sender (half of) actual space we have. Half is in the braces
      depending whether you would like to observe size of socket buffer as SO_RECVBUF
      or twice the amount, i.e. size is the one visible from userspace, that is,
      from kernelspace.
      In this way sender is given with good approximation of our buffer space,
      regardless of the buffer policy - we always advertise what we have. Proposed
      solution fixes described problems and removes necessity for rwnd restoration
      algorithm. Finally, as proposed solution is simplification, some lines of code,
      along with some bytes in struct sctp_association are saved.
      
      Version 2 of the patch addressed comments from Vlad. Name of the function is set
      to be more descriptive, and two parts of code are changed, in one removing the
      superfluous call to sctp_assoc_rwnd_update since call would not result in update
      of rwnd, and the other being reordering of the code in a way that call to
      sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
      in a way that existing function is not reordered/copied in line, but it is
      correctly called. Thanks Vlad for suggesting.
      Signed-off-by: default avatarMatija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Reviewed-by: default avatarAlexander Sverdlin <alexander.sverdlin@nsn.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef2820a7