1. 20 Sep, 2016 2 commits
  2. 19 Sep, 2016 36 commits
    • Merge tag 'rxrpc-rewrite-20160917-2' of... · e867e87a
      David S. Miller authored
      Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
      
      
      
      David Howells says:
      
      ====================
      rxrpc: Tracepoint addition and improvement
      
      Here is a set of patches that add some more tracepoints and improve a couple
      of existing ones.  New additions include:
      
       (1) Connection refcount tracking.
      
       (2) Client connection state machine tracking.
      
       (3) Tx and Rx packet lifecycle.
      
       (4) ACK reception and transmission.
      
       (5) recvmsg processing.
      
      Updates include:
      
       (1) Print the symbolic packet name in the Rx packet tracepoint.
      
       (2) Additional call refcount trace events.
      
       (3) Improvements to sk_buff tracking with AF_RXRPC.
      
      In addition:
      
       (1) Config option to inject packet loss during both transmission and
           reception.
      
       (2) Removal of some printks.
      
      This series needs to be applied on top of the previously posted fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e867e87a
    • Merge tag 'rxrpc-rewrite-20160917-1' of... · 5b0c6fc8
      David S. Miller authored
      Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
      
      
      
      David Howells says:
      
      ====================
      rxrpc: Fixes & miscellany
      
      Here are some more AF_RXRPC fix patches with a couple of miscellaneous
      changes also.  Fixes include:
      
       (1) Make RxRPC IPv6 support conditional on IPv6 being available.
      
       (2) Move the condition check in rxrpc_locate_data() into the caller and
           check the error return.
      
       (3) Fix the detection of the last received packet in recvmsg.
      
       (4) Account calls that need acceptance and clean up any unaccepted ones if
           the socket gets closed.
      
       (5) Fix the cleanup of client connections.
      
       (6) Fix the soft-ACK parsing and the retransmission of packets based on
           those ACKs.
      
       (7) Suppress transmission of an ACK when there's no pending ACK to
           transmit because another thread stole it.
      
      And some miscellany:
      
       (8) Whitespace removal.
      
       (9) Switch-value consistency in rxrpc_send_call_packet().
      
      (10) Fix the basic transmission packet size to allow for spur-of-the-moment
           jumbo DATA packet production.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b0c6fc8
    • Merge branch 'net-sched-singly-linked-list' · 029ac211
      David S. Miller authored
      
      
      Florian Westphal says:
      
      ====================
      sched: convert queues to single-linked list
      
      During Netfilter Workshop 2016 Eric Dumazet pointed out that qdisc
      schedulers use doubly-linked lists, even though single-linked list
      would be enough.
      
      The double-linked skb lists incur one extra write on enqueue/dequeue
      operations (to change ->prev pointer of next list elem).
      
      This series converts qdiscs to single-linked version, listhead
      maintains pointers to first (for dequeue) and last skb (for enqueue).
      
      Most qdiscs don't queue at all and instead use a leaf qdisc (typically
      pfifo_fast) so only a few schedulers needed changes.
      
      I briefly tested netem and htb and they seemed fine.
      
      UDP_STREAM netperf with 64 byte packets via veth+pfifo_fast shows
      a small (~2%) improvement.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      029ac211
    • sched: add and use qdisc_skb_head helpers · 48da34b7
      Florian Westphal authored
      
      
      This change replaces the sk_buff_head struct in qdiscs with the new
      qdisc_skb_head.

      It's similar to the sk_buff_head API, but does not use skb->prev pointers.

      Qdiscs commonly enqueue at the tail of a list and dequeue at the head.
      While sk_buff_head works fine for this, enqueue/dequeue also needs to
      adjust the prev pointer of the next element.

      The ->prev pointer is not required for qdiscs, so we can just leave
      it undefined and avoid one cacheline write access for en/dequeue.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48da34b7
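      The idea in the commit above can be illustrated with a small userspace
      sketch. This is not the kernel code: struct pkt, struct pkt_queue and
      the three functions are hypothetical stand-ins for sk_buff and the
      qdisc_skb_head helpers, kept only to show that tail enqueue and head
      dequeue need no ->prev pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only the forward link. */
struct pkt {
    struct pkt *next;
    int id;
};

/* Loosely modeled on the qdisc_skb_head idea: first, last, and a length. */
struct pkt_queue {
    struct pkt *head;   /* dequeue side */
    struct pkt *tail;   /* enqueue side */
    unsigned int qlen;
};

static void queue_init(struct pkt_queue *q)
{
    q->head = NULL;
    q->tail = NULL;
    q->qlen = 0;
}

/* Enqueue at the tail: only the old tail's next pointer is written. */
static void enqueue_tail(struct pkt_queue *q, struct pkt *p)
{
    assert(p != NULL);
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->qlen++;
}

/* Dequeue from the head: there is no prev pointer on the next element to fix. */
static struct pkt *dequeue_head(struct pkt_queue *q)
{
    struct pkt *p = q->head;

    if (!p)
        return NULL;
    q->head = p->next;
    if (!q->head)
        q->tail = NULL;
    p->next = NULL;
    q->qlen--;
    return p;
}
```

      In a doubly-linked list the same two operations would also have to
      update the prev pointer of a neighbouring element, which is the extra
      write the series sets out to avoid.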
    • sched: replace __skb_dequeue with __qdisc_dequeue_head · ed760cb8
      Florian Westphal authored
      
      
      After previous patch these functions are identical.
      Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.
      
      The next patch will then make __qdisc_dequeue_head handle a
      single-linked list instead of a struct sk_buff_head argument.

      Doesn't change generated code.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed760cb8
    • sched: remove qdisc arg from __qdisc_dequeue_head · ec323368
      Florian Westphal authored
      
      
      Moves qdisc stat accounting to qdisc_dequeue_head.
      
      The only direct caller of the __qdisc_dequeue_head version open-codes
      this now.
      
      This allows us to later use __qdisc_dequeue_head as a replacement
      of __skb_dequeue() (which operates on sk_buff_head list).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec323368
    • sched: don't use skb queue helpers · 97d0678f
      Florian Westphal authored
      
      
      A followup change will replace the sk_buff_head in the qdisc
      struct with a slightly different list.
      
      Use of the sk_buff_head helpers will thus cause compiler
      warnings.
      
      Open-code these accesses in an extra change to ease review.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      97d0678f
    • pie: use qdisc_dequeue_head wrapper · 1486587b
      Florian Westphal authored
      
      
      Doesn't change generated code.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1486587b
    • cxgb4: Fix return value check in cfg_queues_uld() · 106323b9
      Wei Yongjun authored
      Fix the return value check, which tests the wrong variable
      in cfg_queues_uld().

      Fixes: 94cdb8bb ("cxgb4: Add support for dynamic allocation of resources for ULD")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      106323b9
    • Merge branch 'mediatek-hw-lro' · 4646651e
      David S. Miller authored
      
      
      Nelson Chang says:
      
      ====================
      net: ethernet: mediatek: add HW LRO functions
      
      This series adds hardware large receive offload (LRO) functions and
      the ethtool functions to configure RX flows of HW LRO.

      changes since v3:
      - Respin the patch on the newer driver
      - Move the dts description of hwlro to optional properties

      changes since v2:
      - Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is programmed
      - Rephrase the dts property as a capability indicating whether the hardware supports LRO
      
      changes since v1:
      - Add HW LRO support
      - Add ethtool hooks to set LRO RX flows
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4646651e
    • net: ethernet: mediatek: add the dts property to set if the HW supports LRO · 004e6cc6
      Nelson Chang authored
      
      
      Add a dts property indicating whether the hardware supports LRO.
      Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      004e6cc6
    • net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO · 7aab747e
      Nelson Chang authored
      
      
      This code adds ethtool functions to set RX flows for HW LRO. Because the
      HW LRO hardware can only recognize the destination IP of TCP/IP RX flows,
      the ethtool command to add a HW LRO flow is as below:
      ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]

      In addition, because the hardware can set four destination IPs in total,
      each GMAC (GMAC1/GMAC2) can set at most two IPs separately.
      Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7aab747e
    • net: ethernet: mediatek: add HW LRO functions of PDMA RX rings · ee406810
      Nelson Chang authored
      
      
      This code adds hardware large receive offload (LRO) functions as follows:
      1) PDMA has four RX rings in total: one is the normal ring, and the others
         can be configured as LRO rings.
      2) Only TCP/IP RX flows can be offloaded. The hardware can set at most four
         IP addresses; if the destination IP of an RX flow matches one of them,
         the flow has the chance to be offloaded.
      3) At most three RX flows can be offloaded, and one flow is mapped to
         one RX ring.
      4) If there are more than three candidate RX flows, the hardware chooses
         three of them based on throughput comparison results.
      Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee406810
    • chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources dynamically for all cxgb4 ULD's · 0fbc81b3
      Hariprasad Shenai authored
      
      
      Allocate resources dynamically for cxgb4's upper layer drivers (ULDs) like
      cxgbit, iw_cxgb4 and cxgb4i. Allocate resources when they register with
      the cxgb4 driver and free them when they unregister. All the queues and
      the interrupts for them will be allocated only during ULD probe and freed
      during remove.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fbc81b3
    • sctp: Remove some redundant code · e8bc8f9a
      Christophe Jaillet authored
      In commit 311b2177 ("sctp: simplify sk_receive_queue locking"), a call
      to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was
      hidden in 'sctp_skb_list_tail()'.

      Now, the code around it looks redundant. The '_init()' part of
      'skb_queue_splice_tail_init()' should already do the same.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8bc8f9a
    • mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full · 95357907
      Jesper Dangaard Brouer authored
      The XDP_TX action can fail to transmit the frame if the TX ring
      is full or the port is down.  In case of TX failure the frame should be
      dropped; the current code instead calls 'break', which is the same as
      XDP_PASS.

      Fixes: 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95357907
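      The control-flow point of the fix above can be sketched in isolation.
      This is a simplified model, not the mlx4 driver code: the verdict and
      outcome enums and handle_verdict() are hypothetical names, and tx_ok
      merely stands for "the TX ring had room and the port was up".

```c
#include <stdbool.h>

/* Hypothetical verdicts; the kernel's own type is enum xdp_action. */
enum verdict { V_DROP, V_PASS, V_TX };

/* What ultimately happens to the received frame. */
enum outcome { FRAME_DROPPED, FRAME_TO_STACK, FRAME_SENT };

/* Decide the fate of a frame given the program verdict and TX success. */
static enum outcome handle_verdict(enum verdict v, bool tx_ok)
{
    switch (v) {
    case V_PASS:
        return FRAME_TO_STACK;
    case V_TX:
        if (tx_ok)
            return FRAME_SENT;
        /* TX failed: drop explicitly instead of leaving the switch,
         * which would hand the frame to the stack like a pass verdict. */
        return FRAME_DROPPED;
    case V_DROP:
    default:
        return FRAME_DROPPED;
    }
}
```

      The only behaviour the sketch is meant to show is that a failed
      transmit ends in a drop and never reaches the pass path.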
    • Merge branch 'ipvlan-l3' · 8ddda653
      David S. Miller authored
      
      
      Mahesh Bandewar says:
      
      ====================
      IPvlan introduce l3s mode
      
      Same old problem with a new approach, drawing especially on suggestions
      from the earlier patch series.

      The first thing is that this is introduced as a new mode rather than
      modifying the old (L3) mode. So the behavior of the existing modes is
      preserved as it is, and the new L3s mode obeys iptables so that the
      intended conn-tracking can work.

      To do this, the code uses the newly added l3mdev_rcv() handler and an
      iptables hook: l3mdev_rcv() performs an inbound route lookup with the
      correct (IPvlan slave) interface, and the iptables hook at LOCAL_INPUT
      then changes the input device from master to the slave to complete the
      formality.

      Supporting stack changes are trivial: exporting a symbol so that IPv6
      has the equivalent of the code already exported for IPv4, and allowing
      the netfilter hook registration code to be called with RTNL held. Please
      look into the individual patches for details.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ddda653
    • ipvlan: Introduce l3s mode · 4fbae7d8
      Mahesh Bandewar authored
      
      
      In a typical IPvlan L3 setup, the master is in the default-ns and
      each slave is in a different (slave) ns. In this setup, egress
      packet processing for traffic originating from a slave-ns will
      hit all NF_HOOKs in the slave-ns as well as the default-ns. However,
      the same is not true for ingress processing: all these NF_HOOKs are
      hit only in the slave-ns, skipping them in the default-ns.
      IPvlan in L3 mode is restrictive, and if admins want to deploy
      iptables rules in the default-ns, this asymmetric data path makes it
      impossible to do so.

      This patch makes use of the l3_rcv() (added as part of l3mdev
      enhancements) to perform an input route lookup on RX packets without
      changing the skb->dev, and then uses an nf_hook at NF_INET_LOCAL_IN
      to change the skb->dev just before handing the skb over to L4.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: David Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fbae7d8
    • net: Add _nf_(un)register_hooks symbols · e8bffe0c
      Mahesh Bandewar authored
      
      
      Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
      the caller to hold the RTNL mutex.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8bffe0c
    • ipv6: Export p6_route_input_lookup symbol · d409b847
      Mahesh Bandewar authored
      
      
      Make ip6_route_input_lookup available outside of the ipv6 module,
      similar to ip_route_input_noref in the IPv4 world.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d409b847
    • Merge branch 'net-offloaded-stats' · a5ea31f5
      David S. Miller authored
      
      
      Jiri Pirko says:
      
      ====================
      net: return offloaded stats as default and expose original sw stats
      
      The problem we are trying to handle concerns offloaded forwarded packets,
      which are not seen by the kernel. Let me try to draw it:
      
          port1                       port2 (HW stats are counted here)
            \                          /
             \                        /
              \                      /
               --(A)---- ASIC --(B)--
                          |
                         (C)
                          |
                         CPU (SW stats are counted here)
      
      Now we have a couple of flows for TX and RX (direction does not matter here):
      
      1) port1->A->ASIC->C->CPU
      
         For this flow, HW and SW stats are equal.
      
      2) port1->A->ASIC->C->CPU->C->ASIC->B->port2
      
         For this flow, HW and SW stats are equal.
      
      3) port1->A->ASIC->B->port2
      
         For this flow, SW stats are 0.
      
      The purpose of this patchset is to provide a facility for the user to
      find out the difference between flows 1+2 and 3. In other words, the user
      will be able to see the statistics for the slow path (through the kernel).

      Also note that HW stats are what some call "accumulated" stats.
      Every packet counted by SW is also counted by HW, but not the other way
      around.

      By default the accumulated stats (HW) will be exposed to the user
      so that userspace apps can react properly.

      This patchset adds the SW stats (flows 1+2) under offload-related stats,
      so in the future we can expose other offload-related stats in a similar way.
      
      ---
      v9->v10:
      - patch 2/3
       - removed unnecessary ()s as pointed out by Nik
      v8->v9:
      - patch 2/3
       - add using of idxattr and prividx
      v7->v8:
      - patch 2/3
       - move helping const from uapi to rtnetlink
       - cancel driver xstat nesting if it is empty
      v6->v7:
      - patch 1/3:
       - ndo interface changed to get the wanted stats type as an input.
       - change commit message.
      - patch 2/3:
       - create a nesting for offloaded stat and put SW stats under it.
       - change the ndo call to indicate which offload stats we want.
       - change commit message.
      - patch 3/3:
       - change ndo implementation to match the changes in the previous patches.
       - change commit message.
      v5->v6:
      - patch 2/4 was dropped as requested by Roopa
      - patch 1/3:
       - comment changed to indicate that default stats are combined stats
       - commit message changed
      - patch 2/3: (previously 3/4)
       - SW stats return nothing if there is no SW stats ndo
      v4->v5:
      - updated cover letter
      - patch3/4:
        - using memcpy directly to copy stats as requested by DaveM
      v3->v4:
      - patch1/4:
        - fixed "return ()" pointed out by EricD
      - patch2/4:
        - fixed if_nlmsg_size as pointed out by EricD
      v2->v3:
      - patch1/4:
        - added dev_have_sw_stats helper
      - patch2/4:
        - avoided memcpy as requested by DaveM
      - patch3/4:
        - use new dev_have_sw_stats helper
      v1->v2:
      - patch3/4:
        - fixed NULL initialization
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5ea31f5
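      The relationship between the two counters in the cover letter above can
      be written down as a short sketch. It assumes only what the letter
      states (every packet counted by SW is also counted by HW); struct
      port_stats and fast_path_packets() are hypothetical names, and the
      clamp for a lagging HW counter is an added assumption, not something
      the series specifies.

```c
#include <stdint.h>

/* Hypothetical per-port counters for one direction. */
struct port_stats {
    uint64_t hw_packets;  /* accumulated: flows 1, 2 and 3 */
    uint64_t sw_packets;  /* slow path only: flows 1 and 2 */
};

/* Packets forwarded purely in hardware (flow 3) = HW total - SW total. */
static uint64_t fast_path_packets(const struct port_stats *s)
{
    /* Assumption: if the HW counter is read from a periodically refreshed
     * cache it may briefly lag behind SW; clamp instead of wrapping. */
    if (s->sw_packets > s->hw_packets)
        return 0;
    return s->hw_packets - s->sw_packets;
}
```

      For example, with 100 packets counted by HW and 30 by SW, the sketch
      attributes 70 packets to the offloaded path.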
    • mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default · fc1bbb0f
      Nogah Frankel authored
      
      
      Change the default statistics ndo to return HW statistics
      (like the one returned by ethtool_ops).
      The HW stats are collected to a cache by delayed work every 1 sec.
      Implement the offload stat ndo.
      Add a function to get SW statistics, to be called from this function.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc1bbb0f
    • net: core: Add offload stats to if_stats_msg · 69ae6ad2
      Nogah Frankel authored
      
      
      Add a nested attribute of offload stats to if_stats_msg
      named IFLA_STATS_LINK_OFFLOAD_XSTATS.
      Under it, add SW stats, meaning stats only for packets that went via
      the slow path to the CPU, named IFLA_OFFLOAD_XSTATS_CPU_HIT.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69ae6ad2
    • netdevice: Add offload statistics ndo · 2c9d85d4
      Nogah Frankel authored
      
      
      Add a new ndo to return statistics for offloaded operations.
      Since there can be many different offloaded operations with many
      stats types, the ndo gets an attribute id by which it knows which
      stats are wanted. The ndo also gets a void pointer to be cast according
      to the attribute id.
      Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c9d85d4
    • Merge tag 'mac80211-next-for-davem-2016-09-16' of... · c13ed534
      David S. Miller authored
      Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
      
      
      
      Johannes Berg says:
      
      ====================
      This time we have various things - all across the board:
       * MU-MIMO sniffer support in mac80211
       * a create_singlethread_workqueue() cleanup
       * interface dump filtering that was documented but not implemented
       * support for the new radiotap timestamp field
       * send delBA in two unexpected conditions (as required by the spec)
       * connect keys cleanups - allow only WEP with index 0-3
       * per-station aggregation limit to work around broken APs
       * debugfs improvement for the integrated codel algorithm
      and various other small improvements and cleanups.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c13ed534
    • net: r6040: add in missing white space in error message text · 22da7349
      Colin Ian King authored
      
      
      A couple of dev_err messages span two lines and the literal
      string is missing a white space between words. Add the white
      space and join the two lines into one.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: FLorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22da7349
    • pkt_sched: fq: use proper locking in fq_dump_stats() · 695b4ec0
      Eric Dumazet authored
      When fq is used on 32bit kernels, we need to lock the qdisc before
      copying 64bit fields.
      
      Otherwise "tc -s qdisc ..." might report bogus values.
      
      Fixes: afe4fd06 ("pkt_sched: fq: Fair Queue packet scheduler")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      695b4ec0
    • openvswitch: use percpu flow stats · db74a333
      Thadeu Lima de Souza Cascardo authored
      
      
      Instead of using flow stats per NUMA node, use them per CPU. When using
      megaflows, the stats lock can be a bottleneck in scalability.

      On an E5-2690 12-core system, usual throughput went from ~4Mpps to
      ~15Mpps when forwarding between two 40GbE ports with a single flow
      configured on the datapath.

      This has been tested on a system with possible CPUs 0-7,16-23. After
      module removal, there was no corruption on the slab cache.
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Cc: pravin shelar <pshelar@ovn.org>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db74a333
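      The per-CPU approach described above can be modeled in plain C. This is
      a single-threaded illustration under stated assumptions, not the
      openvswitch code: NR_CPUS_SKETCH, struct flow and the helpers are
      hypothetical, and a fixed array stands in for real per-CPU allocations
      (which must also cope with sparse possible-CPU masks such as
      0-7,16-23).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_SKETCH 4  /* hypothetical; real code iterates possible CPUs */

struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

/* One stats slot per CPU, so each writer touches only its own slot. */
struct flow {
    struct flow_stats stats[NR_CPUS_SKETCH];
};

static void flow_init(struct flow *f)
{
    memset(f, 0, sizeof(*f));
}

/* Fast path: account a packet on the slot of the CPU that handled it. */
static void flow_account(struct flow *f, unsigned int cpu, uint64_t len)
{
    assert(cpu < NR_CPUS_SKETCH);
    f->stats[cpu].packets++;
    f->stats[cpu].bytes += len;
}

/* Slow path: a reader sums all slots to get the flow totals. */
static struct flow_stats flow_read(const struct flow *f)
{
    struct flow_stats total = { 0, 0 };
    unsigned int cpu;

    for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        total.packets += f->stats[cpu].packets;
        total.bytes += f->stats[cpu].bytes;
    }
    return total;
}
```

      The design trade-off is that updates stay cheap and uncontended while
      reads become proportional to the number of CPUs.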
    • openvswitch: fix flow stats accounting when node 0 is not possible · 40773966
      Thadeu Lima de Souza Cascardo authored
      
      
      On a system with only node 1 as possible, all statistics are going to be
      accounted on node 0 as it will have a single writer.

      However, when getting and clearing the statistics, node 0 is not going
      to be considered, as it's not a possible node.

      Tested that statistics are not zero on a system with only node 1
      possible. Also compile-tested with CONFIG_NUMA off.
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40773966
    • Merge branch 'sctp-transmit-errs' · 829ff348
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: fix the transmit err process
      
      This patchset improves the transmit err process and also fixes some
      issues.

      After this patchset, once the chunks are enqueued successfully, even
      if the chunks fail to be sent, whether because of nodst or nomem,
      no err is returned to users any more. Instead, they are taken care
      of by retransmit.

      v1->v2:
        - add more details to the changelog in patch 1/6
        - add Fixes: tag in patch 2/6, 3/6
        - also revert 69b5777f in patch 3/6
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829ff348
    • sctp: not return ENOMEM err back in sctp_packet_transmit · 41001faf
      Xin Long authored
      
      
      As David and Marcelo suggested, the ENOMEM err shouldn't be returned to
      the user in the transmit path. Instead, sctp's retransmit will take care
      of the chunks that fail to send because of ENOMEM.

      This patch only does some release work when alloc_skb fails, and no
      longer returns ENOMEM.
      
      Besides, it also cleans up sctp_packet_transmit's err path, and fixes
      some issues in err path:
      
       - It didn't free the head skb in nomem: path.
       - No need to check nskb in no_route: path.
       - It should goto err: path if alloc_skb fails for head.
       - Not all the NOMEMs should free nskb.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41001faf
    • sctp: make sctp_outq_flush/tail/uncork return void · 83dbc3d4
      Xin Long authored
      
      
      The sctp_outq_flush return value is meaningless now, so this patch
      makes sctp_outq_flush return void, as well as sctp_outq_tail
      and sctp_outq_uncork.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83dbc3d4
    • sctp: save transmit error to sk_err in sctp_outq_flush · 64519440
      Xin Long authored
      
      
      Every time sctp calls sctp_outq_flush, it sends out the chunks of the
      control queue, retransmit queue and data queue. Even if some chunks
      fail to transmit, it still has to flush all the transports, as it's
      the only chance to clean that transmit_list.

      So the latest transmit error here should be returned back. This transmit
      error is an internal error of sctp stack.

      I checked all the places where it uses the transmit error (the return
      value of sctp_outq_flush); most of them actually just save it to
      sk_err.

      The exceptions are sctp_assoc/endpoint_bh_rcv, which drop the chunk if
      sending a REPLY fails. That is actually incorrect, as we can't
      be sure the error that sctp_outq_flush returns is from sending that
      REPLY.
      
      So it's meaningless for sctp_outq_flush to return error back.
      
      This patch is to save transmit error to sk_err in sctp_outq_flush, the
      new error can update the old value. Eventually, sctp_wait_for_* would
      check for it.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64519440
    • sctp: free msg->chunks when sctp_primitive_SEND return err · b61c654f
      Xin Long authored
      Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
      made sctp_primitive_SEND return err only when asoc state is unavailable.
      In this case, chunks are not enqueued, they have no chance to be freed if
      we don't take care of them later.
      
      This Patch is actually to revert commit 1cd4d5c4 ("sctp: remove the
      unused sctp_datamsg_free()"), commit 69b5777f ("sctp: hold the chunks
      only after the chunk is enqueued in outq") and commit 8b570dc9 ("sctp:
      only drop the reference on the datamsg after sending a msg"), to use
      sctp_datamsg_free to free the chunks of current msg.
      
      Fixes: 8b570dc9 ("sctp: only drop the reference on the datamsg after sending a msg")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b61c654f
    • sctp: do not return the transmit err back to sctp_sendmsg · 66388f2c
      Xin Long authored
      Once a chunk is enqueued successfully, sctp queues can take care of it.
      Even if it fails to transmit (e.g. because of nomem), it should be
      put into the retransmit queue.

      If sctp reports this error to users, it confuses them: they may resend
      that msg, even though the in-kernel sctp stack is already in charge of
      retransmitting it.

      Besides, this error probably is not from the failure of transmitting
      current msg, but transmitting or retransmitting another msg's chunks,
      as sctp_outq_flush just tries to send out all transports' chunks.

      This patch makes sctp_cmd_send_msg return void, and no longer returns the
      transmit err back to sctp_sendmsg.
      
      Fixes: 8b570dc9 ("sctp: only drop the reference on the datamsg after sending a msg")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66388f2c
    • sctp: remove the unnecessary state check in sctp_outq_tail · 2c89791e
      Xin Long authored
      
      
      Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
      the asoc's state through statetable before calling sctp_outq_tail. So
      there's no need to check the asoc's state again in sctp_outq_tail.
      
      Besides, sctp_do_sm is protected by lock_sock: even if sending a msg is
      interrupted by timer events, the event's processes still need to acquire
      lock_sock first. It means no other CMDs can be enqueued into the side
      effect list before CMD_SEND_MSG to change asoc->state, so it's safe to
      remove it.
      
      This patch is to remove redundant asoc->state check from sctp_outq_tail.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c89791e
  3. 17 Sep, 2016 2 commits
    • Merge branch 'ip_tunnel-collect_md' · fd9527f4
      David S. Miller authored
      
      
      Alexei Starovoitov says:
      
      ====================
      ip_tunnel: add collect_md mode to IPv4/IPv6 tunnels
      
      Similar to the geneve, vxlan and gre tunnels, implement 'collect
      metadata' mode in the ipip, ipip6 and ip6ip6 tunnels.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd9527f4
    • samples/bpf: add comprehensive ipip, ipip6, ip6ip6 test · 173ca26e
      Alexei Starovoitov authored
      
      
      The test creates 3 namespaces with veth connected via a bridge.
      The first two namespaces simulate two different hosts with the same
      IPv4 and IPv6 addresses configured on the tunnel interface, and they
      communicate with the outside world via standard tunnels.
      The third namespace creates a collect_md tunnel that is driven by a BPF
      program which selects a different remote host (either the first or
      second namespace) based on the TCP dest port number while the TCP dst
      IP is the same.
      This scenario is a rough approximation of a load balancer use case.
      The tests check both traditional tunnel configuration and collect_md mode.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      173ca26e
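      The selection step described in the commit above can be sketched as a
      pure function. This is not the BPF program from the sample: the port
      numbers, enum remote and pick_remote() are hypothetical, chosen only to
      show one destination IP being mapped to two different remotes by TCP
      destination port.

```c
#include <stdint.h>

/* Hypothetical identifiers for the two simulated hosts. */
enum remote { REMOTE_NONE, REMOTE_FIRST_NS, REMOTE_SECOND_NS };

/* Hypothetical service ports; the sample's actual ports may differ. */
#define PORT_FIRST_NS  5200
#define PORT_SECOND_NS 5201

/* Pick the tunnel remote from the TCP destination port alone. */
static enum remote pick_remote(uint16_t tcp_dport)
{
    switch (tcp_dport) {
    case PORT_FIRST_NS:
        return REMOTE_FIRST_NS;
    case PORT_SECOND_NS:
        return REMOTE_SECOND_NS;
    default:
        return REMOTE_NONE;  /* no tunnel metadata set for other ports */
    }
}
```

      In collect_md mode the result of such a decision would be attached to
      the packet as tunnel metadata rather than fixed in the tunnel device
      configuration.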