1. 15 Nov, 2016 16 commits
    • pravin shelar's avatar
      9efdb92d
    • vxlan: simplify vxlan xmit · 0770b53b
      pravin shelar authored
      
      
      Existing vxlan xmit function handles two distinct cases.
      1. vxlan net device
      2. vxlan lwt device.
      By separating the initialization of these two cases, the egress path
      looks better.
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0770b53b
    • vxlan: simplify RTF_LOCAL handling. · fee1fad7
      pravin shelar authored
      
      
      Avoid duplicate code for handling RTF_LOCAL routes.
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fee1fad7
    • vxlan: improve vxlan route lookup checks. · 655c3de1
      pravin shelar authored
      
      
      Move route sanity check to respective vxlan[4/6]_get_route functions.
      This allows us to perform all sanity checks before caching the dst so
      that we can avoid these checks on subsequent packets.
      This gives more accurate metadata information for packets from
      fill_metadata_dst().
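
The idea of validating a route before it enters the cache can be modelled in a few lines. This is a minimal Python sketch, not the kernel code: `get_route`, `RouteError`, the dict-based `table` and `cache`, and the "route points back at our own device" check are all hypothetical stand-ins chosen for illustration.

```python
class RouteError(Exception):
    """Raised when a looked-up route fails a sanity check."""


def get_route(dst_addr, table, cache, own_dev):
    """Return a validated route for dst_addr, caching it on success.

    `table` stands in for the routing table and `cache` for the dst cache;
    both are plain dicts in this sketch.
    """
    route = cache.get(dst_addr)
    if route is not None:
        # Cached routes were validated when they were inserted,
        # so no checks are repeated on this path.
        return route

    route = table.get(dst_addr)
    if route is None:
        raise RouteError("no route to " + dst_addr)
    # Hypothetical sanity check: reject a route that loops back to us.
    if route["dev"] == own_dev:
        raise RouteError("circular route to " + dst_addr)

    cache[dst_addr] = route      # only validated routes are cached
    return route


table = {"10.0.0.2": {"dev": "eth0"}, "10.0.0.3": {"dev": "vxlan0"}}
cache = {}
good = get_route("10.0.0.2", table, cache, own_dev="vxlan0")
```

Because a failing route never reaches the cache, every later cache hit can skip the checks entirely.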
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      655c3de1
    • vxlan: simplify exception handling · c46b7897
      pravin shelar authored
      
      
      The vxlan egress path error handling has become complicated, since it
      needs to handle both the IPv4 and IPv6 tunnel cases.
      An earlier patch removed vlan handling from vxlan_build_skb(), so
      vxlan_build_skb() no longer needs to free the skb, and we can simplify
      the xmit path by having a single error-handling path for both types of
      tunnels.
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c46b7897
    • vxlan: avoid checking socket multiple times. · 03dc52a8
      pravin shelar authored
      
      
      Check the vxlan socket in vxlan6_getroute().
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03dc52a8
    • vxlan: avoid vlan processing in vxlan device. · 4a4f86cc
      pravin shelar authored
      
      
      The vxlan device does not have special handling for vlan tagging on egress.
      Therefore it does not make sense to expose the vlan offloading feature.
      This patch does not change vxlan functionality.
      Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a4f86cc
    • udplite: fix NULL pointer dereference · c915fe13
      Paolo Abeni authored
      Commit 850cbadd ("udp: use it's own memory accounting schema")
      assumes that the socket proto has memory accounting enabled,
      but this is not the case for UDPLITE.
      Fix it by enabling memory accounting for UDPLITE and reclaiming
      forward-allocated memory on socket shutdown.
      UDP and UDPLITE now share the same memory accounting limits.
      Also drop the backlog receive operation, since it is no longer needed.
      
      Fixes: 850cbadd ("udp: use it's own memory accounting schema")
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c915fe13
    • Merge branch 'bpf-lru' · e6ca4f16
      David S. Miller authored
      
      
      Martin KaFai Lau says:
      
      ====================
      bpf: LRU map
      
      This patch set adds LRU map implementation to the existing BPF map
      family.
      
      The first few patches introduce the basic BPF LRU list
      implementation.
      
      The later patches introduce BPF_MAP_TYPE_LRU_[PERCPU_]HASH, the LRU
      versions of the existing BPF_MAP_TYPE_[PERCPU_]HASH maps, by leveraging
      the BPF LRU list.
      
      v2:
      - Added a percpu LRU list option which can be specified as
        a map attribute.
      
        [Note: percpu LRU list has nothing to do with the map's value]
      
      - Removed the cpu variable from the struct bpf_lru_locallist
        since it is not needed.
      
      - Changed the __bpf_lru_node_move_out to __bpf_lru_node_move_to_free in
        patch 1 to prepare the percpu LRU list in patch 2.
      
      - Moved the test_lru_map under selftests
      
      - Refactored a few things in the test code
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6ca4f16
    • bpf: Add tests for the LRU bpf_htab · 5db58faf
      Martin KaFai Lau authored
      This patch adds some unit tests and a test_lru_dist.

      The test_lru_dist reads in the numeric keys from a file.
      The files used here are generated by a modified fio-genzipf tool
      originating from the fio test suite.  The sample data files can be
      found here: https://github.com/iamkafai/bpf-lru

      The zipf.* data files have 100k numeric keys, and the keys
      range from 1 to 100k.
      
      The test_lru_dist outputs the number of unique keys (nr_unique).
      For example, the second run below shows that 61239 of the 100k keys
      are unique.  nr_misses counts the keys that cannot be found in the
      LRU map, so nr_misses must be >= nr_unique. test_lru_dist also
      simulates a perfect LRU map as a comparison:
      
      [root@arch-fb-vm1 ~]# ~/devshare/fb-kernel/linux/samples/bpf/test_lru_dist \
      /root/zipf.100k.a1_01.out 4000 1
      ...
      test_parallel_lru_dist (map_type:9 map_flags:0x0):
          task:0 BPF LRU: nr_unique:23093(/100000) nr_misses:31603(/100000)
          task:0 Perfect LRU: nr_unique:23093(/100000 nr_misses:34328(/100000)
      ....
      test_parallel_lru_dist (map_type:9 map_flags:0x2):
          task:0 BPF LRU: nr_unique:23093(/100000) nr_misses:31710(/100000)
          task:0 Perfect LRU: nr_unique:23093(/100000 nr_misses:34328(/100000)
      
      [root@arch-fb-vm1 ~]# ~/devshare/fb-kernel/linux/samples/bpf/test_lru_dist \
      /root/zipf.100k.a0_01.out 40000 1
      ...
      test_parallel_lru_dist (map_type:9 map_flags:0x0):
          task:0 BPF LRU: nr_unique:61239(/100000) nr_misses:67054(/100000)
          task:0 Perfect LRU: nr_unique:61239(/100000 nr_misses:66993(/100000)
      ...
      test_parallel_lru_dist (map_type:9 map_flags:0x2):
          task:0 BPF LRU: nr_unique:61239(/100000) nr_misses:67068(/100000)
          task:0 Perfect LRU: nr_unique:61239(/100000 nr_misses:66993(/100000)
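
The relationship between nr_unique and nr_misses can be reproduced with a small exact-LRU model. This is a minimal Python sketch and not the code of test_lru_dist: the function name `simulate_lru`, the short `trace`, and the capacity are made up for illustration, and only the two counter names are borrowed from the output above.

```python
from collections import OrderedDict


def simulate_lru(keys, capacity):
    """Replay a key trace against an exact LRU cache.

    Returns (nr_unique, nr_misses). The first access to any key is
    always a miss, which is why nr_misses can never be below nr_unique.
    """
    cache = OrderedDict()            # oldest entry first, newest last
    seen = set()
    nr_misses = 0
    for key in keys:
        seen.add(key)
        if key in cache:
            cache.move_to_end(key)   # hit: refresh recency
            continue
        nr_misses += 1               # miss: the key has to be inserted
        if len(cache) >= capacity:
            cache.popitem(last=False)    # evict the least recently used key
        cache[key] = True
    return len(seen), nr_misses


trace = [1, 2, 3, 1, 4, 2, 5, 1]
nr_unique, nr_misses = simulate_lru(trace, capacity=3)
print(f"nr_unique:{nr_unique} nr_misses:{nr_misses}")
```

With a cache of 3 entries, this trace has 5 unique keys and 7 misses: keys 1 and 2 are each evicted once and have to be inserted again.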
      
      LRU map has also been added to map_perf_test:
      /* Global LRU */
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      /* Percpu LRU */
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5db58faf
    • bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH · 8f844938
      Martin KaFai Lau authored
      
      
      Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f844938
    • bpf: Add BPF_MAP_TYPE_LRU_HASH · 29ba732a
      Martin KaFai Lau authored
      
      
      Provide a LRU version of the existing BPF_MAP_TYPE_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29ba732a
    • bpf: Refactor codes handling percpu map · fd91de7b
      Martin KaFai Lau authored
      
      
      Refactor the code that populates the value
      of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
      typed bpf_map.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd91de7b
    • bpf: Add percpu LRU list · 961578b6
      Martin KaFai Lau authored
      
      
      Instead of having a common LRU list, this patch allows a
      percpu LRU list which can be selected by specifying a map
      attribute.  The map attribute will be added in the later
      patch.
      
      While the common use case for LRU is #reads >> #updates,
      a percpu LRU list allows a bpf prog to absorb an unusual number of
      updates in pathological cases (e.g. an external-traffic-facing machine
      which could be under attack).

      The percpu LRU lists are isolated from each other.  The LRU nodes
      (including free nodes) cannot be moved across different LRU lists.

      Here is the update performance comparison between the
      common LRU list and the percpu LRU list (the test code is
      in the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{print r " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
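
The scaling behaviour in these two runs is easier to see as a ratio against the 1-CPU result. This is a minimal Python sketch that only re-uses the totals printed above; the labels "global" and "percpu" assume, following the order of the comparison in the text, that the first run (map_perf_test 16) is the common LRU list and the second (map_perf_test 32) is the percpu LRU list.

```python
# Update totals copied from the two map_perf_test runs above.
global_lru = {1: 2934082, 4: 7391434, 8: 6500576}
percpu_lru = {1: 2896553, 4: 9766395, 8: 17460553}


def scaling(results):
    """Throughput at each CPU count relative to the 1-CPU run."""
    base = results[1]
    return {cpus: total / base for cpus, total in results.items()}


for name, results in (("global", global_lru), ("percpu", percpu_lru)):
    for cpus, factor in scaling(results).items():
        print(f"{name} LRU, {cpus} cpus: {factor:.2f}x")
```

In these numbers the common list peaks at 4 CPUs and falls back at 8, while the percpu list keeps growing to roughly six times its 1-CPU throughput.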
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      961578b6
    • bpf: LRU List · 3a08c2fd
      Martin KaFai Lau authored
      
      
      Introduce bpf_lru_list which will provide LRU capability to
      the bpf_htab in the later patch.
      
      * General Thoughts:
      1. Target use case.  Reads are more frequent than updates
         (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
         If a bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation as far as the LRU list
         is concerned.
      2. It may be useful to think of it as an LRU cache.
      3. Optimize the read case
         3.1 No lock in read case
         3.2 The LRU maintenance is only done during bpf_update_elem()
      4. If there is a percpu LRU list, it will lose the system-wide LRU
         property.  A completely isolated percpu LRU list has the best
         performance, but the memory utilization is not ideal considering
         that the workload may be imbalanced.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list, batching operations before accessing the global LRU list.
         As an LRU cache, where #read >> #update/#insert operations, it
         will work well.
      6. There is a local list (for each cpu) which is named
         'struct bpf_lru_locallist'.  This local list is not used to
         maintain the LRU ordering.  Instead, the local list is there to
         batch enough operations before acquiring the lock of the global
         LRU list.  More details on this later.
      7. A later patch allows a percpu LRU list, selected by specifying a
         map attribute, for scalability reasons and for use cases that need
         to prepare for the worst (and pathological) case, like a DoS attack.
         The percpu LRU lists are completely isolated from each other and the
         LRU nodes (including free nodes) cannot be moved across the lists.
         The following description is for the global LRU list but is mostly
         applicable to the percpu LRU list also.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two-list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit on the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting out of prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name says, maintains the active set of
         the nodes.  We can think of it as the working set or the more
         frequently accessed nodes.  The access frequency is approximated
         by a ref-bit.  The ref-bit is set during bpf_lookup_elem().
      2. The inactive list, as its name also says, maintains a less
         active set of nodes.  They are the candidates to be removed
         from the bpf_htab when we are running out of free nodes.
      3. The ordering of these two lists acts as a rough clock.
         The tail of the inactive list holds the older nodes, which
         should be released first if the bpf_htab needs a free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         Inactive running low is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         nodes without the ref-bit set to the head of the inactive list.
         It moves nodes with the ref-bit set back to the head of the active
         list and then clears their ref-bit.
      4. The inactive rotation is pretty simple.
         It walks the inactive list and moves a node back to the head of the
         active list if its ref-bit is set. The ref-bit is cleared after moving
         to the active list.
         If the node does not have the ref-bit set, it is left where it is
         because it is already in the inactive list.
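
The rotation rules above can be modelled in user space. This is a minimal Python sketch under simplifying assumptions, not the kernel implementation: the `Node` class, the deque-based lists (head on the left, tail on the right) and all function names are hypothetical, and each pass walks a whole list, whereas a real implementation may bound how many nodes it scans.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Node:
    key: int
    ref: bool = False    # set on lookup, cleared when the node is rotated


def rotate_active(active, inactive):
    """Walk the active list once, starting from the tail."""
    for _ in range(len(active)):
        node = active.pop()                 # take the current tail
        if node.ref:
            node.ref = False
            active.appendleft(node)         # back to the head of active
        else:
            inactive.appendleft(node)       # demote to the head of inactive


def rotate_inactive(active, inactive):
    """Promote referenced inactive nodes to the head of the active list."""
    for node in reversed(list(inactive)):   # snapshot, walked tail first
        if node.ref:
            inactive.remove(node)
            node.ref = False
            active.appendleft(node)
        # Unreferenced nodes simply stay where they are.


def rotate(active, inactive):
    # The active list is only rotated when the inactive list runs low,
    # defined here as "# of inactive < # of active".
    if len(inactive) < len(active):
        rotate_active(active, inactive)
    rotate_inactive(active, inactive)


active = deque([Node(1, ref=True), Node(2), Node(3)])   # head ... tail
inactive = deque([Node(4, ref=True)])
rotate(active, inactive)
print([n.key for n in active], [n.key for n in inactive])
```

After one call, the two referenced nodes (1 and 4) sit on the active list with their ref-bits cleared, and the unreferenced nodes (2 and 3) have been demoted to the inactive list.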
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it will walk the inactive list from the tail and
         delete the nodes without the ref-bit set from the bpf_htab.
      4. If no free node is found after step (3), it will forcefully get
         one node from the tail of the inactive or active list.  Forcefully is
         in the sense that it ignores the ref-bit.
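
Steps (3) and (4) can be sketched as a two-phase eviction. This is a minimal Python model, not the kernel code: `Node`, `shrink`, the dict standing in for the bpf_htab, and the `target` parameter are hypothetical, and referenced nodes met during the walk are simply skipped here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Node:
    key: int
    ref: bool = False


def shrink(active, inactive, htab, target):
    """Free up to `target` nodes, deleting their keys from htab.

    Lists are deques with the head on the left and the tail on the right.
    """
    freed = []
    # Step 3: walk the inactive list from the tail, freeing nodes
    # whose ref-bit is not set.
    for node in reversed(list(inactive)):
        if len(freed) == target:
            break
        if not node.ref:
            inactive.remove(node)
            del htab[node.key]
            freed.append(node)
    # Step 4: nothing was freed, so force one node out of the tail of
    # the inactive list (or the active list), ignoring the ref-bit.
    if not freed:
        victims = inactive if inactive else active
        if victims:
            node = victims.pop()
            del htab[node.key]
            node.ref = False
            freed.append(node)
    return freed


htab = {1: "a", 2: "b", 3: "c"}
active = deque([Node(1)])
inactive = deque([Node(2, ref=True), Node(3, ref=True)])
freed = shrink(active, inactive, htab, target=1)
print([n.key for n in freed], sorted(htab))
```

Here every inactive node is referenced, so step (3) frees nothing and step (4) forcefully takes node 3 from the tail of the inactive list.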
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it will try to get a node from the
         free-list (of the current CPU's local list).
      4. If the local free-list is empty, it will acquire nodes from the
         global LRU list.  The global LRU list can satisfy the request
         either from its global free-list or by shrinking the global
         inactive list.  Since the global LRU list lock has already been
         acquired, it will try to get at most LOCAL_FREE_TARGET elements
         for the local free list.
      5. When a new element is added to the bpf_htab, it will
         first sit on the pending-list (of the local list).
         The pending-list will be flushed to the global LRU list
         the next time free nodes need to be acquired from the global
         list.
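
The batching effect of the local list can be shown with a lock-acquisition counter. This is a minimal Python sketch under stated assumptions, not the kernel code: `GlobalLRU`, `LocalList`, the single `tracked` list that receives flushed nodes, and the value of `LOCAL_FREE_TARGET` are all chosen for illustration, and shrinking the global list when it has no free nodes is left out.

```python
import threading
from collections import deque

LOCAL_FREE_TARGET = 4    # arbitrary batch size for this sketch


class GlobalLRU:
    def __init__(self, nr_nodes):
        self.lock = threading.Lock()
        self.free = deque(range(nr_nodes))   # every node starts out free
        self.tracked = deque()               # nodes flushed from local lists
        self.lock_acquisitions = 0


class LocalList:
    """Per-CPU free-list and pending-list in front of one GlobalLRU."""

    def __init__(self, glru):
        self.glru = glru
        self.free = deque()
        self.pending = deque()

    def _refill(self):
        g = self.glru
        with g.lock:
            g.lock_acquisitions += 1
            # Flush the nodes handed out since the previous refill.
            while self.pending:
                g.tracked.appendleft(self.pending.popleft())
            # Take a whole batch of free nodes under the same lock hold.
            while g.free and len(self.free) < LOCAL_FREE_TARGET:
                self.free.append(g.free.popleft())

    def pop_free(self):
        if not self.free:
            self._refill()
        if not self.free:
            return None      # a fuller model would shrink the global list
        node = self.free.popleft()
        self.pending.append(node)    # new elements wait on the pending-list
        return node


glru = GlobalLRU(nr_nodes=10)
local = LocalList(glru)
nodes = [local.pop_free() for _ in range(8)]
print(nodes, glru.lock_acquisitions)
```

Eight allocations cost two acquisitions of the global lock instead of eight, which is the point of the local list.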
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, the two locks are never held together (i.e. one
      lock is always released before the other is acquired).
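
The two locking patterns can be illustrated with ordinary mutexes. This is a minimal Python sketch, not the kernel code: the function names, the four-bucket table and the `lru_order` list are hypothetical, and `threading.Lock` stands in for the kernel's locks.

```python
import threading

lru_lock = threading.Lock()
bucket_locks = [threading.Lock() for _ in range(4)]


def bucket_of(key):
    return hash(key) % len(bucket_locks)


def evict_from_lru(htab, key):
    """LRU-list-side path: both locks held, always lru_lock first."""
    with lru_lock:
        with bucket_locks[bucket_of(key)]:
            return htab.pop(key, None)


def update_elem(htab, lru_order, key, value):
    """Hash-table-side path: the two locks are never held together."""
    with lru_lock:
        lru_order.append(key)    # stand-in for obtaining a free node
    # lru_lock has been released before the bucket lock is taken.
    with bucket_locks[bucket_of(key)]:
        htab[key] = value


htab, lru_order = {}, []
update_elem(htab, lru_order, 7, "x")
evicted = evict_from_lru(htab, 7)
```

Since only one code path nests the locks and it always takes them in the same order, no two threads can end up each waiting for the lock the other holds.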
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a08c2fd
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · bb598c1b
      David S. Miller authored
      
      
      Several cases of bug fixes in 'net' overlapping other changes in
      'net-next'.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb598c1b
  2. 14 Nov, 2016 24 commits