    bpf: BPF support for sock_ops · 40304b2a
    Lawrence Brakmo authored
    
    
    Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
    struct that allows BPF programs of this type to access some of the
    socket's fields (such as IP addresses, ports, etc.). It uses the
    existing bpf cgroups infrastructure so the programs can be attached per
    cgroup with full inheritance support. The program will be called at
    appropriate times to set relevant connection parameters such as buffer
    sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
    as IP addresses, port numbers, etc.
    
    Although there are already 3 mechanisms to set parameters (sysctls,
    route metrics and setsockopts), this new mechanism provides some
    distinct advantages. Unlike sysctls, it can set parameters per
    connection. In contrast to route metrics, it can also use port numbers
    and information provided by a user level program. In addition, it could
    set parameters probabilistically for evaluation purposes (e.g. do
    something different on 10% of the flows and compare results with the
    other 90% of the flows). Also, in cases where IPv6 addresses contain
    geographic information, the rules to make changes based on the distance
    (or RTT) between the hosts are much easier than route metric rules and
    can be global. Finally, unlike setsockopt, it does not require
    application changes and it can be updated easily at any time.
    
    Although the bpf cgroup framework already contains a sock related
    program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
    (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
    only once during the connection's lifetime. In contrast, the new
    program type will be called multiple times from different places in the
    network stack code.  For example, before sending SYN and SYN-ACKs to set
    an appropriate timeout, when the connection is established to set
    congestion control, etc. As a result it has an "op" field to specify the
    type of operation requested.
    
    The purpose of this new program type is to simplify setting connection
    parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
    easy to use Facebook's internal IPv6 addresses to determine if both hosts
    of a connection are in the same datacenter. Therefore, it is easy to
    write a BPF program to choose a small SYN RTO value when both hosts are
    in the same datacenter.
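The same-datacenter check described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the struct is a userspace mirror of the user-visible struct from this message, and the op code (EXAMPLE_OP_SYN_RTO), the 44-bit prefix mask and the RTO value are hypothetical names and numbers, not something this patch defines. A real program would be built as BPF and attached to a cgroup.

```c
#include <stdint.h>
#include <sys/socket.h>   /* AF_INET6 */
#include <arpa/inet.h>    /* ntohl, htonl */

/* Userspace mirror of the user-visible struct described in this commit,
 * so the sketch builds outside the kernel tree. */
struct bpf_sock_ops {
        uint32_t op;
        union {
                uint32_t reply;
                uint32_t replylong[4];
        };
        uint32_t family;
        uint32_t remote_ip4;
        uint32_t local_ip4;
        uint32_t remote_ip6[4];
        uint32_t local_ip6[4];
        uint32_t remote_port;
        uint32_t local_port;
};

/* Hypothetical values, not defined by this patch. */
#define EXAMPLE_OP_SYN_RTO      1           /* "which SYN RTO should be used?" */
#define EXAMPLE_DC_MASK_WORD1   0xfff00000u /* with word 0: a 44-bit prefix */
#define EXAMPLE_SHORT_SYN_RTO   10          /* units are up to the caller */

/* Return 1 if both endpoints share the assumed datacenter prefix. */
static int example_same_datacenter(const struct bpf_sock_ops *skops)
{
        if (skops->family != AF_INET6)
                return 0;
        /* Word 0 is compared as-is: equality does not depend on byte order. */
        if (skops->local_ip6[0] != skops->remote_ip6[0])
                return 0;
        /* Word 1 is masked, so convert it to host byte order first. */
        return (ntohl(skops->local_ip6[1]) & EXAMPLE_DC_MASK_WORD1) ==
               (ntohl(skops->remote_ip6[1]) & EXAMPLE_DC_MASK_WORD1);
}

/* Return a short SYN RTO for intra-datacenter flows, or -1 to tell the
 * caller that no value is provided. */
static int example_syn_rto_prog(const struct bpf_sock_ops *skops)
{
        if (skops->op != EXAMPLE_OP_SYN_RTO)
                return -1;
        return example_same_datacenter(skops) ? EXAMPLE_SHORT_SYN_RTO : -1;
}
```

Returning -1 for everything else leaves the stack's default in place, so the program only affects flows it positively identifies.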
    
    This patch only contains the framework to support the new BPF program
    type; following patches add the functionality to set various connection
    parameters.
    
    This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
    and a new bpf syscall command to load a new program of this type:
    BPF_PROG_LOAD_SOCKET_OPS.
    
    Two new corresponding structs (one for the kernel, one for the user/BPF
    program):
    
    /* kernel version */
    struct bpf_sock_ops_kern {
            struct sock *sk;
            __u32  op;
            union {
                    __u32 reply;
                    __u32 replylong[4];
            };
    };
    
    /* user version
     * Some fields are in network byte order, reflecting the sock struct.
     * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
     * convert them to host byte order.
     */
    struct bpf_sock_ops {
            __u32 op;
            union {
                    __u32 reply;
                    __u32 replylong[4];
            };
            __u32 family;
            __u32 remote_ip4;     /* In network byte order */
            __u32 local_ip4;      /* In network byte order */
            __u32 remote_ip6[4];  /* In network byte order */
            __u32 local_ip6[4];   /* In network byte order */
            __u32 remote_port;    /* In network byte order */
            __u32 local_port;     /* In host byte order */
    };
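The mixed byte orders in the struct above are easy to get wrong, so here is a small sketch of how the fields might be read. It uses a userspace mirror of the struct and the libc ntohl() in place of the bpf_ntohl macro; the helper names and the 10.0.0.0/8 range are illustrative assumptions only.

```c
#include <stdint.h>
#include <sys/socket.h>   /* AF_INET */
#include <arpa/inet.h>    /* ntohl, htonl */

/* Userspace mirror of the user-visible struct, for illustration. */
struct bpf_sock_ops {
        uint32_t op;
        union {
                uint32_t reply;
                uint32_t replylong[4];
        };
        uint32_t family;
        uint32_t remote_ip4;     /* network byte order */
        uint32_t local_ip4;      /* network byte order */
        uint32_t remote_ip6[4];  /* network byte order */
        uint32_t local_ip6[4];   /* network byte order */
        uint32_t remote_port;    /* network byte order */
        uint32_t local_port;     /* host byte order */
};

/* Hypothetical helper: is the IPv4 peer inside 10.0.0.0/8?
 * remote_ip4 is in network byte order, so convert before masking. */
static int example_remote_in_net10(const struct bpf_sock_ops *skops)
{
        if (skops->family != AF_INET)
                return 0;
        return (ntohl(skops->remote_ip4) & 0xff000000u) == 0x0a000000u;
}

/* Hypothetical helper: local_port is already in host byte order,
 * so it can be compared directly, without any conversion. */
static int example_local_port_is(const struct bpf_sock_ops *skops,
                                 uint32_t port)
{
        return skops->local_port == port;
}
```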
    
    Currently there are two types of ops. The first type expects the BPF
    program to return a value which is then used by the caller (or a
    negative value to indicate the operation is not supported). The second
    type expects state changes to be done by the BPF program, for example
    through a setsockopt BPF helper function, and they ignore the return
    value.
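A rough sketch of the two kinds of ops, seen from both the program and the caller. The op names and values are hypothetical (the actual ops arrive in later patches), the struct is abbreviated to its leading fields, and the caller-side helper only illustrates the "negative means not supported" convention described above.

```c
#include <stdint.h>

/* Leading fields only of the user-visible struct; enough for this sketch. */
struct bpf_sock_ops {
        uint32_t op;
        union {
                uint32_t reply;
                uint32_t replylong[4];
        };
};

/* Hypothetical op codes, one of each kind. */
enum {
        EXAMPLE_OP_GET_SYN_RTO = 1,  /* first kind: return value is used */
        EXAMPLE_OP_ESTABLISHED = 2,  /* second kind: return value is ignored */
};

/* Program side: dispatch on the op field. */
static int example_prog(struct bpf_sock_ops *skops)
{
        switch (skops->op) {
        case EXAMPLE_OP_GET_SYN_RTO:
                return 10;      /* value handed back to the caller */
        case EXAMPLE_OP_ESTABLISHED:
                /* A real program would change socket state here, for
                 * example through a setsockopt BPF helper. */
                return 0;       /* ignored by the caller */
        default:
                return -1;      /* operation not supported */
        }
}

/* Caller side for the first kind of op: fall back to the default
 * when the program reports that the op is not supported. */
static int example_value_or_default(int prog_ret, int dflt)
{
        return prog_ret < 0 ? dflt : prog_ret;
}
```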
    
    The reply fields of the bpf_sock_ops struct are there in case a bpf
    program needs to return a value larger than an integer.
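As an illustration of the reply fields, the sketch below fills replylong with a four-word value. How a caller would consume such a reply is not specified in this patch, so the function name and the convention of returning 0 are assumptions; the struct is abbreviated to its leading fields.

```c
#include <stdint.h>

/* Leading fields only of the user-visible struct; enough for this sketch. */
struct bpf_sock_ops {
        uint32_t op;
        union {
                uint32_t reply;         /* aliases replylong[0] */
                uint32_t replylong[4];
        };
};

/* Hypothetical: hand back a value wider than an integer by writing
 * all four 32-bit reply words. */
static int example_reply_wide(struct bpf_sock_ops *skops,
                              const uint32_t value[4])
{
        for (int i = 0; i < 4; i++)
                skops->replylong[i] = value[i];
        return 0;
}
```

Because reply and replylong share storage, a program that only needs a single 32-bit result can write reply and leave the remaining words untouched.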
    
    Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>