Skip to content
  • Tejun Heo's avatar
    sock, cgroup: add sock->sk_cgroup · bd1060a1
    Tejun Heo authored
    
    
    In cgroup v1, dealing with cgroup membership was difficult because the
    number of membership associations was unbound.  As a result, cgroup v1
    grew several controllers whose primary purpose is either tagging
    membership or pull in configuration knobs from other subsystems so
    that cgroup membership test can be avoided.
    
    net_cls and net_prio controllers are examples of the latter.  They
    allow configuring network-specific attributes from cgroup side so that
    network subsystem can avoid testing cgroup membership; unfortunately,
    these are not only cumbersome but also problematic.
    
    Both net_cls and net_prio aren't properly hierarchical.  Both inherit
    configuration from the parent on creation but there's no interaction
    afterwards.  An ancestor doesn't restrict the behavior in its subtree
    in anyway and configuration changes aren't propagated downwards.
    Especially when combined with cgroup delegation, this is problematic
    because delegatees can mess up whatever network configuration
    implemented at the system level.  net_prio would allow the delegatees
    to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
    the same for classid.
    
    While it is possible to solve these issues from controller side by
    implementing hierarchical allowable ranges in both controllers, it
    would involve quite a bit of complexity in the controllers and further
    obfuscate network configuration as it becomes even more difficult to
    tell what's actually being configured looking from the network side.
    While not much can be done for v1 at this point, as membership
    handling is sane on cgroup v2, it'd be better to make cgroup matching
    behave like other network matches and classifiers than introducing
    further complications.
    
    In preparation, this patch updates sock->sk_cgrp_data handling so that
    it points to the v2 cgroup that sock was created in until either
    net_prio or net_cls is used.  Once either of the two is used,
    sock->sk_cgrp_data reverts to its previous role of carrying prioidx
    and classid.  This is to avoid adding yet another cgroup related field
    to struct sock.
    
    As the mode switching can happen at most once per boot, the switching
    mechanism is aimed at lowering hot path overhead.  It may leak a
    finite, likely small, number of cgroup refs and report spurious
    prioidx or classid on switching; however, dynamic updates of prioidx
    and classid have always been racy and lossy - socks between creation
    and fd installation are never updated, config changes don't update
    existing sockets at all, and prioidx may index with dead and recycled
    cgroup IDs.  Non-critical inaccuracies from small race windows won't
    make any noticeable difference.
    
    This patch doesn't make use of the pointer yet.  The following patch
    will implement netfilter match for cgroup2 membership.
    
    v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
        cgroup specific field.
    
    v3: Add comments explaining why sock_data_prioidx() and
        sock_data_classid() use different fallback values.
    
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
    CC: Neil Horman <nhorman@tuxdriver.com>
    Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    bd1060a1