• Paul Jackson's avatar
    [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
    Paul Jackson authored
    
    
    This patch makes use of the previously underutilized cpuset flag
    'mem_exclusive' to provide what amounts to another layer of memory placement
    resolution.  With this patch, there are now the following four layers of
    memory placement available:
    
     1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
     2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
     3) The current tasks cpuset (GFP_USER allocations constrained to here), and
     4) Specific node placement, using mbind and set_mempolicy.
    
    These nest - each layer is a subset (same or within) of the previous.
    
    Layer (2) above is new, with this patch.  The call used to check whether a
    zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
    extended to take a gfp_mask argument, and its logic is extended, in the case
    that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
    hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
    placement is allowed.  The definition of GFP_USER, which used to be identical
    to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
    cpuset_gfp_hardwall_flag patch.
    
    GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
    cpuset, so long as any node therein is not too tight on memory, but will
    escape to the larger layer, if need be.
    
    The intended use is to allow something like a batch manager to handle several
    jobs, each job in its own cpuset, but using common kernel memory for caches
    and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
    task in or below one mem_exclusive cpuset should not cause swapping on nodes
    in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
    task in another such cpuset.  Heavy use of kernel memory for i/o caching and
    such by one job should not impact the memory available to jobs in other
    non-overlapping mem_exclusive cpusets.
    
    This patch enables providing hardwall, inescapable cpusets for memory
    allocations of each job, while sharing kernel memory allocations between
    several jobs, in an enclosing mem_exclusive cpuset.
    
    Like Dinakar's patch earlier to enable administering sched domains using the
    cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
    that had previously done nothing much useful other than restrict what cpuset
    configurations were allowed.
    Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
    Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    9bf2229f