• Paul Jackson's avatar
    [PATCH] cpusets: automatic numa mempolicy rebinding · 68860ec1
    Paul Jackson authored
    
    
    This patch automatically updates a tasks NUMA mempolicy when its cpuset
    memory placement changes.  It does so within the context of the task,
    without any need to support low level external mempolicy manipulation.
    
    If a system is not using cpusets, or if running on a system with just the
    root (all-encompassing) cpuset, then this remap is a no-op.  Only when a
    task is moved between cpusets, or a cpusets memory placement is changed
    does the following apply.  Otherwise, the main routine below,
    rebind_policy() is not even called.
    
    When mixing cpusets, scheduler affinity, and NUMA mempolicies, the
    essential role of cpusets is to place jobs (several related tasks) on a set
    of CPUs and Memory Nodes, the essential role of sched_setaffinity is to
    manage a jobs processor placement within its allowed cpuset, and the
    essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs
    memory placement within its allowed cpuset.
    
    However, CPU affinity and NUMA memory placement are managed within the
    kernel using absolute system wide numbering, not cpuset relative numbering.
    
    This is ok until a job is migrated to a different cpuset, or what's the
    same, a jobs cpuset is moved to different CPUs and Memory Nodes.
    
    Then the CPU affinity and NUMA memory placement of the tasks in the job
    need to be updated, to preserve their cpuset-relative position.  This can
    be done for CPU affinity using sched_setaffinity() from user code, as one
    task can modify anothers CPU affinity.  This cannot be done from an
    external task for NUMA memory placement, as that can only be modified in
    the context of the task using it.
    
    However, it easy enough to remap a tasks NUMA mempolicy automatically when
    a task is migrated, using the existing cpuset mechanism to trigger a
    refresh of a tasks memory placement after its cpuset has changed.  All that
    is needed is the old and new nodemask, and notice to the task that it needs
    to rebind its mempolicy.  The tasks mems_allowed has the old mask, the
    tasks cpuset has the new mask, and the existing
    cpuset_update_current_mems_allowed() mechanism provides the notice.  The
    bitmap/cpumask/nodemask remap operators provide the cpuset relative
    calculations.
    
    This patch leaves open a couple of issues:
    
     1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies:
    
        These mempolicies may reference nodes outside of those allowed to
        the current task by its cpuset.  Tasks are migrated as part of jobs,
        which reside on what might be several cpusets in a subtree.  When such
        a job is migrated, all NUMA memory policy references to nodes within
        that cpuset subtree should be translated, and references to any nodes
        outside that subtree should be left untouched.  A future patch will
        provide the cpuset mechanism needed to mark such subtrees.  With that
        patch, we will be able to correctly migrate these other memory policies
        across a job migration.
    
     2) Updating cpuset, affinity and memory policies in user space:
    
        This is harder.  Any placement state stored in user space using
        system-wide numbering will be invalidated across a migration.  More
        work will be required to provide user code with a migration-safe means
        to manage its cpuset relative placement, while preserving the current
        API's that pass system wide numbers, not cpuset relative numbers across
        the kernel-user boundary.
    Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
    Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
    68860ec1