    [PATCH] cpuset: memory pressure meter · 3e0d98b9
    Paul Jackson authored
    
    
    Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
    that the tasks in a cpuset call try_to_free_pages(), the synchronous
    (direct) memory reclaim code.
    
    This enables batch managers monitoring jobs running in dedicated cpusets to
    efficiently detect the level of memory pressure each job is causing.
    
    This is useful both on tightly managed systems running a wide mix of
    submitted jobs, where the batch manager may choose to terminate or
    reprioritize jobs that are trying to use more memory than allowed on the
    nodes assigned to them, and with tightly coupled, long-running, massively
    parallel scientific computing jobs that will fail dramatically to meet
    required performance goals if they start to use more memory than they are
    allowed.
    
    This patch just provides a very economical way for the batch manager to
    monitor a cpuset for signs of memory pressure.  It's up to the batch
    manager or other user code to decide what to do about it and take action.
    
    ==> Unless this feature is enabled by writing "1" to the special file
        /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
        code of __alloc_pages() for this metric reduces to simply noticing
        that the cpuset_memory_pressure_enabled flag is zero.  So only
        systems that enable this feature will compute the metric.
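
    The gating described above can be sketched in user-space C as follows.
    This is only an illustration of the pattern, not the code in this patch:
    pressure_enabled, bump_count, pressure_bump() and pressure_bump_slow()
    are hypothetical names standing in for the kernel flag and hook.

```c
/* Hypothetical stand-ins for this sketch, not the kernel's definitions. */
static int pressure_enabled;      /* models cpuset_memory_pressure_enabled */
static unsigned long bump_count;  /* counts events that reached the meter */

/* Slow path: only reached when the feature has been switched on. */
static void pressure_bump_slow(void)
{
    bump_count++;
}

/* Hook on the allocation path: a single flag test when the feature is off. */
static inline void pressure_bump(void)
{
    if (pressure_enabled)
        pressure_bump_slow();
}
```

    With the flag left at zero, each call to pressure_bump() costs one
    load and one branch, and nothing is counted.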
    
    Why a per-cpuset, running average:
    
        Because this meter is per-cpuset, rather than per-task or mm, the
        system load imposed by a batch scheduler monitoring this metric is
        sharply reduced on large systems, because a scan of the tasklist can be
        avoided on each set of queries.
    
        Because this meter is a running average, instead of an accumulating
        counter, a batch scheduler can detect memory pressure with a single
        read, instead of having to read and accumulate results for a period of
        time.
    
        Because this meter is per-cpuset rather than per-task or mm, the
        batch scheduler can obtain the key information, memory pressure in a
        cpuset, with a single read, rather than having to query and accumulate
        results over the whole (dynamically changing) set of tasks in the cpuset.
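
    From the batch manager's side, the "single read" above might look like
    the following sketch. read_meter() is a hypothetical helper, and the
    path passed to it is an assumption: this message does not name the
    per-cpuset file, so the caller supplies whatever path the meter is
    exposed at.

```c
#include <stdio.h>

/*
 * Read one integer from a meter file.
 * Returns 0 on success and stores the value in *out, -1 on failure.
 */
static int read_meter(const char *path, long *out)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", out) == 1);  /* expect a single decimal number */
    fclose(fp);
    return ok ? 0 : -1;
}
```

    One open, one read and one close per cpuset per poll; no walk of the
    tasklist and no state kept between polls.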
    
    A simple per-cpuset digital filter (requiring a spinlock and 3 words of
    data per cpuset) is kept, and is updated by any task attached to that
    cpuset whenever it enters the synchronous (direct) page reclaim code.
    
    A per-cpuset file provides an integer number representing the recent
    (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
    in the cpuset, in units of reclaims attempted per second, times 1000.
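
    A filter with these properties can be sketched in user-space C as
    below. This is a minimal sketch under stated assumptions, not
    necessarily the code in this patch: struct meter, the meter_*()
    functions and the constants are illustrative, the spinlock is omitted
    (single-threaded sketch), and the current time is passed in as "now"
    in whole seconds. The decay coefficient 933 is chosen because
    0.933^10 is roughly 0.5, giving the 10 second half-life.

```c
#include <time.h>

#define M_SCALE    1000     /* fixed-point scale: result is events/sec * 1000 */
#define M_COEF      933     /* per-second decay, about M_SCALE * 0.5^(1/10) */
#define M_MAXTICKS   99     /* cap on idle seconds decayed in one update */
#define M_MAXCNT 1000000    /* cap on pending count to avoid overflow */

struct meter {              /* three words of state */
    long   cnt;             /* events since last update, scaled by M_SCALE */
    long   val;             /* filtered rate, scaled by M_SCALE */
    time_t time;            /* second at which val was last updated */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void meter_update(struct meter *m, time_t now)
{
    time_t ticks = now - m->time;

    if (ticks <= 0)
        return;
    if (ticks > M_MAXTICKS)
        ticks = M_MAXTICKS;
    while (ticks-- > 0)
        m->val = (M_COEF * m->val) / M_SCALE;
    m->time = now;

    m->val += ((M_SCALE - M_COEF) * m->cnt) / M_SCALE;
    m->cnt = 0;
}

/* Record one event (here: one entry into direct reclaim). */
static void meter_mark(struct meter *m, time_t now)
{
    meter_update(m, now);
    m->cnt += M_SCALE;
    if (m->cnt > M_MAXCNT)
        m->cnt = M_MAXCNT;
}

/* Return the current filtered rate, in events per second times 1000. */
static long meter_read(struct meter *m, time_t now)
{
    meter_update(m, now);
    return m->val;
}
```

    At a steady r events per second the value settles near 1000 * r,
    since each second multiplies it by 0.933 and adds 67 * r. After a
    burst ends, the value falls to about half every 10 seconds.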
    
    Signed-off-by: Paul Jackson <pj@sgi.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>