1. 29 Sep, 2006 2 commits
    • [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map · 38837fc7
      Paul Jackson authored
      
      
      Change the list of memory nodes allowed to tasks in the top (root) cpuset
      to dynamically track what memory nodes are online, using a call to a cpuset
      hook from the memory hotplug code.  Make this top 'mems' file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support memory hotplug, then these tasks cannot make
      use of memory nodes that are added after system boot, because the memory
      nodes are not allowed in the top cpuset.  This is a surprising regression
      over earlier kernels that didn't have cpusets enabled.
      
      One key motivation for this change is to remain consistent with the
      behaviour for the top_cpuset's 'cpus', which is also read-only, and which
      automatically tracks the cpu_online_map.
      
      This change also has the minor benefit that it fixes a long standing,
      little noticed, minor bug in cpusets.  The cpuset performance tweak to
      short circuit the cpuset_zone_allowed() check on systems with just a single
      cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
      changing the 'mems' of the top_cpuset had no effect, even though the change
      (the write system call) appeared to succeed.  With the following change,
      that write to the 'mems' file fails with -EACCES, and the 'mems' file
      stubbornly refuses to be changed via user space writes.  Thus no one should
      be misled into thinking they've changed the top_cpuset's 'mems' when in
      effect they haven't.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'mems' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of node_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged memory nodes
      allowed by their cpuset.
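The behaviour described above can be sketched in user space. This is a minimal model, not the kernel code: the struct, the 64-bit node mask and both function names are assumptions made for illustration only.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: bit n set means memory node n is allowed/online. */
typedef uint64_t nodemask_model_t;

struct cpuset_model {
    int is_top;                     /* nonzero for the top (root) cpuset */
    nodemask_model_t mems_allowed;
};

/* Stand-in for the hook called from the memory hotplug code:
 * the top cpuset simply mirrors the set of online nodes. */
static void model_track_node_online_map(struct cpuset_model *top,
                                        nodemask_model_t online)
{
    top->mems_allowed = online;
}

/* Stand-in for a user-space write to a cpuset's 'mems' file. */
static int model_write_mems(struct cpuset_model *cs, nodemask_model_t new_mems)
{
    if (cs->is_top)
        return -EACCES;             /* top 'mems' is read-only */
    cs->mems_allowed = new_mems;
    return 0;
}
```

In this model a hot-added node becomes usable by tasks in the top cpuset as soon as the hook runs, while a write to the top 'mems' fails visibly instead of appearing to succeed.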
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] pidspace: is_init() · f400e198
      Sukadev Bhattiprolu authored
      This is an updated version of Eric Biederman's is_init() patch
      (http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
      replaces a few more instances of ->pid == 1 with is_init().
      
      Further, is_init() checks pid and thus removes dependency on Eric's other
      patches for now.
      
      Eric's original description:
      
      	There are a lot of places in the kernel where we test for init
      	because we give it special properties.  Most significantly init
      	must not die.  This results in code all over the kernel that tests
      	->pid == 1.
      
      	Introduce is_init to capture this case.
      
      	With multiple pid spaces for all of the cases affected we are
      	looking for only the first process on the system, not some other
      	process that has pid == 1.
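As a rough illustration of the idea, here is a user-space sketch. The `task_model` struct and the `_model` function names are hypothetical; only the pid check itself comes from the description above.

```c
/* Hypothetical stand-in for the kernel's task structure. */
struct task_model {
    int pid;
};

/* Model of the helper: true only for the task whose pid is 1. */
static inline int is_init_model(const struct task_model *tsk)
{
    return tsk->pid == 1;
}

/* A call site that would previously have open-coded "->pid == 1". */
static int model_may_kill(const struct task_model *tsk)
{
    return !is_init_model(tsk);     /* init must not die */
}
```

Routing every test through one helper means that a later change to what counts as init only has to touch the helper, not each call site.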
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: <lxc-devel@lists.sourceforge.net>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 27 Sep, 2006 1 commit
  3. 26 Sep, 2006 2 commits
  4. 27 Aug, 2006 2 commits
    • [PATCH] cpuset: oom panic fix · 0d673a5a
      Nick Piggin authored
      
      
      cpuset_excl_nodes_overlap always returns 0 if current is exiting.  This caused
      customer's systems to panic in the OOM killer when processes were having
      trouble getting memory for the final put_user in mm_release, even though
      there were lots of processes to kill.
      
      Change to returning 1 in this case.  This achieves parity with !CONFIG_CPUSETS
      case, and was observed to fix the problem.
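The change in return value can be sketched as follows. This is a deliberately simplified model: the struct, the bitmask representation and the plain intersection test are assumptions, and the real function's handling of non-exiting tasks is more involved.

```c
#include <stdint.h>

/* Hypothetical model of the state the overlap check looks at. */
struct oom_task_model {
    int exiting;            /* nonzero if the task is exiting */
    uint64_t excl_mems;     /* bit n set: node n in its exclusive cpuset */
};

/* Returns 1 if 'candidate' should be treated as overlapping with the
 * allocating task, 0 otherwise. */
static int model_excl_nodes_overlap(const struct oom_task_model *current_task,
                                    const struct oom_task_model *candidate)
{
    if (current_task->exiting)
        return 1;           /* behaviour after the fix; was 0 before */
    return (current_task->excl_mems & candidate->excl_mems) != 0;
}
```

With the old return value of 0, an exiting allocator saw no candidate as overlapping; returning 1 keeps candidates eligible in that case.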
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map · 4c4d50f7
      Paul Jackson authored
      
      
      Change the list of cpus allowed to tasks in the top (root) cpuset to
      dynamically track what cpus are online, using a CPU hotplug notifier.  Make
      this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support CPU hotplug, then these tasks cannot make use
      of CPUs that are added after system boot, because the CPUs are not allowed
      in the top cpuset.  This is a surprising regression over earlier kernels
      that didn't have cpusets enabled.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'cpus' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of cpu_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
      by their cpuset.
      
      Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
      driving the fix, and earlier versions of this patch.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  5. 23 Jul, 2006 1 commit
    • [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock · abb5a5cc
      Paul Jackson authored
      
      
      Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
      callback_mutex lock.
      
      It only happens on cpu_exclusive cpusets, due to the dynamic
      sched domain code trying to take the cpu hotplug lock inside
      the cpuset callback_mutex lock.
      
      This bug has apparently been here for several months, but didn't
      get hit until the right customer load on a large system.
      
      This fix appears right from inspection, but it will take a few
      more days running it on that customer's workload to be confident
      we nailed it.  We don't have any other reproducible test case.
      
      The cpu_hotplug_lock() tends to cover large runs of code.
      The other places that hold both that lock and the cpuset callback
      mutex lock always nest the cpuset lock inside the hotplug lock.
      This place tries to do the reverse, risking an ABBA deadlock.
      
      This is in the cpuset_rmdir() code, where we:
        * take the callback_mutex lock
        * mark the cpuset CS_REMOVED
        * call update_cpu_domains for cpu_exclusive cpusets
        * in that call, take the cpu_hotplug lock if the
          cpuset is marked for removal.
      
      Thanks to Jack Steiner for identifying this deadlock.
      
      The fix is to tear down the dynamic sched domain before we grab
      the cpuset callback_mutex lock.  This way, the two locks are
      serialized, with the hotplug lock taken and released before
      trying for the cpuset lock.
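The lock-ordering problem can be illustrated with a tiny single-threaded order checker. It does not take real locks and is not how the kernel detects deadlocks; the enum, struct and function names are all hypothetical. It only records which lock was taken while another was held and flags a later acquisition in the opposite order.

```c
/* Hypothetical identifiers for the two locks discussed above. */
enum model_lock_id { HOTPLUG_LOCK, CALLBACK_MUTEX, MODEL_NLOCKS };

struct lock_order_model {
    int held[MODEL_NLOCKS];
    /* seen_after[a][b] != 0: b was once acquired while a was held */
    int seen_after[MODEL_NLOCKS][MODEL_NLOCKS];
};

/* Returns 0 normally, -1 if this acquisition inverts an order recorded
 * earlier (a potential ABBA deadlock). The lock is marked held either way. */
static int model_acquire(struct lock_order_model *m, enum model_lock_id id)
{
    int inverted = 0;

    for (int h = 0; h < MODEL_NLOCKS; h++) {
        if (!m->held[h] || h == (int)id)
            continue;
        if (m->seen_after[id][h])
            inverted = 1;           /* earlier: h was taken inside id */
        m->seen_after[h][id] = 1;   /* now: id taken inside h */
    }
    m->held[id] = 1;
    return inverted ? -1 : 0;
}

static void model_release(struct lock_order_model *m, enum model_lock_id id)
{
    m->held[id] = 0;
}
```

Taking the hotplug lock and releasing it before taking the callback mutex never nests the two, so the checker has nothing to flag on that path.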
      
      I suspect that this bug was introduced when I changed the
      cpuset locking from one lock to two.  The dynamic sched domain
      dependency on cpu_exclusive cpusets and its hotplug hooks were
      added to this code earlier, when cpusets had only a single lock.
      It may well have been fine then.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  6. 30 Jun, 2006 2 commits
  7. 26 Jun, 2006 2 commits
    • [PATCH] proc: Use struct pid not struct task_ref · 13b41b09
      Eric W. Biederman authored
      
      
      Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
      that they work with struct pid instead of struct task_ref.
      
      Mostly this is a straight 1-1 substitution.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] proc: don't lock task_structs indefinitely · 99f89551
      Eric W. Biederman authored
      
      
      Every inode in /proc holds a reference to a struct task_struct.  If a
      directory or file is opened and remains open after the task exits this
      pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
      file descriptor is about 10K.

      Normally I would figure a reasonable per user process limit is about 100
      processes.  With 80 processes, with 1000 file descriptors each I can trigger
      the OOM killer on a 32bit kernel, because I have pinned about 800MB of useless
      data.

      This patch replaces the struct task_struct pointer with a pointer to a struct
      task_ref which has a struct task_struct pointer.  So the pinning of dead
      tasks does not happen.

      The code now has to contend with the fact that the task may now exit at any
      time, which is a little but not much more complicated.
      
      With this change it takes about 1000 processes each opening up 1000 file
      descriptors before I can trigger the OOM killer.  Much better.
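The "about 800MB" figure for the old behaviour follows directly from the numbers quoted above (80 processes, 1000 descriptors each, about 10K pinned per descriptor). A trivial sketch of the arithmetic, with a hypothetical helper name:

```c
/* Estimated pinned memory in KB: processes x descriptors x KB per descriptor. */
static unsigned long model_pinned_kb(unsigned long nprocs,
                                     unsigned long fds_per_proc,
                                     unsigned long kb_per_fd)
{
    return nprocs * fds_per_proc * kb_per_fd;
}
```

`model_pinned_kb(80, 1000, 10)` gives 800000 KB, which is the roughly 800MB cited.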
      
      [mlp@google.com: task_mmu small fixes]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Signed-off-by: Prasanna Meda <mlp@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 23 Jun, 2006 2 commits
    • [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c) · 22fb52dd
      David Quigley authored
      
      
      Add a security hook call to enable security modules to control the ability
      to attach a task to a cpuset.  While limited control over this operation is
      possible via permission checks on the pseudo fs interface, those checks are
      not sufficient to control access to the target task, which is looked up in
      this function.  The existing task_setscheduler hook is re-used for this
      operation since this falls under the same class of operations.
      Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: James Morris <jmorris@namei.org>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      David Howells authored
      
      
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
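The default path described above can be modelled in user space. The `_model` structs and functions are simplified stand-ins, and reference counting is deliberately left out. Only the field names mnt_sb, mnt_root and s_root, and the "always returns 0" behaviour, are taken from the text.

```c
/* Hypothetical, heavily reduced stand-ins for the VFS structures. */
struct dentry_model { const char *name; };
struct super_block_model { struct dentry_model *s_root; };
struct vfsmount_model {
    struct super_block_model *mnt_sb;
    struct dentry_model *mnt_root;
};

/* Model of the helper: point the mount at the superblock and at the
 * superblock's own root dentry (the old default behaviour). */
static int model_simple_set_mnt(struct vfsmount_model *mnt,
                                struct super_block_model *sb)
{
    mnt->mnt_sb = sb;
    mnt->mnt_root = sb->s_root;
    return 0;
}

/* Model of a filesystem's get_sb(): it now returns an int and fills in
 * the vfsmount it was handed instead of returning the superblock. */
static int foo_get_sb_model(struct super_block_model *sb,
                            struct vfsmount_model *mnt)
{
    return model_simple_set_mnt(mnt, sb);   /* tail call, always 0 */
}
```

A filesystem that shares one superblock between several mounts would skip the helper and set mnt_sb and mnt_root itself, choosing a different root dentry per mount.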
      
      The patch also makes the following changes:
      
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
      
       (*) If one of the convenience functions is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
      
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
      
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees which results in
           dentries being left unculled.
      
           However, with the way NFS superblock sharing is currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
      
           [*] Anonymous until discovered from another tree.
      
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
      
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  9. 21 May, 2006 2 commits
  10. 31 Mar, 2006 3 commits
  11. 24 Mar, 2006 6 commits
    • [PATCH] cpuset: remove useless local variable initialization · 29afd49b
      Paul Jackson authored
      
      
      Remove a useless variable initialization in cpuset __cpuset_zone_allowed().
      The local variable 'allowed' is unconditionally set before use, later on
      in the code, so does not need to be initialized.
      
      Not that it seems to matter to the code generated any, as the compiler
      optimizes out the superfluous assignment anyway.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: don't need to mark cpuset_mems_generation atomic · 151a4420
      Paul Jackson authored
      
      
      Drop the atomic_t marking on the cpuset static global
      cpuset_mems_generation.  Since all access to it is guarded by the global
      manage_mutex, there is no need for further serialization of this value.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: remove unnecessary NULL check · 8488bc35
      Paul Jackson authored
      
      
      Remove a no longer needed test for NULL cpuset pointer, with a little
      comment explaining why the test isn't needed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset memory spread basic implementation · 825a46af
      Paul Jackson authored
      
      
      This patch provides the implementation and cpuset interface for an alternative
      memory allocation policy that can be applied to certain kinds of memory
      allocations, such as the page cache (file system buffers) and some slab caches
      (such as inode caches).
      
      The policy is called "memory spreading." If enabled, it spreads out these
      kinds of memory allocations over all the nodes allowed to a task, instead of
      preferring to place them on the node where the task is executing.
      
      All other kinds of allocations, including anonymous pages for a task's stack
      and data regions, are not affected by this policy choice, and continue to be
      allocated preferring the node local to execution, as modified by the NUMA
      mempolicy.
      
      There are two boolean flag files per cpuset that control where the kernel
      allocates pages for the file system buffers and related in kernel data
      structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.
      
      If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
      kernel will spread the file system buffers (page cache) evenly over all the
      nodes that the faulting task is allowed to use, instead of preferring to put
      those pages on the node where the task is running.
      
      If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
      kernel will spread some file system related slab caches, such as for inodes
      and dentries evenly over all the nodes that the faulting task is allowed to
      use, instead of preferring to put those pages on the node where the task is
      running.
      
      The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
      or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
      PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
      subsequently joins that cpuset.  In subsequent patches, the page allocation
      calls for the affected page cache and slab caches are modified to perform an
      inline check for these flags, and if set, a call to a new routine
      cpuset_mem_spread_node() returns the node to prefer for the allocation.
      
      The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
      per-task rotor cpuset_mem_spread_rotor to select the next node in the current
      task's mems_allowed to prefer for the allocation.
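The rotor idea can be sketched in user space as below. This is an illustrative model, not the kernel routine: the 64-bit mask, the `MODEL_MAX_NODES` limit and the function name are assumptions, and the rotor is assumed to start in the range 0..63.

```c
#include <stdint.h>

#define MODEL_MAX_NODES 64   /* assumption: nodes fit in one 64-bit mask */

/* Advance the per-task rotor to the next allowed node after its current
 * position, wrapping around, and return that node. Returns -1 if no node
 * is allowed. */
static int model_mem_spread_node(uint64_t mems_allowed, int *rotor)
{
    int node = *rotor;

    if (mems_allowed == 0)
        return -1;
    do {
        node = (node + 1) % MODEL_MAX_NODES;
    } while (!((mems_allowed >> node) & 1));
    *rotor = node;
    return node;
}
```

With nodes 1, 2 and 4 allowed and the rotor starting at 0, successive calls yield 1, 2, 4, 1, and so on, so consecutive allocations land on different nodes.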
      
      This policy can provide substantial improvements for jobs that need to place
      thread local data on the corresponding node, but that need to access large
      file system data sets that need to be spread across the several nodes in the
      job's cpuset in order to fit.  Without this patch, especially for jobs that
      might have one thread reading in the data set, the memory allocation across
      the nodes in the job's cpuset can become very uneven.
      
      A couple of Copyright year ranges are updated as well.  And a couple of email
      addresses that can be found in the MAINTAINERS file are removed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset use combined atomic_inc_return calls · 8a39cc60
      Paul Jackson authored
      
      
      Replace pairs of calls to <atomic_inc, atomic_read> with a single call to
      atomic_inc_return, saving a few bytes of source and kernel text.
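For readers unfamiliar with the pattern, here is a user-space analogue using C11 `<stdatomic.h>` rather than the kernel's atomic_t API. The function names are hypothetical and the mapping onto the kernel calls is an assumption made for illustration.

```c
#include <stdatomic.h>

/* Two-step form: increment, then read the counter back separately. */
static int model_inc_then_read(atomic_int *v)
{
    atomic_fetch_add(v, 1);
    return atomic_load(v);
}

/* Combined form: one atomic operation that also yields the new value. */
static int model_inc_return(atomic_int *v)
{
    return atomic_fetch_add(v, 1) + 1;
}
```

Besides being shorter, the combined form returns the value produced by its own increment, whereas the two-step form could observe an increment made by another thread in between.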
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset cleanup not not operators · 7b5b9ef0
      Paul Jackson authored
      
      
      Since the test_bit() bit operator is boolean (returns 0 or 1), the double not
      "!!" operations needed to convert a scalar (zero or not zero) to a boolean are
      not needed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 23 Mar, 2006 1 commit
  13. 15 Feb, 2006 1 commit
  14. 03 Feb, 2006 1 commit
  15. 15 Jan, 2006 2 commits
  16. 09 Jan, 2006 10 commits
    • [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem · 1b1dcc1b
      Jes Sorensen authored
      
      
      This patch converts the inode semaphore to a mutex. I have tested it on
      XFS and compiled as much as one can consider on an ia64. Anyway your
      luck with it might be different.
      Modified-by: Ingo Molnar <mingo@elte.hu>

      (finished the conversion)
      Signed-off-by: Jes Sorensen <jes@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • [PATCH] shrink dentry struct · 5160ee6f
      Eric Dumazet authored
      
      
      Some long time ago, dentry struct was carefully tuned so that on 32 bits
      UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
      of memory cache lines.
      
      Then RCU was added and dentry struct enlarged by two pointers, with nice
      results for SMP, but not so good on UP, because it broke the above tuning
      (128 + 8 = 136 bytes).

      This patch reverts this unwanted side effect by using a union (d_u),
      where d_rcu and d_child are placed so that these two fields can share
      their memory.
      
      At the time d_free() is called (and d_rcu is really used), d_child is known
      to be empty and not touched by the dentry freeing.
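The space saving can be seen with a reduced model. The `_model` structs below are hypothetical and contain only the fields needed to show the effect; the real struct dentry has many more members.

```c
#include <stddef.h>

/* Two-pointer stand-ins for the kernel's list_head and rcu_head. */
struct list_head_model { struct list_head_model *next, *prev; };
struct rcu_head_model {
    struct rcu_head_model *next;
    void (*func)(struct rcu_head_model *);
};

/* Layout with both fields side by side. */
struct dentry_separate_model {
    unsigned int d_flags;
    struct list_head_model d_child;
    struct rcu_head_model d_rcu;
};

/* Layout with the two fields overlapping in a union named d_u. */
struct dentry_shared_model {
    unsigned int d_flags;
    union {
        struct list_head_model d_child;
        struct rcu_head_model d_rcu;
    } d_u;
};

/* Bytes saved per object by sharing the storage. */
static size_t model_bytes_saved(void)
{
    return sizeof(struct dentry_separate_model)
         - sizeof(struct dentry_shared_model);
}
```

The overlap is only safe because the two members are never live at the same time, which is the condition the surrounding text spells out for d_child and d_rcu.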
      
      Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
      the previous content of d_child is not needed if said dentry was unhashed
      but still accessed by a CPU because of RCU constraints)
      
      As dentry cache easily contains millions of entries, a size reduction is
      worth the extra complexity of the ugly C union.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Maneesh Soni <maneesh@in.ibm.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@cse.unsw.edu.au>
      Cc: James Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@epoch.ncsc.mil>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: skip rcu check if task is in root cpuset · 03a285f5
      Paul Jackson authored
      
      
      For systems that aren't using cpusets, but have CONFIG_CPUSETS enabled in
      their kernel (eventually this may be most distribution kernels), this patch
      removes even the minimal rcu_read_lock() from the memory page allocation path.
      
      Actually, it removes that rcu call for any task that is in the root cpuset
      (top_cpuset), which on systems not actively using cpusets, is all tasks.
      
      We don't need the rcu check for tasks in the top_cpuset, because the
      top_cpuset is statically allocated, so at no risk of being freed out from
      underneath us.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: mark number_of_cpusets read_mostly · 7edc5962
      Paul Jackson authored
      
      
      Mark cpuset global 'number_of_cpusets' as __read_mostly.
      
      This global is accessed every time a zone is considered in the zonelist loops
      beneath __alloc_pages, looking for a free memory page.  If number_of_cpusets
      is just one, then we can short circuit the mems_allowed check.

      Since this global is read a lot on a hot path, and written rarely, it is an
      excellent candidate for __read_mostly.
      
      Thanks to Christoph Lameter for the suggestion.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: use rcu directly optimization · 6b9c2603
      Paul Jackson authored
      
      
      Optimize the cpuset impact on page allocation, the most performance critical
      cpuset hook in the kernel.
      
      On each page allocation, the cpuset hook needs to check for a possible change
      in the current task's cpuset.  It can now handle the common case, of no change,
      without taking any spinlock or semaphore, thanks to RCU.
      
      Convert a spinlock on the current task to an rcu_read_lock(), saving
      approximately a memory barrier and an atomic op, depending on architecture.
      
      This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
      write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
      freeing up a detached cpuset until after any critical sections referencing
      that pointer.
      
      Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: remove test for null cpuset from alloc code path · c417f024
      Paul Jackson authored
      
      
      Remove a couple more lines of code from the cpuset hooks in the page
      allocation code path.
      
      There was a check for a NULL cpuset pointer in the routine
      cpuset_update_task_memory_state() that was only needed during system boot,
      after the memory subsystem was initialized, before the cpuset subsystem was
      initialized, to catch a NULL task->cpuset pointer.
      
      Add a cpuset_init_early() routine, just before the mem_init() call in
      init/main.c, that sets up just enough of the init task's cpuset structure to
      render cpuset_update_task_memory_state() calls harmless.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: migrate all tasks in cpuset at once · 04c19fa6
      Paul Jackson authored
      
      
      Given the mechanism in the previous patch to handle rebinding the per-vma
      mempolicies of all tasks in a cpuset that changes its memory placement, it is
      now easier to handle the page migration requirements of such tasks at the same
      time.
      
      The previous code didn't actually attempt to migrate the pages of the tasks in
      a cpuset whose memory placement changed until the next time each such task
      tried to allocate memory.  This was undesirable, as users invoking memory page
      migration expected it to happen when the placement changed, not some
      unspecified time later when the task needed more memory.
      
      It is now trivial to handle the page migration at the same time as the per-vma
      rebinding is done.
      
      The routine cpuset.c:update_nodemask(), which handles changing a cpuset's
      memory placement ('mems') now checks for the special case of being asked to
      write a placement that is the same as before.  It was harmless enough before
      to just recompute everything again, even though nothing had changed.  But page
      migration is a heavy weight operation - moving pages about.  So now it is
      worth avoiding that if asked to move a cpuset to its current location.
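The early-exit check can be sketched as follows. This is a simplified user-space model: the struct, the bitmask and the migration counter are assumptions standing in for the real page-migration work.

```c
#include <stdint.h>

/* Hypothetical model of a cpuset's memory placement. */
struct cpuset_mig_model {
    uint64_t mems_allowed;      /* bit n set: node n allowed */
    unsigned int migrations;    /* counts simulated migration passes */
};

/* Model of changing 'mems': skip all work if the placement is unchanged. */
static int model_update_nodemask(struct cpuset_mig_model *cs, uint64_t new_mems)
{
    if (new_mems == cs->mems_allowed)
        return 0;               /* same placement: nothing to migrate */
    cs->mems_allowed = new_mems;
    cs->migrations++;           /* stand-in for migrating every task's pages */
    return 0;
}
```

Writing the current value back leaves the counter untouched in this model, which mirrors the intent of avoiding a heavyweight migration when nothing has moved.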
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
      Paul Jackson authored
      
      
      Fix more of a longstanding bug in cpuset/mempolicy interaction.
      
      NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
      to just the Memory Nodes allowed by that cpuset.  The kernel maintains
      internal state for each mempolicy, tracking what nodes are used for the
      MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
      
      When a task's cpuset memory placement changes, whether because the cpuset
      changed, or because the task was attached to a different cpuset, then the
      task's mempolicies have to be rebound to the new cpuset placement, so as to
      preserve the cpuset-relative numbering of the nodes in that policy.
      
      An earlier fix handled such mempolicy rebinding for mempolicies attached to a
      task.
      
      This fix rebinds mempolicies attached to vma's (address ranges in a task's
      address space).  Due to the need to hold the task->mm->mmap_sem semaphore while
      updating vma's, the rebinding of vma mempolicies has to be done when the
      cpuset memory placement is changed, at which time mmap_sem can be safely
      acquired.  The task's mempolicy is rebound later, when the task next attempts
      to allocate memory and notices that its task->cpuset_mems_generation is
      out-of-date with its cpuset's mems_generation.
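The generation check mentioned above can be modelled in a few lines. The structs and the function are hypothetical user-space stand-ins; only the idea of comparing a per-task generation against the cpuset's generation is taken from the text.

```c
#include <stdint.h>

/* Hypothetical model of a cpuset and a task attached to it. */
struct cpuset_gen_model {
    int mems_generation;        /* bumped whenever 'mems' changes */
    uint64_t mems_allowed;
};

struct task_gen_model {
    struct cpuset_gen_model *cpuset;
    int cpuset_mems_generation; /* generation the task last synced to */
    uint64_t mems_allowed;      /* task's cached copy */
};

/* Called on the allocation path: returns 1 if the task had to refresh its
 * cached placement (the point where its mempolicy would be rebound), else 0. */
static int model_update_task_memory_state(struct task_gen_model *t)
{
    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return 0;               /* common case: nothing changed */
    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    return 1;
}
```

The comparison is cheap in the common case, and the task picks up a new placement lazily, the first time it allocates after the cpuset changed.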
      
      Because walking the tasklist to find all tasks attached to a changing cpuset
      requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
      affected tasks while doing the tasklist scan.  In general, one cannot acquire
      a semaphore (which can sleep) while already holding a spinlock (such as
      tasklist_lock).  So a list of mm references has to be built up during the
      tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
      acquired, and the vma's in that mm rebound.
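
      The two-phase shape described above (collect references under the
      spinlock, then do the sleeping work after dropping it) might look roughly
      like the following user-space sketch.  All scan_* names are hypothetical,
      the lock is modelled as a plain flag rather than a real spinlock, and the
      fixed-size array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define SCAN_MAX_TASKS 8

struct scan_mm { int users; int rebound; };            /* hypothetical mm */
struct scan_task { int cpuset_id; struct scan_mm *mm; };

static int scan_tasklist_locked;  /* stand-in for holding tasklist_lock */

/* Stand-in for taking mmap_sem and rebinding vma's.  It may sleep, so it
 * must never run while the (modelled) spinlock is held. */
static void scan_rebind_mm(struct scan_mm *mm)
{
    assert(!scan_tasklist_locked);
    mm->rebound++;
}

/* Rebind every mm used by tasks in cpuset_id; returns how many were done. */
static size_t scan_rebind_cpuset(struct scan_task *tasks, size_t ntasks,
                                 int cpuset_id)
{
    struct scan_mm *mmarray[SCAN_MAX_TASKS];
    size_t n = 0, i;

    /* Phase 1: scan under the spinlock, only collecting references. */
    scan_tasklist_locked = 1;
    for (i = 0; i < ntasks && n < SCAN_MAX_TASKS; i++) {
        if (tasks[i].cpuset_id != cpuset_id || !tasks[i].mm)
            continue;
        tasks[i].mm->users++;     /* pin the mm so it outlives the scan */
        mmarray[n++] = tasks[i].mm;
    }
    scan_tasklist_locked = 0;

    /* Phase 2: lock dropped, sleeping operations are now allowed. */
    for (i = 0; i < n; i++) {
        scan_rebind_mm(mmarray[i]);
        mmarray[i]->users--;      /* drop the reference taken in phase 1 */
    }
    return n;
}
```

      Pinning each mm in phase 1 is what keeps the collected pointers valid
      across the window in which no lock is held.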
      
      Once the tasklist lock is dropped, affected tasks may fork new tasks, before
      their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
      point to the cpuset being rebound (there can only be one; cpuset modifications
      are done under a global 'manage_sem' semaphore), and the mpol_copy code that
      is used to copy a task's mempolicies during fork catches such forking tasks,
      and ensures their children are also rebound.
      
      When a task is moved to a different cpuset, the job is easier, as only one
      task is involved.  The vma's of its mm are scanned, using the same
      mpol_rebind_policy() as used above.
      
      It may happen that both the mpol_copy hook and the update done via the
      tasklist scan update the same mm twice.  This is ok, as the mempolicies of
      each vma in an mm keep track of what mems_allowed they are relative to, and
      safely no-op a second request to rebind to the same nodes.
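
      A minimal sketch of why the double update is harmless, assuming node
      sets are plain bitmasks.  The pol_* names are hypothetical, and
      pol_remap() is only a placeholder for the real node remapping.

```c
#include <assert.h>

/* Hypothetical, simplified mempolicy: node sets are plain bitmasks. */
struct pol_mempolicy {
    unsigned long nodes;         /* nodes the policy currently refers to */
    unsigned long mems_allowed;  /* placement those nodes are relative to */
    int rebinds;                 /* counts rebinds that did real work */
};

/* Placeholder for the real node remapping, which is out of scope here. */
static unsigned long pol_remap(unsigned long nodes, unsigned long newmask)
{
    (void)nodes;
    return newmask;
}

static void pol_rebind(struct pol_mempolicy *pol, unsigned long newmask)
{
    if (pol->mems_allowed == newmask)
        return;                  /* second request for same nodes: no-op */
    pol->nodes = pol_remap(pol->nodes, newmask);
    pol->mems_allowed = newmask; /* record what we are now relative to */
    pol->rebinds++;
}
```

      Calling pol_rebind() twice with the same mask, once from a fork-time
      hook and once from the scan, does the work only once.
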
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      4225399a
    • Paul Jackson's avatar
      [PATCH] cpuset: number_of_cpusets optimization · 202f72d5
      Paul Jackson authored
      
      
      Easy little optimization hack to avoid actually having to call
      cpuset_zone_allowed() and check mems_allowed, in the main page allocation
      routine, __alloc_pages().  This saves several CPU cycles per page allocation
      on systems not using cpusets.
      
      A counter is updated each time a cpuset is created or removed, and whenever
      there is only one cpuset in the system, it must be the root cpuset, which
      contains all CPUs and all Memory Nodes.  In that case, when the counter is
      one, all allocations are allowed.
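
      The counter-based short circuit can be sketched as follows.  This is an
      illustration only: the fast_* names are hypothetical, and the per-node
      bitmask check stands in for the real cpuset_zone_allowed() logic.

```c
#include <assert.h>

static int fast_number_of_cpusets = 1;  /* the root cpuset always exists */
static int fast_slow_checks;            /* counts calls to the full check */

static void fast_cpuset_created(void) { fast_number_of_cpusets++; }
static void fast_cpuset_removed(void) { fast_number_of_cpusets--; }

/* Full check: is this node in the allowed mask? */
static int fast_slow_zone_allowed(unsigned long mems_allowed, int node)
{
    fast_slow_checks++;
    return (int)((mems_allowed >> node) & 1UL);
}

/* Allocation-path wrapper: skip the full check when only root exists. */
static int fast_zone_allowed(unsigned long mems_allowed, int node)
{
    if (fast_number_of_cpusets <= 1)
        return 1;                /* only the root cpuset: allow everything */
    return fast_slow_zone_allowed(mems_allowed, node);
}
```

      Note that on the fast path the mask is never consulted, so whatever it
      contains has no effect while the counter is one.
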
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      202f72d5
    • Paul Jackson's avatar
      [PATCH] cpuset: numa_policy_rebind cleanup · 74cb2155
      Paul Jackson authored
      
      
      Clean up, reorganize and make more robust the mempolicy.c code to rebind
      mempolicies relative to the containing cpuset after a task's memory placement
      changes.
      
      The real motivator for this cleanup patch is to lay more groundwork for the
      upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
      after the containing cpuset memory placement changes.
      
      NUMA mempolicies are constrained by the cpuset their task is a member of.
      When either (1) a task is moved to a different cpuset, or (2) the 'mems'
      mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
      node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
      be recalculated, relative to their new cpuset placement.
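
      One plausible reading of "recalculated, relative to their new cpuset
      placement" is sketched below: a node that was the k-th node of the old
      placement becomes the k-th node of the new one, wrapping around if the
      new placement is smaller.  The remap_* helpers are hypothetical, node
      sets are modelled as bitmasks of at most 32 nodes, and the kernel's
      actual remapping code may differ in its details.

```c
#include <assert.h>

#define REMAP_MAX_NODES 32

/* Number of set bits in mask strictly below position pos. */
static int remap_ord(unsigned long mask, int pos)
{
    int i, n = 0;

    for (i = 0; i < pos; i++)
        if (mask & (1UL << i))
            n++;
    return n;
}

/* Position of the n-th (0-based) set bit in mask, or -1 if none. */
static int remap_nth_bit(unsigned long mask, int n)
{
    int i;

    for (i = 0; i < REMAP_MAX_NODES; i++)
        if ((mask & (1UL << i)) && n-- == 0)
            return i;
    return -1;
}

/* Map policy nodes from positions in oldmems to positions in newmems. */
static unsigned long remap_nodes(unsigned long nodes, unsigned long oldmems,
                                 unsigned long newmems)
{
    unsigned long out = 0;
    int i, w = remap_ord(newmems, REMAP_MAX_NODES);  /* weight of newmems */

    if (w == 0)
        return 0;
    for (i = 0; i < REMAP_MAX_NODES; i++) {
        if (!(nodes & oldmems & (1UL << i)))
            continue;  /* nodes outside the old placement are dropped here */
        out |= 1UL << remap_nth_bit(newmems, remap_ord(oldmems, i) % w);
    }
    return out;
}
```

      For example, a policy on nodes 5 and 7 of a cpuset holding nodes 4-7
      uses the second and fourth nodes of that cpuset; after the cpuset moves
      to nodes 0-3, the same relative positions are nodes 1 and 3.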
      
      The old code used an unreliable method of determining what was the old
      mems_allowed constraining the mempolicy.  It just looked at the task's
      mems_allowed value.  This sort of worked with the present code, that just
      rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
      referring to the old nodes.  But in an upcoming patch, the vma mempolicies
      will be rebound as well.  Then the order in which the various task and vma
      mempolicies are updated will no longer be deterministic, and one can no longer
      count on the task->mems_allowed holding the old value for as long as needed.
      It's not even clear if the current code was guaranteed to work reliably for
      task mempolicies.
      
      So I added a mems_allowed field to each mempolicy, stating exactly what
      mems_allowed the policy is relative to, and updated synchronously and reliably
      anytime that the mempolicy is rebound.
      
      Also removed a useless wrapper routine, numa_policy_rebind(), and had its
      caller, cpuset_update_task_memory_state(), call directly to the rewritten
      policy_rebind() routine, and made that rebind routine extern instead of
      static, and added a "mpol_" prefix to its name, making it
      mpol_rebind_policy().
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      74cb2155