1. 09 May, 2007 1 commit
  2. 08 May, 2007 3 commits
  3. 07 May, 2007 1 commit
  4. 12 Feb, 2007 2 commits
  5. 30 Dec, 2006 1 commit
  6. 13 Dec, 2006 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Paul Jackson authored
      
      
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occassion.
      
      The hardwall version requires that the current tasks mems_allowed allows
      the node of the specified zone (or that you're in interrupt or that
      __GFP_THISNODE is set or that you're on a one cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclusing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but seems worth it, to reduce this chance of hitting a
      sleeping with irq off complaint again.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      02a0e53d
  7. 08 Dec, 2006 1 commit
  8. 07 Dec, 2006 4 commits
  9. 10 Oct, 2006 1 commit
  10. 01 Oct, 2006 1 commit
  11. 29 Sep, 2006 4 commits
    • Paul Jackson's avatar
      [PATCH] cpuset: fix obscure attach_task vs exiting race · 181b6480
      Paul Jackson authored
      
      
      Fix obscure race condition in kernel/cpuset.c attach_task() code.
      
      There is basically zero chance of anyone accidentally being harmed by this
      race.
      
      It requires a special 'micro-stress' load and a special timing loop hacks
      in the kernel to hit in less than an hour, and even then you'd have to hit
      it hundreds or thousands of times, followed by some unusual and senseless
      cpuset configuration requests, including removing the top cpuset, to cause
      any visibly harm affects.
      
      One could, with perhaps a few days or weeks of such effort, get the
      reference count on the top cpuset below zero, and manage to crash the
      kernel by asking to remove the top cpuset.
      
      I found it by code inspection.
      
      The race was introduced when 'the_top_cpuset_hack' was introduced, and one
      piece of code was not updated.  An old check for a possibly null task
      cpuset pointer needed to be changed to a check for a task marked
      PF_EXITING.  The pointer can't be null anymore, thanks to
      the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
      have gone into PF_EXITING state after it was found in the task_list scan.
      
      If a task is PF_EXITING in this code, it is possible that its task->cpuset
      pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
      than because the top_cpuset was that tasks last valid cpuset.  In that
      case, the wrong cpuset reference counter would be decremented.
      
      The fix is trivial.  Instead of failing the system call if the tasks cpuset
      pointer is null here, fail it if the task is in PF_EXITING state.
      
      The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
      the top_cpuset is done without locking, so could happen at anytime.  But it
      is done during the exit handling, after the PF_EXITING flag is set.  So if
      we verify that a task is still not PF_EXITING after we copy out its cpuset
      pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
      hack references to the top_cpuset.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      181b6480
    • Paul Jackson's avatar
      [PATCH] cpuset: hotunplug cpus and mems in all cpusets · b1aac8bb
      Paul Jackson authored
      
      
      The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
      it could remove a CPU or Node from the top cpuset, while leaving it still
      in some child cpusets.
      
      One basic rule of cpusets is that each cpusets cpus and mems are subsets of
      its parents.  The cpuset hot unplug code violated this rule.
      
      So the cpuset hotunplug handler must walk down the tree, removing any
      removed CPU or Node from all cpusets.
      
      However, it is not allowed to make a cpusets cpus or mems become empty.
      They can only transition from empty to non-empty, not back.
      
      So if the last CPU or Node would be removed from a cpuset by the above
      walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
      that still has something online, and copy its CPU or Memory placement.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b1aac8bb
    • Paul Jackson's avatar
      [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map · 38837fc7
      Paul Jackson authored
      
      
      Change the list of memory nodes allowed to tasks in the top (root) nodeset
      to dynamically track what cpus are online, using a call to a cpuset hook
      from the memory hotplug code.  Make this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support memory hotplug, then these tasks cannot make
      use of memory nodes that are added after system boot, because the memory
      nodes are not allowed in the top cpuset.  This is a surprising regression
      over earlier kernels that didn't have cpusets enabled.
      
      One key motivation for this change is to remain consistent with the
      behaviour for the top_cpuset's 'cpus', which is also read-only, and which
      automatically tracks the cpu_online_map.
      
      This change also has the minor benefit that it fixes a long standing,
      little noticed, minor bug in cpusets.  The cpuset performance tweak to
      short circuit the cpuset_zone_allowed() check on systems with just a single
      cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
      changing the 'mems' of the top_cpuset had no affect, even though the change
      (the write system call) appeared to succeed.  With the following change,
      that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
      refuses to be changed via user space writes.  Thus no one should be mislead
      into thinking they've changed the top_cpusets's 'mems' when in affect they
      haven't.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'mems' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of node_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged memory nodes
      allowed by their cpuset.
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      38837fc7
    • Sukadev Bhattiprolu's avatar
      [PATCH] pidspace: is_init() · f400e198
      Sukadev Bhattiprolu authored
      This is an updated version of Eric Biederman's is_init() patch.
      (http://lkml.org/lkml/2006/2/6/280
      
      ).  It applies cleanly to 2.6.18-rc3 and
      replaces a few more instances of ->pid == 1 with is_init().
      
      Further, is_init() checks pid and thus removes dependency on Eric's other
      patches for now.
      
      Eric's original description:
      
      	There are a lot of places in the kernel where we test for init
      	because we give it special properties.  Most  significantly init
      	must not die.  This results in code all over the kernel test
      	->pid == 1.
      
      	Introduce is_init to capture this case.
      
      	With multiple pid spaces for all of the cases affected we are
      	looking for only the first process on the system, not some other
      	process that has pid == 1.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarSukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: <lxc-devel@lists.sourceforge.net>
      Acked-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f400e198
  12. 27 Sep, 2006 1 commit
  13. 26 Sep, 2006 2 commits
  14. 27 Aug, 2006 2 commits
    • Nick Piggin's avatar
      [PATCH] cpuset: oom panic fix · 0d673a5a
      Nick Piggin authored
      
      
      cpuset_excl_nodes_overlap always returns 0 if current is exiting.  This caused
      customer's systems to panic in the OOM killer when processes were having
      trouble getting memory for the final put_user in mm_release.  Even though
      there were lots of processes to kill.
      
      Change to returning 1 in this case.  This achieves parity with !CONFIG_CPUSETS
      case, and was observed to fix the problem.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0d673a5a
    • Paul Jackson's avatar
      [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map · 4c4d50f7
      Paul Jackson authored
      
      
      Change the list of cpus allowed to tasks in the top (root) cpuset to
      dynamically track what cpus are online, using a CPU hotplug notifier.  Make
      this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support CPU hotplug, then these tasks cannot make use
      of CPUs that are added after system boot, because the CPUs are not allowed
      in the top cpuset.  This is a surprising regression over earlier kernels
      that didn't have cpusets enabled.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'cpus' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of cpu_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
      by their cpuset.
      
      Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
      driving the fix, and earlier versions of this patch.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4c4d50f7
  15. 23 Jul, 2006 1 commit
    • Paul Jackson's avatar
      [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock · abb5a5cc
      Paul Jackson authored
      
      
      Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
      callback_mutex lock.
      
      It only happens on cpu_exclusive cpusets, due to the dynamic
      sched domain code trying to take the cpu hotplug lock inside
      the cpuset callback_mutex lock.
      
      This bug has apparently been here for several months, but didn't
      get hit until the right customer load on a large system.
      
      This fix appears right from inspection, but it will take a few
      more days running it on that customers workload to be confident
      we nailed it.  We don't have any other reproducible test case.
      
      The cpu_hotplug_lock() tends to cover large runs of code.
      The other places that hold both that lock and the cpuset callback
      mutex lock always nest the cpuset lock inside the hotplug lock.
      This place tries to do the reverse, risking an ABBA deadlock.
      
      This is in the cpuset_rmdir() code, where we:
        * take the callback_mutex lock
        * mark the cpuset CS_REMOVED
        * call update_cpu_domains for cpu_exclusive cpusets
        * in that call, take the cpu_hotplug lock if the
          cpuset is marked for removal.
      
      Thanks to Jack Steiner for identifying this deadlock.
      
      The fix is to tear down the dynamic sched domain before we grab
      the cpuset callback_mutex lock.  This way, the two locks are
      serialized, with the hotplug lock taken and released before
      trying for the cpuset lock.
      
      I suspect that this bug was introduced when I changed the
      cpuset locking from one lock to two.  The dynamic sched domain
      dependency on cpu_exclusive cpusets and its hotplug hooks were
      added to this code earlier, when cpusets had only a single lock.
      It may well have been fine then.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      abb5a5cc
  16. 30 Jun, 2006 2 commits
  17. 26 Jun, 2006 2 commits
    • Eric W. Biederman's avatar
      [PATCH] proc: Use struct pid not struct task_ref · 13b41b09
      Eric W. Biederman authored
      
      
      Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
      that they work with struct pid instead of struct task_ref.
      
      Mostly this is a straight 1-1 substitution.
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      13b41b09
    • Eric W. Biederman's avatar
      [PATCH] proc: don't lock task_structs indefinitely · 99f89551
      Eric W. Biederman authored
      
      
      Every inode in /proc holds a reference to a struct task_struct.  If a
      directory or file is opened and remains open after the the task exits this
      pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
      file descriptor is about 10K.
      
      Normally I would figure a reasonable per user process limit is about 100
      processes.  With 80 processes, with a 1000 file descriptors each I can trigger
      the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
      data.
      
      This patch replaces the struct task_struct pointer with a pointer to a struct
      task_ref which has a struct task_struct pointer.  The so the pinning of dead
      tasks does not happen.
      
      The code now has to contend with the fact that the task may now exit at any
      time.  Which is a little but not muh more complicated.
      
      With this change it takes about 1000 processes each opening up 1000 file
      descriptors before I can trigger the OOM killer.  Much better.
      
      [mlp@google.com: task_mmu small fixes]
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Signed-off-by: default avatarPrasanna Meda <mlp@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      99f89551
  18. 23 Jun, 2006 2 commits
    • David Quigley's avatar
      [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c) · 22fb52dd
      David Quigley authored
      
      
      Add a security hook call to enable security modules to control the ability
      to attach a task to a cpuset.  While limited control over this operation is
      possible via permission checks on the pseudo fs interface, those checks are
      not sufficient to control access to the target task, which is looked up in
      this function.  The existing task_setscheduler hook is re-used for this
      operation since this falls under the same class of operations.
      Signed-off-by: default avatarDavid Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: default avatarJames Morris <jmorris@namei.org>
      Acked-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      22fb52dd
    • David Howells's avatar
      [PATCH] VFS: Permit filesystem to override root dentry on mount · 454e2398
      David Howells authored
      
      
      Extend the get_sb() filesystem operation to take an extra argument that
      permits the VFS to pass in the target vfsmount that defines the mountpoint.
      
      The filesystem is then required to manually set the superblock and root dentry
      pointers.  For most filesystems, this should be done with simple_set_mnt()
      which will set the superblock pointer and then set the root dentry to the
      superblock's s_root (as per the old default behaviour).
      
      The get_sb() op now returns an integer as there's now no need to return the
      superblock pointer.
      
      This patch permits a superblock to be implicitly shared amongst several mount
      points, such as can be done with NFS to avoid potential inode aliasing.  In
      such a case, simple_set_mnt() would not be called, and instead the mnt_root
      and mnt_sb would be set directly.
      
      The patch also makes the following changes:
      
       (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
           pointer argument and return an integer, so most filesystems have to change
           very little.
      
       (*) If one of the convenience function is not used, then get_sb() should
           normally call simple_set_mnt() to instantiate the vfsmount. This will
           always return 0, and so can be tail-called from get_sb().
      
       (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
           dcache upon superblock destruction rather than shrink_dcache_anon().
      
           This is required because the superblock may now have multiple trees that
           aren't actually bound to s_root, but that still need to be cleaned up. The
           currently called functions assume that the whole tree is rooted at s_root,
           and that anonymous dentries are not the roots of trees which results in
           dentries being left unculled.
      
           However, with the way NFS superblock sharing are currently set to be
           implemented, these assumptions are violated: the root of the filesystem is
           simply a dummy dentry and inode (the real inode for '/' may well be
           inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
           with child trees.
      
           [*] Anonymous until discovered from another tree.
      
       (*) The documentation has been adjusted, including the additional bit of
           changing ext2_* into foo_* in the documentation.
      
      [akpm@osdl.org: convert ipath_fs, do other stuff]
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Nathan Scott <nathans@sgi.com>
      Cc: Roland Dreier <rolandd@cisco.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      454e2398
  19. 21 May, 2006 2 commits
  20. 31 Mar, 2006 3 commits
  21. 24 Mar, 2006 3 commits