1. 12 Sep, 2005 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset semaphore depth check optimize · b3426599
      Paul Jackson authored
      
      
      Optimize the deadlock avoidance check on the global cpuset
      semaphore cpuset_sem.  Instead of adding a depth counter to the
      task struct of each task, rather just two words are enough, one
      to store the depth and the other the current cpuset_sem holder.
      
      Thanks to Nikita Danilov for the idea.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      
      [ We may want to change this further, but at least it's now
        a totally internal decision to the cpusets code ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b3426599
  2. 10 Sep, 2005 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset semaphore depth check deadlock fix · 4247bdc6
      Paul Jackson authored
      
      
      The cpusets-formalize-intermediate-gfp_kernel-containment patch
      has a deadlock problem.
      
      This patch was part of a set of four patches to make more
      extensive use of the cpuset 'mem_exclusive' attribute to
      manage kernel GFP_KERNEL memory allocations and to constrain
      the out-of-memory (oom) killer.
      
      A task that is changing cpusets in particular ways on a system
      when it is very short of free memory could double trip over
      the global cpuset_sem semaphore (get the lock and then deadlock
      trying to get it again).
      
      The second attempt to get cpuset_sem would be in the routine
      cpuset_zone_allowed().  This was discovered by code inspection.
      I can not reproduce the problem except with an artifically
      hacked kernel and a specialized stress test.
      
      In real life you cannot hit this unless you are manipulating
      cpusets, and are very unlikely to hit it unless you are rapidly
      modifying cpusets on a memory tight system.  Even then it would
      be a rare occurence.
      
      If you did hit it, the task double tripping over cpuset_sem
      would deadlock in the kernel, and any other task also trying
      to manipulate cpusets would deadlock there too, on cpuset_sem.
      Your batch manager would be wedged solid (if it was cpuset
      savvy), but classic Unix shells and utilities would work well
      enough to reboot the system.
      
      The unusual condition that led to this bug is that unlike most
      semaphores, cpuset_sem _can_ be acquired while in the page
      allocation code, when __alloc_pages() calls cpuset_zone_allowed.
      So it easy to mistakenly perform the following sequence:
        1) task makes system call to alter a cpuset
        2) take cpuset_sem
        3) try to allocate memory
        4) memory allocator, via cpuset_zone_allowed, trys to take cpuset_sem
        5) deadlock
      
      The reason that this is not a serious bug for most users
      is that almost all calls to allocate memory don't require
      taking cpuset_sem.  Only some code paths off the beaten
      track require taking cpuset_sem -- which is good.  Taking
      a global semaphore on the main code path for allocating
      memory would not scale well.
      
      This patch fixes this deadlock by wrapping the up() and down()
      calls on cpuset_sem in kernel/cpuset.c with code that tracks
      the nesting depth of the current task on that semaphore, and
      only does the real down() if the task doesn't hold the lock
      already, and only does the real up() if the nesting depth
      (number of unmatched downs) is exactly one.
      
      The previous required use of refresh_mems(), anytime that
      the cpuset_sem semaphore was acquired and the code executed
      while holding that semaphore might try to allocate memory, is
      no longer required.  Two refresh_mems() calls were removed
      thanks to this.  This is a good change, as failing to get
      all the necessary refresh_mems() calls placed was a primary
      source of bugs in this cpuset code.  The only remaining call
      to refresh_mems() is made while doing a memory allocation,
      if certain task memory placement data needs to be updated
      from its cpuset, due to the cpuset having been changed behind
      the tasks back.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4247bdc6
  3. 09 Sep, 2005 1 commit
  4. 07 Sep, 2005 3 commits
    • John Hawkes's avatar
      [PATCH] cpusets: re-enable "dynamic sched domains" · 0811bab2
      John Hawkes authored
      
      
      Revert the hack introduced last week.
      Signed-off-by: default avatarJohn Hawkes <hawkes@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0811bab2
    • Paul Jackson's avatar
      [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset · ef08e3b4
      Paul Jackson authored
      
      
      Now the real motivation for this cpuset mem_exclusive patch series seems
      trivial.
      
      This patch keeps a task in or under one mem_exclusive cpuset from provoking an
      oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
      interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
      containment, there is little to gain from oom killing a task under a
      non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
      allocation must come from disjoint memory nodes.
      
      This patch enables configuring a system so that a runaway job under one
      mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
      that might be using very high compute and memory resources for a prolonged
      time.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ef08e3b4
    • Paul Jackson's avatar
      [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
      Paul Jackson authored
      
      
      This patch makes use of the previously underutilized cpuset flag
      'mem_exclusive' to provide what amounts to another layer of memory placement
      resolution.  With this patch, there are now the following four layers of
      memory placement available:
      
       1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
       2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
       3) The current tasks cpuset (GFP_USER allocations constrained to here), and
       4) Specific node placement, using mbind and set_mempolicy.
      
      These nest - each layer is a subset (same or within) of the previous.
      
      Layer (2) above is new, with this patch.  The call used to check whether a
      zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
      extended to take a gfp_mask argument, and its logic is extended, in the case
      that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
      hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
      placement is allowed.  The definition of GFP_USER, which used to be identical
      to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
      cpuset_gfp_hardwall_flag patch.
      
      GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
      cpuset, so long as any node therein is not too tight on memory, but will
      escape to the larger layer, if need be.
      
      The intended use is to allow something like a batch manager to handle several
      jobs, each job in its own cpuset, but using common kernel memory for caches
      and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
      task in or below one mem_exclusive cpuset should not cause swapping on nodes
      in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
      task in another such cpuset.  Heavy use of kernel memory for i/o caching and
      such by one job should not impact the memory available to jobs in other
      non-overlapping mem_exclusive cpusets.
      
      This patch enables providing hardwall, inescapable cpusets for memory
      allocations of each job, while sharing kernel memory allocations between
      several jobs, in an enclosing mem_exclusive cpuset.
      
      Like Dinakar's patch earlier to enable administering sched domains using the
      cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
      that had previously done nothing much useful other than restrict what cpuset
      configurations were allowed.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9bf2229f
  5. 26 Aug, 2005 2 commits
    • Paul Jackson's avatar
      [PATCH] completely disable cpu_exclusive sched domain · 212d6d22
      Paul Jackson authored
      
      
      At the suggestion of Nick Piggin and Dinakar, totally disable
      the facility to allow cpu_exclusive cpusets to define dynamic
      sched domains in Linux 2.6.13, in order to avoid problems
      first reported by John Hawkes (corrupt sched data structures
      and kernel oops).
      
      This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
      It has been built, booted and tested for cpuset functionality
      on an SN2 (ia64).
      
      Dinakar or Nick - could you verify that it for sure does avoid
      the problems Hawkes reported.  Hawkes is out of town, and I don't
      have the recipe to reproduce what he found.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      212d6d22
    • Paul Jackson's avatar
      [PATCH] undo partial cpu_exclusive sched domain disabling · ca2f3daf
      Paul Jackson authored
      
      
      The partial disabling of Dinakar's new facility to allow
      cpu_exclusive cpusets to define dynamic sched domains
      doesn't go far enough.  At the suggestion of Nick Piggin
      and Dinakar, let us instead totally disable this facility
      for 2.6.13, in order to avoid problems first reported
      by John Hawkes (corrupt sched data structures and kernel oops).
      
      This patch removes the partial disabling code in 2.6.13-rc7,
      in anticipation of the next patch, which will totally disable
      it instead.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ca2f3daf
  6. 24 Aug, 2005 2 commits
    • Paul Jackson's avatar
      [PATCH] cpu_exclusive sched domains build fix · 3725822f
      Paul Jackson authored
      
      
      As reported by Paul Mackerras <paulus@samba.org>, the previous patch
      "cpu_exclusive sched domains fix" broke the ppc64 build with
      CONFIC_CPUSET, yielding error messages:
      
      kernel/cpuset.c: In function 'update_cpu_domains':
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      
      On some arch's, the node_to_cpumask() is a function, returning
      a cpumask_t.  But the for_each_cpu_mask() requires an lvalue mask.
      
      The following patch fixes this build failure by making a copy
      of the cpumask_t on the stack.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3725822f
    • Paul Jackson's avatar
      [PATCH] cpu_exclusive sched domains on partial nodes temp fix · d10689b6
      Paul Jackson authored
      
      
      This keeps the kernel/cpuset.c routine update_cpu_domains() from
      invoking the sched.c routine partition_sched_domains() if the cpuset in
      question doesn't fall on node boundaries.
      
      I have boot tested this on an SN2, and with the help of a couple of ad
      hoc printk's, determined that it does indeed avoid calling the
      partition_sched_domains() routine on partial nodes.
      
      I did not directly verify that this avoids setting up bogus sched
      domains or avoids the oops that Hawkes saw.
      
      This patch imposes a silent artificial constraint on which cpusets can
      be used to define dynamic sched domains.
      
      This patch should allow proceeding with this new feature in 2.6.13 for
      the configurations in which it is useful (node alligned sched domains)
      while avoiding trying to setup sched domains in the less useful cases
      that can cause the kernel corruption and oops.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Acked-by: default avatarDinakar Guniguntala <dino@in.ibm.com>
      Acked-by: default avatarJohn Hawkes <hawkes@sgi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d10689b6
  7. 09 Aug, 2005 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset release ABBA deadlock fix · 3077a260
      Paul Jackson authored
      
      
      Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.
      
      For a particular usage pattern, creating and destroying cpusets fairly
      frequently using notify_on_release, on a very large system, this deadlock
      can be seen every few days.  If you are not using the cpuset
      notify_on_release feature, you will never see this deadlock.
      
      The existing code, on task exit (or cpuset deletion) did:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          call_usermodehelper() forks /sbin/cpuset_release_agent with path
        drop cpuset_sem
      
      Unfortunately, the fork in call_usermodehelper can allocate memory, and
      allocating memory can require cpuset_sem, if the mems_generation values
      changed in the interim.  This results in an ABBA deadlock, trying to obtain
      cpuset_sem when it is already held by the current task.
      
      To fix this, I put the cpuset path (which must be computed while holding
      cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
      call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.
      
      So the new logic is:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          stash path in kmalloc'd buffer
        drop cpuset_sem
        call_usermodehelper() forks /sbin/cpuset_release_agent with path
        free path
      
      The sharp eyed reader might notice that this patch does not contain any
      calls to kmalloc.  The existing code in the check_for_release() routine was
      already kmalloc'ing a buffer to hold the cpuset path.  In the old code, it
      just held the buffer for a few lines, over the cpuset_release_agent() call
      that in turn invoked call_usermodehelper().  In the new code, with the
      application of this patch, it returns that buffer via the new char
      **ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
      which is called after cpuset_sem is dropped.  Whereas the old code has just
      one call to cpuset_release_agent(), right in the check_for_release()
      routine, the new code has three calls to cpuset_release_agent(), from the
      various places that a cpuset can be released.
      
      This patch has been build and booted on SN2, and passed a stress test that
      previously hit the deadlock within a few seconds.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3077a260
  8. 27 Jul, 2005 1 commit
  9. 25 Jun, 2005 1 commit
  10. 23 Jun, 2005 1 commit
  11. 27 May, 2005 1 commit
    • Paul Jackson's avatar
      [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Paul Jackson authored
      
      
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy - since this is only an issue
      for cpusets using notify_on_release, which the top level big
      cpusets don't normally need to use, only take the cpuset_sem
      for cpusets using notify_on_release.
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Acked-by: default avatarSimon Derr <simon.derr@bull.net>
      Acked-by: default avatarDinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2efe86b8
  12. 16 Apr, 2005 2 commits
    • Benoit Boissinot's avatar
      [PATCH] cpuset: remove function attribute const · 9a848896
      Benoit Boissinot authored
      
      
      gcc-4 warns with
      include/linux/cpuset.h:21: warning: type qualifiers ignored on function
      return type
      
      cpuset_cpus_allowed is declared with const
      extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
      
      First const should be __attribute__((const)), but the gcc manual
      explains that:
      
      "Note that a function that has pointer arguments and examines the data
      pointed to must not be declared const. Likewise, a function that calls a
      non-const function usually must not be const. It does not make sense for
      a const function to return void."
      
      The following patch remove const from the function declaration.
      Signed-off-by: default avatarBenoit Boissinot <benoit.boissinot@ens-lyon.org>
      Acked-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9a848896
    • Linus Torvalds's avatar
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds authored
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4