1. 24 Jul, 2008 2 commits
    • Mel Gorman's avatar
      hugetlb: move hugetlb_acct_memory() · fc1b8a73
      Mel Gorman authored
      
      
      This is a patchset to give reliable behaviour to a process that
      successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file.  Currently, it
      is possible for the process to be killed due to a small hugepage pool size
      even if it calls mlock().
      
      MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time.  This
      guarantees all future faults against the mapping will succeed.  This
      allows local allocations at first use improving NUMA locality whilst
      retaining reliability.
      
      MAP_PRIVATE mappings do not reserve pages.  This can result in an
      application being SIGKILLed later if a huge page is not available at fault
      time.  This makes huge pages usage very ill-advised in some cases as the
      unexpected application failure cannot be detected and handled as it is
      immediately fatal.  Although an application may force instantiation of the
      pages using mlock(), this may lead to poor memory placement and the
      process may still be killed when performing COW.
      
      This patchset introduces a reliability guarantee for the process which
      creates a private mapping, i.e.  the process that calls mmap() on a
      hugetlbfs file successfully.  The first patch of the set is purely
      mechanical code move to make later diffs easier to read.  The second patch
      will guarantee faults up until the process calls fork().  After patch two,
      as long as the child keeps the mappings, the parent is no longer
      guaranteed to be reliable.  Patch 3 guarantees that the parent will always
      successfully COW by unmapping the pages from the child in the event there
      are insufficient pages in the hugepage pool in allocate a new page, be it
      via a static or dynamic pool.
      
      Existing hugepage-aware applications are unlikely to be affected by this
      change.  For much of hugetlbfs's history, pages were pre-faulted at mmap()
      time or mmap() failed which acts in a reserve-like manner.  If the pool is
      sized correctly already so that parent and child can fault reliably, the
      application will not even notice the reserves.  It's only when the pool is
      too small for the application to function perfectly reliably that the
      reserves come into play.
      
      Credit goes to Andy Whitcroft for cleaning up a number of mistakes during
      review before the patches were released.
      
      This patch:
      
      A later patch in this set needs to call hugetlb_acct_memory() before it is
      defined.  This patch moves the function without modification.  This makes
      later diffs easier to read.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc1b8a73
    • Adrian Bunk's avatar
      mm/hugetlb.c: fix duplicate variable · 75353bed
      Adrian Bunk authored
      
      
      It's confusing that set_max_huge_pages() contained two different
      variables named "ret", and although the code works correctly this should
      be fixed.
      
      The inner of the two variables can simply be removed.
      
      Spotted by sparse.
      
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Cc:  "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75353bed
  2. 06 Jun, 2008 1 commit
    • Nick Piggin's avatar
      hugetlb: fix lockdep error · 46478758
      Nick Piggin authored
      
      
      =============================================
      [ INFO: possible recursive locking detected ]
      2.6.26-rc4 #30
      ---------------------------------------------
      heap-overflow/2250 is trying to acquire lock:
       (&mm->page_table_lock){--..}, at: [<c0000000000cf2e8>] .copy_hugetlb_page_range+0x108/0x280
      
      but task is already holding lock:
       (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280
      
      other info that might help us debug this:
      3 locks held by heap-overflow/2250:
       #0:  (&mm->mmap_sem){----}, at: [<c000000000050e44>] .dup_mm+0x134/0x410
       #1:  (&mm->mmap_sem/1){--..}, at: [<c000000000050e54>] .dup_mm+0x144/0x410
       #2:  (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280
      
      stack backtrace:
      Call Trace:
      [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable)
      [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34
      [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080
      [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0
      [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80
      [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280
      [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790
      [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410
      [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020
      [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310
      [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80
      [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc
      
      Fix is the same way that mm/memory.c copy_page_range does the
      lockdep annotation.
      
      Acked-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46478758
  3. 29 Apr, 2008 2 commits
    • Nishanth Aravamudan's avatar
      page allocator: explicitly retry hugepage allocations · 551883ae
      Nishanth Aravamudan authored
      
      
      Add __GFP_REPEAT to hugepage allocations.  Do so to not necessitate userspace
      putting pressure on the VM by repeated echo's into /proc/sys/vm/nr_hugepages
      to grow the pool.  With the previous patch to allow for large-order
      __GFP_REPEAT attempts to loop for a bit (as opposed to indefinitely), this
      increases the likelihood of getting hugepages when the system experiences (or
      recently experienced) load.
      
      Mel tested the patchset on an x86_32 laptop.  With the patches, it was easier
      to use the proc interface to grow the hugepage pool.  The following is the
      output of a script that grows the pool as much as possible running on
      2.6.25-rc9.
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 57 pages Progress made with 57 pages
      Attempt 2: 73 pages Progress made with 16 pages
      Attempt 3: 74 pages Progress made with 1 pages
      Attempt 4: 75 pages Progress made with 1 pages
      Attempt 5: 77 pages Progress made with 2 pages
      
      77 pages was the most it allocated but it took 5 attempts from userspace
      to get it. With the 3 patches in this series applied,
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 75 pages Progress made with 75 pages
      Attempt 2: 76 pages Progress made with 1 pages
      Attempt 3: 79 pages Progress made with 3 pages
      
      And 79 pages was the most it got. Your patches were able to allocate the
      bulk of possible pages on the first attempt.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      551883ae
    • Harvey Harrison's avatar
      mm: fix integer as NULL pointer warnings · 7b8ee84d
      Harvey Harrison authored
      
      
      mm/hugetlb.c:207:11: warning: Using plain integer as NULL pointer
      
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b8ee84d
  4. 28 Apr, 2008 9 commits
    • Gerald Schaefer's avatar
      hugetlbfs: common code update for s390 · 7f2e9525
      Gerald Schaefer authored
      
      
      Huge ptes have a special type on s390 and cannot be handled with the standard
      pte functions in certain cases, e.g.  because of a different location of the
      invalid bit.  This patch adds some new architecture- specific functions to
      hugetlb common code, as a prerequisite for the s390 large page support.
      
      This won't affect other architectures in functionality, but I need to add some
      new dummy inline functions to the headers.
      
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f2e9525
    • Gerald Schaefer's avatar
      hugetlbfs: add missing TLB flush to hugetlb_cow() · 8fe627ec
      Gerald Schaefer authored
      
      
      A cow break on a hugetlbfs page with page_count > 1 will set a new pte with
      set_huge_pte_at(), w/o any tlb flush operation.  The old pte will remain in
      the tlb and subsequent write access to the page will result in a page fault
      loop, for as long as it may take until the tlb is flushed from somewhere else.
       This patch introduces an architecture-specific huge_ptep_clear_flush()
      function, which is called before the the set_huge_pte_at() in hugetlb_cow().
      
      ATTENTION: This is just a nop on all architectures for now, the s390
      implementation will come with our large page patch later.  Other architectures
      should define their own huge_ptep_clear_flush() if needed.
      
      Acked-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fe627ec
    • Lee Schermerhorn's avatar
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn authored
      
      
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • Lee Schermerhorn's avatar
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn authored
      
      
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • Adam Litke's avatar
      Subject: [PATCH] hugetlb: vmstat events for huge page allocations · 3b116300
      Adam Litke authored
      
      
      Allocating huge pages directly from the buddy allocator is not guaranteed to
      succeed.  Success depends on several factors (such as the amount of physical
      memory available and the level of fragmentation).  With the addition of
      dynamic hugetlb pool resizing, allocations can occur much more frequently.
      For these reasons it is desirable to keep track of huge page allocation
      successes and failures.
      
      Add two new vmstat entries to track huge page allocations that succeed and
      fail.  The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
      being enabled.
      
      [akpm@linux-foundation.org: reduced ifdeffery]
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarEric Munson <ebmunson@us.ibm.com>
      Tested-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b116300
    • Adam Litke's avatar
      hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages · 19fc3f0a
      Adam Litke authored
      
      
      To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
      pages, scan the page list in two parts.  First, transfer the needed pages to
      the hugetlb pool.  Then drop the lock and free the remaining pages back to the
      buddy allocator.
      
      In the common case there are zero excess pages and no lock operations are
      required.
      
      Thanks Mel Gorman for this improvement.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19fc3f0a
    • Mel Gorman's avatar
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman authored
      
      
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • Mel Gorman's avatar
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman authored
      
      
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • Mel Gorman's avatar
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman authored
      
      
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
  5. 26 Mar, 2008 2 commits
    • Nishanth Aravamudan's avatar
      hugetlb: fix potential livelock in return_unused_surplus_hugepages() · 11320d17
      Nishanth Aravamudan authored
      
      
      Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5
      and 2.6.25-rc5-mm1:
      
          BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
          NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
          REGS: c000005db12cb360 TRAP: 0901   Not tainted  (2.6.25-rc5-autokern1)
          MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 48008448  XER: 20000000
          TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
          GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
          GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
          GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
          GPR12: 0000000028000442 c000000000770080
          NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
          LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
          Call Trace:
          [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
          [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
          [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
          [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
          [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
          [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
          [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
          [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
          [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
          [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
          [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
          [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
          [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
          [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
          Instruction dump:
          ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
          60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010
      
      This was tracked down to a potential livelock in
      return_unused_surplus_hugepages().  In the case where we have surplus
      pages on some node, but no free pages on the same node, we may never
      break out of the loop. To avoid this livelock, terminate the search if
      we iterate a number of times equal to the number of online nodes without
      freeing a page.
      
      Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
      the patch.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11320d17
    • Nishanth Aravamudan's avatar
      hugetlb: indicate surplus huge page counts in per-node meminfo · a1de0919
      Nishanth Aravamudan authored
      
      
      Currently we show the surplus hugetlb pool state in /proc/meminfo, but
      not in the per-node meminfo files, even though we track the information
      on a per-node basis. Printing it there can help track down dynamic pool
      bugs including the one in the follow-on patch.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1de0919
  6. 11 Mar, 2008 1 commit
    • Adam Litke's avatar
      hugetlb: correct page count for surplus huge pages · 2668db91
      Adam Litke authored
      
      
      Free pages in the hugetlb pool are free and as such have a reference count of
      zero.  Regular allocations into the pool from the buddy are "freed" into the
      pool which results in their page_count dropping to zero.  However, surplus
      pages can be directly utilized by the caller without first being freed to the
      pool.  Therefore, a call to put_page_testzero() is in order so that such a
      page will be handed to the caller with a correct count.
      
      This has not affected end users because the bad page count is reset before the
      page is handed off.  However, under CONFIG_DEBUG_VM this triggers a BUG when
      the page count is validated.
      
      Thanks go to Mel for first spotting this issue and providing an initial fix.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2668db91
  7. 05 Mar, 2008 2 commits
    • Nishanth Aravamudan's avatar
      hugetlb: fix pool shrinking while in restricted cpuset · 348e1e04
      Nishanth Aravamudan authored
      
      
      Adam Litke noticed that currently we grow the hugepage pool independent of any
      cpuset the running process may be in, but when shrinking the pool, the cpuset
      is checked.  This leads to inconsistency when shrinking the pool in a
      restricted cpuset -- an administrator may have been able to grow the pool on a
      node restricted by a containing cpuset, but they cannot shrink it there.
      
      There are two options: either prevent growing of the pool outside of the
      cpuset or allow shrinking outside of the cpuset.  >From previous discussions
      on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
      should not be restricted by cpusets.  So allow shrinking the pool by removing
      pages from nodes outside of current's cpuset.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      348e1e04
    • Adam Litke's avatar
      hugetlb: close a difficult to trigger reservation race · ac09b3a1
      Adam Litke authored
      
      
      A hugetlb reservation may be inadequately backed in the event of racing
      allocations and frees when utilizing surplus huge pages.  Consider the
      following series of events in processes A and B:
      
       A) Allocates some surplus pages to satisfy a reservation
       B) Frees some huge pages
       A) A notices the extra free pages and drops hugetlb_lock to free some of
          its surplus pages back to the buddy allocator.
       B) Allocates some huge pages
       A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()
      
      Avoid this by commiting the reservation after pages have been allocated but
      before dropping the lock to free excess pages.  For parity, release the
      reservation in return_unused_surplus_pages().
      
      This patch also corrects the cpuset_mems_nr() error path in
      hugetlb_acct_memory().  If the cpuset check fails, uncommit the
      reservation, but also be sure to return any surplus huge pages that may
      have been allocated to back the failed reservation.
      
      Thanks to Andy Whitcroft for discovering this.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac09b3a1
  8. 24 Feb, 2008 1 commit
  9. 14 Feb, 2008 1 commit
  10. 08 Feb, 2008 1 commit
  11. 05 Feb, 2008 1 commit
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      
      
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
  12. 24 Jan, 2008 1 commit
    • Larry Woodman's avatar
      fix hugepages leak due to pagetable page sharing · c5c99429
      Larry Woodman authored
      
      
      The shared page table code for hugetlb memory on x86 and x86_64
      is causing a leak.  When a user of hugepages exits using this code
      the system leaks some of the hugepages.
      
      -------------------------------------------------------
      Part of /proc/meminfo just before database startup:
      HugePages_Total:  5500
      HugePages_Free:   5500
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      Just before shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4475
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      After shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4988
      HugePages_Rsvd:
      0 Hugepagesize:     2048 kB
      ----------------------------------------------------------
      
      The problem occurs durring a fork, in copy_hugetlb_page_range().  It
      locates the dst_pte using huge_pte_alloc().  Since huge_pte_alloc() calls
      huge_pmd_share() it will share the pmd page if can, yet the main loop in
      copy_hugetlb_page_range() does a get_page() on every hugepage.  This is a
      violation of the shared hugepmd pagetable protocol and creates additional
      referenced to the hugepages causing a leak when the unmap of the VMA
      occurs.  We can skip the entire replication of the ptes when the hugepage
      pagetables are shared.  The attached patch skips copying the ptes and the
      get_page() calls if the hugetlbpage pagetable is shared.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: default avatarLarry Woodman <lwoodman@redhat.com>
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5c99429
  13. 14 Jan, 2008 1 commit
  14. 18 Dec, 2007 2 commits
    • Nishanth Aravamudan's avatar
      Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63
      Nishanth Aravamudan authored
      This reverts commit 54f9f80d
      
       ("hugetlb:
      Add hugetlb_dynamic_pool sysctl")
      
      Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
      sysctl is not needed, as its semantics can be expressed by 0 in the
      overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
      (pool enabled).
      
      (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      368d2c63
    • Nishanth Aravamudan's avatar
      hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Nishanth Aravamudan authored
      
      
      hugetlb: introduce nr_overcommit_hugepages sysctl
      
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help make the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has allocated. Another process could then
      try grow the pool, and fail to convert a surplus huge page to a normal
      huge page and instead allocate a fresh huge page. I believe this is
      benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls are increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1c3fb1f
  15. 11 Dec, 2007 1 commit
  16. 15 Nov, 2007 8 commits
    • Ken Chen's avatar
      hugetlb: fix i_blocks accounting · 45c682a6
      Ken Chen authored
      
      
      For administrative purpose, we want to query actual block usage for
      hugetlbfs file via fstat.  Currently, hugetlbfs always return 0.  Fix that
      up since kernel already has all the information to track it properly.
      
      Signed-off-by: default avatarKen Chen <kenchen@google.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45c682a6
    • Adrian Bunk's avatar
      mm/hugetlb.c: make a function static · 8cde045c
      Adrian Bunk authored
      
      
      return_unused_surplus_pages() can become static.
      
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8cde045c
    • Adam Litke's avatar
      hugetlb: enforce quotas during reservation for shared mappings · 90d8b7e6
      Adam Litke authored
      
      
      When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
      to guarantee no problems will occur later when instantiating pages.  If quotas
      are in force, page instantiation could fail due to a race with another process
      or an oversized (but approved) shared mapping.
      
      To prevent these scenarios, debit the quota for the full reservation amount up
      front and credit the unused quota when the reservation is released.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90d8b7e6
    • Adam Litke's avatar
      hugetlb: allow bulk updating in hugetlb_*_quota() · 9a119c05
      Adam Litke authored
      
      
      Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
      allow bulk updating of the sbinfo->free_blocks counter.  This will be used by
      the next patch in the series.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a119c05
    • Adam Litke's avatar
      hugetlb: debit quota in alloc_huge_page · 2fc39cec
      Adam Litke authored
      
      
      Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
      seem out of place.  The alloc/free API is unbalanced because we handle the
      hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
      Move the get inside alloc_huge_page to clean up this disparity.
      
      This patch has been kept apart from the previous patch because of the somewhat
      dodgy ERR_PTR() use herein.  Moving the quota logic means that
      alloc_huge_page() has two failure modes.  Quota failure must result in a
      SIGBUS while a standard allocation failure is OOM.  Unfortunately, ERR_PTR()
      doesn't like the small positive errnos we have in VM_FAULT_* so they must be
      negated before they are used.
      
      Does anyone take issue with the way I am using PTR_ERR.  If so, what are your
      thoughts on how to clean this up (without needing an if,else if,else block at
      each alloc_huge_page() callsite)?
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2fc39cec
    • Adam Litke's avatar
      hugetlb: fix quota management for private mappings · c79fb75e
      Adam Litke authored
      
      
      The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
      mappings when that support was added.  Currently, quota is debited at page
      instantiation and credited at file truncation.  This approach works correctly
      for shared pages but is incomplete for private pages.  In addition to
      hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
      this function does not respect quotas.
      
      Private huge pages are treated very much like normal, anonymous pages.  They
      are not "backed" by the hugetlbfs file and are not stored in the mapping's
      radix tree.  This means that private pages are invisible to
      truncate_hugepages() so that function will not credit the quota.
      
      This patch (based on a prototype provided by Ken Chen) moves quota crediting
      for all pages into free_huge_page().  page->private is used to store a pointer
      to the mapping to which this page belongs.  This is used to credit quota on
      the appropriate hugetlbfs instance.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c79fb75e
    • Adam Litke's avatar
      hugetlb: split alloc_huge_page into private and shared components · 348ea204
      Adam Litke authored
      
      
      Hugetlbfs implements a quota system which can limit the amount of memory that
      can be used by the filesystem.  Before allocating a new huge page for a file,
      the quota is checked and debited.  The quota is then credited when truncating
      the file.  I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
      mappings.  Before detailing the problems and my proposed solutions, we should
      agree on a definition of quotas that properly addresses both private and
      shared pages.  Since the purpose of quotas is to limit total memory
      consumption on a per-filesystem basis, I argue that all pages allocated by the
      fs (private and shared) should be charged against quota.
      
      Private Mappings
      ================
      
      The current code will debit quota for private pages sometimes, but will never
      credit it.  At a minimum, this causes a leak in the quota accounting which
      renders the accounting essentially useless as it is.  Shared pages have a one
      to one mapping with a hugetlbfs file and are easy to account by debiting on
      allocation and crediting on truncate.  Private pages are anonymous in nature
      and have a many to one relationship with their hugetlbfs files (due to copy on
      write).  Because private pages are not indexed by the mapping's radix tree,
      thier quota cannot be credited at file truncation time.  Crediting must be
      done when the page is unmapped and freed.
      
      Shared Pages
      ============
      
      I discovered an issue concerning the interaction between the MAP_SHARED
      reservation system and quotas.  Since quota is not checked until page
      instantiation, an over-quota mmap/reservation will initially succeed.  When
      instantiating the first over-quota page, the program will receive SIGBUS.
      This is inconsistent since the reservation is supposed to be a guarantee.  The
      solution is to debit the full amount of quota at reservation time and credit
      the unused portion when the reservation is released.
      
      This patch series brings quotas back in line by making the following
      modifications:
       * Private pages
         - Debit quota in alloc_huge_page()
         - Credit quota in free_huge_page()
       * Shared pages
         - Debit quota for entire reservation at mmap time
         - Credit quota for instantiated pages in free_huge_page()
         - Credit quota for unused reservation at munmap time
      
      This patch:
      
      The shared page reservation and dynamic pool resizing features have made the
      allocation of private vs.  shared huge pages quite different.  By splitting
      out the private/shared-specific portions of the process into their own
      functions, readability is greatly improved.  alloc_huge_page now calls the
      proper helper and performs common operations.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      348ea204
    • Adam Litke's avatar
      hugetlb: follow_hugetlb_page() for write access · 5b23dbe8
      Adam Litke authored
      
      
      When calling get_user_pages(), a write flag is passed in by the caller to
      indicate if write access is required on the faulted-in pages.  Currently,
      follow_hugetlb_page() ignores this flag and always faults pages for
      read-only access.  This can cause data corruption because a device driver
      that calls get_user_pages() with write set will not expect COW faults to
      occur on the returned pages.
      
      This patch passes the write flag down to follow_hugetlb_page() and makes
      sure hugetlb_fault() is called with the right write_access parameter.
      
      [ezk@cs.sunysb.edu: build fix]
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Reviewed-by: default avatarKen Chen <kenchen@google.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarErez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b23dbe8
  17. 19 Oct, 2007 1 commit
  18. 18 Oct, 2007 1 commit
  19. 16 Oct, 2007 2 commits
    • Adam Litke's avatar
      hugetlb: fix dynamic pool resize failure case · af767cbd
      Adam Litke authored
      
      
      When gather_surplus_pages() fails to allocate enough huge pages to satisfy
      the requested reservation, it frees what it did allocate back to the buddy
      allocator.  put_page() should be called instead of update_and_free_page()
      to ensure that pool counters are updated as appropriate and the page's
      refcount is decremented.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af767cbd
    • Nishanth Aravamudan's avatar
      hugetlb: fix hugepage allocation with memoryless nodes · 63b4613c
      Nishanth Aravamudan authored
      Anton found a problem with the hugetlb pool allocation when some nodes have
      no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2
      
      ).  Lee worked
      on versions that tried to fix it, but none were accepted.  Christoph has
      created a set of patches which allow for GFP_THISNODE allocations to fail
      if the node has no memory.
      
      Currently, alloc_fresh_huge_page() returns NULL when it is not able to
      allocate a huge page on the current node, as specified by its custom
      interleave variable.  The callers of this function, though, assume that a
      failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
      on the system period.  This might not be the case, for instance, if we have
      an uneven NUMA system, and we happen to try to allocate a hugepage on a
      node with less memory and fail, while there is still plenty of free memory
      on the other nodes.
      
      To correct this, make alloc_fresh_huge_page() search through all online
      nodes before deciding no hugepages can be allocated.  Add a helper function
      for actually allocating the hugepage.  Use a new global nid iterator to
      control which nid to allocate on.
      
      Note: we expect particular semantics for __GFP_THISNODE, which are now
      enforced even for memoryless nodes.  That is, there is should be no
      fallback to other nodes.  Therefore, we rely on the nid passed into
      alloc_pages_node() to be the nid the page comes from.  If this is
      incorrect, accounting will break.
      
      Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
      memoryless nodes).
      
      Before on the ppc64 box:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     25
      Node 1 HugePages_Free:     75
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:    150
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      
      After:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:     50
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:    100
      Node 1 HugePages_Free:    100
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63b4613c