1. 24 Feb, 2008 1 commit
  2. 14 Feb, 2008 1 commit
  3. 08 Feb, 2008 1 commit
  4. 05 Feb, 2008 1 commit
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      
      
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
  5. 24 Jan, 2008 1 commit
    • Larry Woodman's avatar
      fix hugepages leak due to pagetable page sharing · c5c99429
      Larry Woodman authored
      
      
      The shared page table code for hugetlb memory on x86 and x86_64
      is causing a leak.  When a user of hugepages exits using this code
      the system leaks some of the hugepages.
      
      -------------------------------------------------------
      Part of /proc/meminfo just before database startup:
      HugePages_Total:  5500
      HugePages_Free:   5500
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      Just before shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4475
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      After shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4988
      HugePages_Rsvd:
      0 Hugepagesize:     2048 kB
      ----------------------------------------------------------
      
      The problem occurs durring a fork, in copy_hugetlb_page_range().  It
      locates the dst_pte using huge_pte_alloc().  Since huge_pte_alloc() calls
      huge_pmd_share() it will share the pmd page if can, yet the main loop in
      copy_hugetlb_page_range() does a get_page() on every hugepage.  This is a
      violation of the shared hugepmd pagetable protocol and creates additional
      referenced to the hugepages causing a leak when the unmap of the VMA
      occurs.  We can skip the entire replication of the ptes when the hugepage
      pagetables are shared.  The attached patch skips copying the ptes and the
      get_page() calls if the hugetlbpage pagetable is shared.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: default avatarLarry Woodman <lwoodman@redhat.com>
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c5c99429
  6. 14 Jan, 2008 1 commit
  7. 18 Dec, 2007 2 commits
    • Nishanth Aravamudan's avatar
      Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63
      Nishanth Aravamudan authored
      This reverts commit 54f9f80d
      
       ("hugetlb:
      Add hugetlb_dynamic_pool sysctl")
      
      Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
      sysctl is not needed, as its semantics can be expressed by 0 in the
      overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
      (pool enabled).
      
      (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      368d2c63
    • Nishanth Aravamudan's avatar
      hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Nishanth Aravamudan authored
      
      
      hugetlb: introduce nr_overcommit_hugepages sysctl
      
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help make the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has allocated. Another process could then
      try grow the pool, and fail to convert a surplus huge page to a normal
      huge page and instead allocate a fresh huge page. I believe this is
      benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls are increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1c3fb1f
  8. 11 Dec, 2007 1 commit
  9. 15 Nov, 2007 8 commits
    • Ken Chen's avatar
      hugetlb: fix i_blocks accounting · 45c682a6
      Ken Chen authored
      
      
      For administrative purpose, we want to query actual block usage for
      hugetlbfs file via fstat.  Currently, hugetlbfs always return 0.  Fix that
      up since kernel already has all the information to track it properly.
      
      Signed-off-by: default avatarKen Chen <kenchen@google.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      45c682a6
    • Adrian Bunk's avatar
      mm/hugetlb.c: make a function static · 8cde045c
      Adrian Bunk authored
      
      
      return_unused_surplus_pages() can become static.
      
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8cde045c
    • Adam Litke's avatar
      hugetlb: enforce quotas during reservation for shared mappings · 90d8b7e6
      Adam Litke authored
      
      
      When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
      to guarantee no problems will occur later when instantiating pages.  If quotas
      are in force, page instantiation could fail due to a race with another process
      or an oversized (but approved) shared mapping.
      
      To prevent these scenarios, debit the quota for the full reservation amount up
      front and credit the unused quota when the reservation is released.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      90d8b7e6
    • Adam Litke's avatar
      hugetlb: allow bulk updating in hugetlb_*_quota() · 9a119c05
      Adam Litke authored
      
      
      Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
      allow bulk updating of the sbinfo->free_blocks counter.  This will be used by
      the next patch in the series.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a119c05
    • Adam Litke's avatar
      hugetlb: debit quota in alloc_huge_page · 2fc39cec
      Adam Litke authored
      
      
      Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
      seem out of place.  The alloc/free API is unbalanced because we handle the
      hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
      Move the get inside alloc_huge_page to clean up this disparity.
      
      This patch has been kept apart from the previous patch because of the somewhat
      dodgy ERR_PTR() use herein.  Moving the quota logic means that
      alloc_huge_page() has two failure modes.  Quota failure must result in a
      SIGBUS while a standard allocation failure is OOM.  Unfortunately, ERR_PTR()
      doesn't like the small positive errnos we have in VM_FAULT_* so they must be
      negated before they are used.
      
      Does anyone take issue with the way I am using PTR_ERR.  If so, what are your
      thoughts on how to clean this up (without needing an if,else if,else block at
      each alloc_huge_page() callsite)?
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2fc39cec
    • Adam Litke's avatar
      hugetlb: fix quota management for private mappings · c79fb75e
      Adam Litke authored
      
      
      The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
      mappings when that support was added.  Currently, quota is debited at page
      instantiation and credited at file truncation.  This approach works correctly
      for shared pages but is incomplete for private pages.  In addition to
      hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
      this function does not respect quotas.
      
      Private huge pages are treated very much like normal, anonymous pages.  They
      are not "backed" by the hugetlbfs file and are not stored in the mapping's
      radix tree.  This means that private pages are invisible to
      truncate_hugepages() so that function will not credit the quota.
      
      This patch (based on a prototype provided by Ken Chen) moves quota crediting
      for all pages into free_huge_page().  page->private is used to store a pointer
      to the mapping to which this page belongs.  This is used to credit quota on
      the appropriate hugetlbfs instance.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c79fb75e
    • Adam Litke's avatar
      hugetlb: split alloc_huge_page into private and shared components · 348ea204
      Adam Litke authored
      
      
      Hugetlbfs implements a quota system which can limit the amount of memory that
      can be used by the filesystem.  Before allocating a new huge page for a file,
      the quota is checked and debited.  The quota is then credited when truncating
      the file.  I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
      mappings.  Before detailing the problems and my proposed solutions, we should
      agree on a definition of quotas that properly addresses both private and
      shared pages.  Since the purpose of quotas is to limit total memory
      consumption on a per-filesystem basis, I argue that all pages allocated by the
      fs (private and shared) should be charged against quota.
      
      Private Mappings
      ================
      
      The current code will debit quota for private pages sometimes, but will never
      credit it.  At a minimum, this causes a leak in the quota accounting which
      renders the accounting essentially useless as it is.  Shared pages have a one
      to one mapping with a hugetlbfs file and are easy to account by debiting on
      allocation and crediting on truncate.  Private pages are anonymous in nature
      and have a many to one relationship with their hugetlbfs files (due to copy on
      write).  Because private pages are not indexed by the mapping's radix tree,
      thier quota cannot be credited at file truncation time.  Crediting must be
      done when the page is unmapped and freed.
      
      Shared Pages
      ============
      
      I discovered an issue concerning the interaction between the MAP_SHARED
      reservation system and quotas.  Since quota is not checked until page
      instantiation, an over-quota mmap/reservation will initially succeed.  When
      instantiating the first over-quota page, the program will receive SIGBUS.
      This is inconsistent since the reservation is supposed to be a guarantee.  The
      solution is to debit the full amount of quota at reservation time and credit
      the unused portion when the reservation is released.
      
      This patch series brings quotas back in line by making the following
      modifications:
       * Private pages
         - Debit quota in alloc_huge_page()
         - Credit quota in free_huge_page()
       * Shared pages
         - Debit quota for entire reservation at mmap time
         - Credit quota for instantiated pages in free_huge_page()
         - Credit quota for unused reservation at munmap time
      
      This patch:
      
      The shared page reservation and dynamic pool resizing features have made the
      allocation of private vs.  shared huge pages quite different.  By splitting
      out the private/shared-specific portions of the process into their own
      functions, readability is greatly improved.  alloc_huge_page now calls the
      proper helper and performs common operations.
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      348ea204
    • Adam Litke's avatar
      hugetlb: follow_hugetlb_page() for write access · 5b23dbe8
      Adam Litke authored
      
      
      When calling get_user_pages(), a write flag is passed in by the caller to
      indicate if write access is required on the faulted-in pages.  Currently,
      follow_hugetlb_page() ignores this flag and always faults pages for
      read-only access.  This can cause data corruption because a device driver
      that calls get_user_pages() with write set will not expect COW faults to
      occur on the returned pages.
      
      This patch passes the write flag down to follow_hugetlb_page() and makes
      sure hugetlb_fault() is called with the right write_access parameter.
      
      [ezk@cs.sunysb.edu: build fix]
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Reviewed-by: default avatarKen Chen <kenchen@google.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarErez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b23dbe8
  10. 19 Oct, 2007 1 commit
  11. 18 Oct, 2007 1 commit
  12. 16 Oct, 2007 8 commits
    • Adam Litke's avatar
      hugetlb: fix dynamic pool resize failure case · af767cbd
      Adam Litke authored
      
      
      When gather_surplus_pages() fails to allocate enough huge pages to satisfy
      the requested reservation, it frees what it did allocate back to the buddy
      allocator.  put_page() should be called instead of update_and_free_page()
      to ensure that pool counters are updated as appropriate and the page's
      refcount is decremented.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af767cbd
    • Nishanth Aravamudan's avatar
      hugetlb: fix hugepage allocation with memoryless nodes · 63b4613c
      Nishanth Aravamudan authored
      Anton found a problem with the hugetlb pool allocation when some nodes have
      no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2
      
      ).  Lee worked
      on versions that tried to fix it, but none were accepted.  Christoph has
      created a set of patches which allow for GFP_THISNODE allocations to fail
      if the node has no memory.
      
      Currently, alloc_fresh_huge_page() returns NULL when it is not able to
      allocate a huge page on the current node, as specified by its custom
      interleave variable.  The callers of this function, though, assume that a
      failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
      on the system period.  This might not be the case, for instance, if we have
      an uneven NUMA system, and we happen to try to allocate a hugepage on a
      node with less memory and fail, while there is still plenty of free memory
      on the other nodes.
      
      To correct this, make alloc_fresh_huge_page() search through all online
      nodes before deciding no hugepages can be allocated.  Add a helper function
      for actually allocating the hugepage.  Use a new global nid iterator to
      control which nid to allocate on.
      
      Note: we expect particular semantics for __GFP_THISNODE, which are now
      enforced even for memoryless nodes.  That is, there is should be no
      fallback to other nodes.  Therefore, we rely on the nid passed into
      alloc_pages_node() to be the nid the page comes from.  If this is
      incorrect, accounting will break.
      
      Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
      memoryless nodes).
      
      Before on the ppc64 box:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     25
      Node 1 HugePages_Free:     75
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:    150
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      
      After:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:     50
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:    100
      Node 1 HugePages_Free:    100
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      
      Signed-off-by: default avatarNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63b4613c
    • Adam Litke's avatar
      hugetlb: fix pool resizing corner case · 6b0c880d
      Adam Litke authored
      
      
      When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
      are careful to keep enough pages around to satisfy reservations.  But the
      calculation is flawed for the following scenario:
      
      Action                          Pool Counters (Total, Free, Resv)
      ======                          =============
      Set pool to 1 page              1 1 0
      Map 1 page MAP_PRIVATE          1 1 0
      Touch the page to fault it in   1 0 0
      Set pool to 3 pages             3 2 0
      Map 2 pages MAP_SHARED          3 2 2
      Set pool to 2 pages             2 1 2 <-- Mistake, should be 3 2 2
      Touch the 2 shared pages        2 0 1 <-- Program crashes here
      
      The last touch above will terminate the process due to lack of huge pages.
      
      This patch corrects the calculation so that it factors in pages being used
      for private mappings.  Andrew, this is a standalone fix suitable for
      mainline.  It is also now corrected in my latest dynamic pool resizing
      patchset which I will send out soon.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarKen Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b0c880d
    • Adam Litke's avatar
      hugetlb: Add hugetlb_dynamic_pool sysctl · 54f9f80d
      Adam Litke authored
      
      
      The maximum size of the huge page pool can be controlled using the overall
      size of the hugetlb filesystem (via its 'size' mount option).  However in the
      common case the this will not be set as the pool is traditionally fixed in
      size at boot time.  In order to maintain the expected semantics, we need to
      prevent the pool expanding by default.
      
      This patch introduces a new sysctl controlling dynamic pool resizing.  When
      this is enabled the pool will expand beyond its base size up to the size of
      the hugetlb filesystem.  It is disabled by default.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarDave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54f9f80d
    • Adam Litke's avatar
      hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings · e4e574b7
      Adam Litke authored
      
      
      Shared mappings require special handling because the huge pages needed to
      fully populate the VMA must be reserved at mmap time.  If not enough pages are
      available when making the reservation, allocate all of the shortfall at once
      from the buddy allocator and add the pages directly to the hugetlb pool.  If
      they cannot be allocated, then fail the mapping.  The page surplus is
      accounted for in the same way as for private mappings; faulted surplus pages
      will be freed at unmap time.  Reserved, surplus pages that have not been used
      must be freed separately when their reservation has been released.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarDave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4e574b7
    • Adam Litke's avatar
      hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings · 7893d1d5
      Adam Litke authored
      
      
      Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that
      the hugetlb pool will be exhausted or completely reserved when a hugepage is
      needed to satisfy a page fault.  Before killing the process in this situation,
      try to allocate a hugepage directly from the buddy allocator.
      
      The explicitly configured pool size becomes a low watermark.  When dynamically
      grown, the allocated huge pages are accounted as a surplus over the watermark.
       As huge pages are freed on a node, surplus pages are released to the buddy
      allocator so that the pool will shrink back to the watermark.
      
      Surplus accounting also allows for friendlier explicit pool resizing.  When
      shrinking a pool that is fully in-use, increase the surplus so pages will be
      returned to the buddy allocator as soon as they are freed.  When growing a
      pool that has a surplus, consume the surplus first and then allocate new
      pages.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarDave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7893d1d5
    • Adam Litke's avatar
      hugetlb: Move update_and_free_page · 6af2acb6
      Adam Litke authored
      Dynamic huge page pool resizing.
      
      In most real-world scenarios, configuring the size of the hugetlb pool
      correctly is a difficult task.  If too few pages are allocated to the pool,
      applications using MAP_SHARED may fail to mmap() a hugepage region and
      applications using MAP_PRIVATE may receive SIGBUS.  Isolating too much memory
      in the hugetlb pool means it is not available for other uses, especially those
      programs not using huge pages.
      
      The obvious answer is to let the hugetlb pool grow and shrink in response to
      the runtime demand for huge pages.  The work Mel Gorman has been doing to
      establish a memory zone for movable memory allocations makes dynamically
      resizing the hugetlb pool reliable within the limits of that zone.  This patch
      series implements dynamic pool resizing for private and shared mappings while
      being careful to maintain existing semantics.  Please reply with your comments
      and feedback; even just to say whether it would be a useful feature to you.
      Thanks.
      
      How it works
      ============
      
      Upon depletion of the hugetlb pool, rather than reporting an error immediately,
      first try and allocate the needed huge pages directly from the buddy allocator.
      Care must be taken to avoid unbounded growth of the hugetlb pool, so the
      hugetlb filesystem quota is used to limit overall pool size.
      
      The real work begins when we decide there is a shortage of huge pages.  What
      happens next depends on whether the pages are for a private or shared mapping.
      Private mappings are straightforward.  At fault time, if alloc_huge_page()
      fails, we allocate a page from the buddy allocator and increment the source
      node's surplus_huge_pages counter.  When free_huge_page() is called for a page
      on a node with a surplus, the page is freed directly to the buddy allocator
      instead of the hugetlb pool.
      
      Because shared mappings require all of the pages to be reserved up front, some
      additional work must be done at mmap() to support them.  We determine the
      reservation shortage and allocate the required number of pages all at once.
      These pages are then added to the hugetlb pool and marked reserved.  Where that
      is not possible the mmap() will fail.  As with private mappings, the
      appropriate surplus counters are updated.  Since reserved huge pages won't
      necessarily be used by the process, we can't be sure that free_huge_page() will
      always be called to return surplus pages to the buddy allocator.  To prevent
      the huge page pool from bloating, we must free unused surplus pages when their
      reservation has ended.
      
      Controlling it
      ==============
      
      With the entire patch series applied, pool resizing is off by default so unless
      specific action is taken, the semantics are unchanged.
      
      To take advantage of the flexibility afforded by this patch series one must
      tolerate a change in semantics.  To control hugetlb pool growth, the following
      techniques can be employed:
      
       * A sysctl tunable to enable/disable the feature entirely
       * The size= mount option for hugetlbfs filesystems to limit pool size
      
      Performance
      ===========
      
      When contiguous memory is readily available, it is expected that the cost of
      dynamicly resizing the pool will be small.  This series has been performance
      tested with 'stream' to measure this cost.
      
      Stream (http://www.cs.virginia.edu/stream/
      
      ) was linked with libhugetlbfs to
      enable remapping of the text and data/bss segments into huge pages.
      
      Stream with small array
      -----------------------
      Baseline: 	nr_hugepages = 0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 5, Text and data/bss remapping
      Dynamic:	nr_hugepages = 0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		4695.6266	5942.8371	5982.2287
      Scale:		4451.5776	5017.1419	5658.7843
      Add:		5815.8849	7927.7827	8119.3552
      Triad:		5949.4144	8527.6492	8110.6903
      
      Stream with large array
      -----------------------
      Baseline: 	nr_hugepages =  0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 67, Text and data/bss remapping
      Dynamic:	nr_hugepages =  0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		2227.8281	2544.2732	2546.4947
      Scale:		2136.3208	2430.7294	2421.2074
      Add:		2773.1449	4004.0021	3999.4331
      Triad:		2748.4502	3777.0109	3773.4970
      
      * All numbers are averages taken from 10 consecutive runs with a maximum
        standard deviation of 1.3 percent noted.
      
      This patch:
      
      Simply move update_and_free_page() so that it can be reused later in this
      patch series.  The implementation is not changed.
      
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Acked-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarDave McCracken <dave.mccracken@oracle.com>
      Acked-by: default avatarWilliam Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6af2acb6
    • KAMEZAWA Hiroyuki's avatar
      flush icache before set_pte() on ia64: flush icache at set_pte · 954ffcb3
      KAMEZAWA Hiroyuki authored
      Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
      set_pte().  This is too late.  This patch removes lazy_mmu_prot_update and
      add modfied set_pte() for flushing if necessary.
      
      This patch flush icache of a page when
      	new pte has exec bit.
      	&& new pte has present bit
      	&& new pte is user's page.
      	&& (old *ptep is not present
                  || new pte's pfn is not same to old *ptep's ptn)
      	&& new pte's page has no Pg_arch_1 bit.
      	   Pg_arch_1 is set when a page is cache consistent.
      
      I think this condition checks are much easier to understand than considering
      "Where sync_icache_dcache() should be inserted ?".
      
      pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67
      
       as
      clean-up. So, I added it again.
      
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      954ffcb3
  13. 01 Oct, 2007 1 commit
  14. 19 Sep, 2007 1 commit
    • Lee Schermerhorn's avatar
      Fix NUMA Memory Policy Reference Counting · 480eccf9
      Lee Schermerhorn authored
      
      
      This patch proposes fixes to the reference counting of memory policy in the
      page allocation paths and in show_numa_map().  Extracted from my "Memory
      Policy Cleanups and Enhancements" series as stand-alone.
      
      Shared policy lookup [shmem] has always added a reference to the policy,
      but this was never unrefed after page allocation or after formatting the
      numa map data.
      
      Default system policy should not require additional ref counting, nor
      should the current task's task policy.  However, show_numa_map() calls
      get_vma_policy() to examine what may be [likely is] another task's policy.
      The latter case needs protection against freeing of the policy.
      
      This patch adds a reference count to a mempolicy returned by
      get_vma_policy() when the policy is a vma policy or another task's
      mempolicy.  Again, shared policy is already reference counted on lookup.  A
      matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
      shared and vma policies, and in show_numa_map() for shared and another
      task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
      inexpensive inline NULL test, because we know we have a non-NULL policy.
      
      Handling policy ref counts for hugepages is a bit trickier.
      huge_zonelist() returns a zone list that might come from a shared or vma
      'BIND policy.  In this case, we should hold the reference until after the
      huge page allocation in dequeue_hugepage().  The patch modifies
      huge_zonelist() to return a pointer to the mempolicy if it needs to be
      unref'd after allocation.
      
      Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
      
      		w/o patch	w/ refcount patch
      	    Avg	  Std Devn	   Avg	  Std Devn
      Real:	 100.59	    0.38	 100.63	    0.43
      User:	1209.60	    0.37	1209.91	    0.31
      System:   81.52	    0.42	  81.64	    0.34
      
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarAndi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      480eccf9
  15. 23 Aug, 2007 1 commit
  16. 24 Jul, 2007 1 commit
  17. 19 Jul, 2007 5 commits
    • Akinobu Mita's avatar
      hugetlb: use set_compound_page_dtor · f8af0bb8
      Akinobu Mita authored
      
      
      Use appropriate accessor function to set compound page destructor
      function.
      
      Cc:  William Irwin <wli@holomorphy.com>
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: default avatarAdam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8af0bb8
    • Hugh Dickins's avatar
      Remove nid_lock from alloc_fresh_huge_page · 7ed5cb2b
      Hugh Dickins authored
      
      
      The fix to that race in alloc_fresh_huge_page() which could give an illegal
      node ID did not need nid_lock at all: the fix was to replace static int nid
      by static int prev_nid and do the work on local int nid.  nid_lock did make
      sure that racers strictly roundrobin the nodes, but that's not something we
      need to enforce strictly.  Kill nid_lock.
      
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ed5cb2b
    • Andrew Morton's avatar
      dequeue_huge_page() warning fix · 3abf7afd
      Andrew Morton authored
      
      
      mm/hugetlb.c: In function `dequeue_huge_page':
      mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3abf7afd
    • Nick Piggin's avatar
      mm: fault feedback #2 · 83c54070
      Nick Piggin authored
      
      
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: default avatarKyle McMartin <kyle@mcmartin.ca>
      Acked-by: default avatarHaavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Acked-by: default avatarAndi Kleen <ak@muc.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • Nick Piggin's avatar
      mm: fault feedback #1 · d0217ac0
      Nick Piggin authored
      
      
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
  18. 17 Jul, 2007 2 commits
  19. 16 Jul, 2007 2 commits