1. 01 Feb, 2018 3 commits
  2. 30 Nov, 2017 1 commit
    • Dan Williams's avatar
      mm: fix device-dax pud write-faults triggered by get_user_pages() · 1501899a
      Dan Williams authored
      Currently only get_user_pages_fast() can safely handle the writable gup
      case due to its use of pud_access_permitted() to check whether the pud
      entry is writable.  In the gup slow path pud_write() is used instead of
      pud_access_permitted() and to date it has been unimplemented, just calls
      BUG_ON().
      
          kernel BUG at ./include/linux/hugetlb.h:244!
          [..]
          RIP: 0010:follow_devmap_pud+0x482/0x490
          [..]
          Call Trace:
           follow_page_mask+0x28c/0x6e0
           __get_user_pages+0xe4/0x6c0
           get_user_pages_unlocked+0x130/0x1b0
           get_user_pages_fast+0x89/0xb0
           iov_iter_get_pages_alloc+0x114/0x4a0
           nfs_direct_read_schedule_iovec+0xd2/0x350
           ? nfs_start_io_direct+0x63/0x70
           nfs_file_direct_read+0x1e0/0x250
           nfs_file_read+0x90/0xc0
      
      For now this just implements a simple check for the _PAGE_RW bit similar
      to pmd_write.  However, this implies that the gup-slow-path check is
      missing the extra checks that the gup-fast-path performs with
      pud_access_permitted.  Later patches will align all checks to use the
      'access_permitted' helper if the architecture provides it.
      
      Note that the generic 'access_permitted' helper fallback is the simple
      _PAGE_RW check on architectures that do not define the
      'access_permitted' helper(s).
      
      [dan.j.williams@intel.com: fix powerpc compile error]
        Link: http://lkml.kernel.org/r/151129126165.37405.16031785266675461397.stgit@dwillia2-desk3.amr.corp.intel.com
      Link: http://lkml.kernel.org/r/151043109938.2842.14834662818213616199.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: a00cc7d9
      
       ("mm, x86: add support for PUD-sized transparent hugepages")
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1501899a
  3. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard...
      b2441318
  4. 15 Aug, 2017 1 commit
    • Aneesh Kumar K.V's avatar
      mm/hugetlb: Allow arch to override and call the weak function · e24a1307
      Aneesh Kumar K.V authored
      
      
      When running in guest mode ppc64 supports a different mechanism for hugetlb
      allocation/reservation. The LPAR management application called HMC can
      be used to reserve a set of hugepages and we pass the details of
      reserved pages via device tree to the guest. (more details in
      htab_dt_scan_hugepage_blocks()) . We do the memblock_reserve of the range
      and later in the boot sequence, we add the reserved range to huge_boot_pages.
      
      But to enable 16G hugetlb on baremetal config (when we are not running as guest)
      we want to do memblock reservation during boot. Generic code already does this
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e24a1307
  5. 10 Jul, 2017 5 commits
    • Michal Hocko's avatar
      hugetlb: add support for preferred node to alloc_huge_page_nodemask · 3e59fcb0
      Michal Hocko authored
      alloc_huge_page_nodemask tries to allocate from any numa node in the
      allowed node mask starting from lower numa nodes.  This might lead to
      filling up those low NUMA nodes while others are not used.  We can
      reduce this risk by introducing a concept of the preferred node similar
      to what we have in the regular page allocator.  We will start allocating
      from the preferred nid and then iterate over all allowed nodes in the
      zonelist order until we try them all.
      
      This is mimicing the page allocator logic except it operates on per-node
      mempools.  dequeue_huge_page_vma already does this so distill the
      zonelist logic into a more generic dequeue_huge_page_nodemask and use it
      in alloc_huge_page_nodemask.
      
      This will allow us to use proper per numa distance fallback also for
      alloc_huge_page_node which can use alloc_huge_page_nodemask now and we
      can get rid of alloc_huge_page_node helper which doesn't have any user
      anymore.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e59fcb0
    • Michal Hocko's avatar
      mm, hugetlb: unclutter hugetlb allocation layers · aaf14e40
      Michal Hocko authored
      Patch series "mm, hugetlb: allow proper node fallback dequeue".
      
      While working on a hugetlb migration issue addressed in a separate
      patchset[1] I have noticed that the hugetlb allocations from the
      preallocated pool are quite subotimal.
      
       [1] //lkml.kernel.org/r/20170608074553.22152-1-mhocko@kernel.org
      
      There is no fallback mechanism implemented and no notion of preferred
      node.  I have tried to work around it but Vlastimil was right to push
      back for a more robust solution.  It seems that such a solution is to
      reuse zonelist approach we use for the page alloctor.
      
      This series has 3 patches.  The first one tries to make hugetlb
      allocation layers more clear.  The second one implements the zonelist
      hugetlb pool allocation and introduces a preferred node semantic which
      is used by the migration callbacks.  The last patch is a clean up.
      
      This patch (of 3):
      
      Hugetlb allocation path for fresh huge pages is unnecessarily complex
      and it mixes different interfaces between layers.
      
      __alloc_buddy_huge_page is the central place to perform a new
      allocation.  It checks for the hugetlb overcommit and then relies on
      __hugetlb_alloc_buddy_huge_page to invoke the page allocator.  This is
      all good except that __alloc_buddy_huge_page pushes vma and address down
      the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with
      two different allocation modes - one for memory policy and other node
      specific (or to make it more obscure node non-specific) requests.
      
      This just screams for a reorganization.
      
      This patch pulls out all the vma specific handling up to
      __alloc_buddy_huge_page_with_mpol where it belongs.
      __alloc_buddy_huge_page will get nodemask argument and
      __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the
      page allocator.
      
      In short:
      __alloc_buddy_huge_page_with_mpol - memory policy handling
        __alloc_buddy_huge_page - overcommit handling and accounting
          __hugetlb_alloc_buddy_huge_page - page allocator layer
      
      Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop
      is not really needed because the page allocator already handles the
      cpusets update.
      
      Finally __hugetlb_alloc_buddy_huge_page had a special case for node
      specific allocations (when no policy is applied and there is a node
      given).  This has relied on __GFP_THISNODE to not fallback to a different
      node.  alloc_huge_page_node is the only caller which relies on this
      behavior so move the __GFP_THISNODE there.
      
      Not only does this remove quite some code it also should make those
      layers easier to follow and clear wrt responsibilities.
      
      Link: http://lkml.kernel.org/r/20170622193034.28972-2-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Tested-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aaf14e40
    • Michal Hocko's avatar
      hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Michal Hocko authored
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite subotpimal for any
      configuration when hugetlb pages are no distributed to all NUMA nodes
      evenly.  Say we have a hotplugable node 4 and spare hugetlb pages are
      node 0
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes migration source and try
      to find a first node which has a page in the preallocated pool first and
      fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
      consumed.
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4db9b2ef
    • Naoya Horiguchi's avatar
      ddd40d8a
    • Anshuman Khandual's avatar
      mm: hugetlb: soft-offline: dissolve source hugepage after successful migration · c3114a84
      Anshuman Khandual authored
      Currently hugepage migrated by soft-offline (i.e.  due to correctable
      memory errors) is contained as a hugepage, which means many non-error
      pages in it are unreusable, i.e.  wasted.
      
      This patch solves this issue by dissolving source hugepages into buddy.
      As done in previous patch, PageHWPoison is set only on a head page of
      the error hugepage.  Then in dissoliving we move the PageHWPoison flag
      to the raw error page so that all healthy subpages return back to buddy.
      
      [arnd@arndb.de: fix warnings: replace some macros with inline functions]
        Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
      
      Signed-off-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c3114a84
  6. 06 Jul, 2017 8 commits
  7. 09 Mar, 2017 1 commit
  8. 23 Feb, 2017 2 commits
  9. 08 Oct, 2016 2 commits
  10. 21 May, 2016 2 commits
  11. 20 May, 2016 1 commit
    • Vaishali Thakkar's avatar
      mm/hugetlb: introduce hugetlb_bad_size() · 9fee021d
      Vaishali Thakkar authored
      
      
      When any unsupported hugepage size is specified, 'hugepagesz=' and
      'hugepages=' should be ignored during command line parsing until any
      supported hugepage size is found.  But currently incorrect number of
      hugepages are allocated when unsupported size is specified as it fails
      to ignore the 'hugepages=' command.
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
      After boot, dmesg output shows that X number of hugepages of the size 2M
      is pre-allocated instead of 0.
      
      So, to handle such command line options, introduce new routine
      hugetlb_bad_size.  The routine hugetlb_bad_size sets the global variable
      parsed_valid_hugepagesz.  We are using parsed_valid_hugepagesz to save
      the state when unsupported hugepagesize is found so that we can ignore
      the 'hugepages=' parameters after that and then reset the variable when
      supported hugepage size is found.
      
      The routine hugetlb_bad_size can be called while setting 'hugepagesz='
      parameter in an architecture specific code.
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fee021d
  12. 16 Jan, 2016 1 commit
  13. 15 Jan, 2016 1 commit
  14. 21 Dec, 2015 1 commit
    • David Woods's avatar
      arm64: hugetlb: add support for PTE contiguous bit · 66b3923a
      David Woods authored
      
      
      The arm64 MMU supports a Contiguous bit which is a hint that the TTE
      is one of a set of contiguous entries which can be cached in a single
      TLB entry.  Supporting this bit adds new intermediate huge page sizes.
      
      The set of huge page sizes available depends on the base page size.
      Without using contiguous pages the huge page sizes are as follows.
      
       4KB:   2MB  1GB
      64KB: 512MB
      
      With a 4KB granule, the contiguous bit groups together sets of 16 pages
      and with a 64KB granule it groups sets of 32 pages.  This enables two new
      huge page sizes in each case, so that the full set of available sizes
      is as follows.
      
       4KB:  64KB   2MB  32MB  1GB
      64KB:   2MB 512MB  16GB
      
      If a 16KB granule is used then the contiguous bit groups 128 pages
      at the PTE level and 32 pages at the PMD level.
      
      If the base page size is set to 64KB then 2MB pages are enabled by
      default.  It is possible in the future to make 2MB the default huge
      page size for both 4KB and 64KB granules.
      Reviewed-by: default avatarChris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: default avatarSteve Capper <steve.capper@linaro.org>
      Signed-off-by: default avatarDavid Woods <dwoods@ezchip.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      66b3923a
  15. 06 Nov, 2015 1 commit
  16. 08 Sep, 2015 5 commits
    • Mike Kravetz's avatar
      hugetlbfs: add hugetlbfs_fallocate() · 70c3547e
      Mike Kravetz authored
      
      
      This is based on the shmem version, but it has diverged quite a bit.  We
      have no swap to worry about, nor the new file sealing.  Add
      synchronication via the fault mutex table to coordinate page faults,
      fallocate allocation and fallocate hole punch.
      
      What this allows us to do is move physical memory in and out of a
      hugetlbfs file without having it mapped.  This also gives us the ability
      to support MADV_REMOVE since it is currently implemented using
      fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
      a hugetlbfs file, which wasn't possible before.
      
      hugetlbfs fallocate only operates on whole huge pages.
      
      Based on code by Dave Hansen.
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70c3547e
    • Mike Kravetz's avatar
      hugetlbfs: New huge_add_to_page_cache helper routine · ab76ad54
      Mike Kravetz authored
      
      
      Currently, there is only a single place where hugetlbfs pages are added
      to the page cache.  The new fallocate code be adding a second one, so
      break the functionality out into its own helper.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab76ad54
    • Mike Kravetz's avatar
      hugetlbfs: truncate_hugepages() takes a range of pages · b5cec28d
      Mike Kravetz authored
      
      
      Modify truncate_hugepages() to take a range of pages (start, end)
      instead of simply start.  If an end value of LLONG_MAX is passed, the
      current "truncate" functionality is maintained.  Existing callers are
      modified to pass LLONG_MAX as end of range.  By keying off end ==
      LLONG_MAX, the routine behaves differently for truncate and hole punch.
      Page removal is now synchronized with page allocation via faults by
      using the fault mutex table.  The hole punch case can experience the
      rare region_del error and must handle accordingly.
      
      Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
      the case where region_del returns an error.
      
      Since the routine handles more than just the truncate case, it is
      renamed to remove_inode_hugepages().  To be consistent, the routine
      truncate_huge_page() is renamed remove_huge_page().
      
      Downstream of remove_inode_hugepages(), the routine
      hugetlb_unreserve_pages() is also modified to take a range of pages.
      hugetlb_unreserve_pages is modified to detect an error from region_del and
      pass it back to the caller.
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5cec28d
    • Mike Kravetz's avatar
      mm/hugetlb: expose hugetlb fault mutex for use by fallocate · c672c7f2
      Mike Kravetz authored
      
      
      hugetlb page faults are currently synchronized by the table of mutexes
      (htlb_fault_mutex_table).  fallocate code will need to synchronize with
      the page fault code when it allocates or deletes pages.  Expose
      interfaces so that fallocate operations can be synchronized with page
      faults.  Minor name changes to be more consistent with other global
      hugetlb symbols.
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c672c7f2
    • Mike Kravetz's avatar
      mm/hugetlb: add cache of descriptors to resv_map for region_add · 5e911373
      Mike Kravetz authored
      
      
      hugetlbfs is used today by applications that want a high degree of
      control over huge page usage.  Often, large hugetlbfs files are used to
      map a large number huge pages into the application processes.  The
      applications know when page ranges within these large files will no
      longer be used, and ideally would like to release them back to the
      subpool or global pools for other uses.  The fallocate() system call
      provides an interface for preallocation and hole punching within files.
      This patch set adds fallocate functionality to hugetlbfs.
      
      fallocate hole punch will want to remove a specific range of pages.
      When pages are removed, their associated entries in the region/reserve
      map will also be removed.  This will break an assumption in the
      region_chg/region_add calling sequence.  If a new region descriptor must
      be allocated, it is done as part of the region_chg processing.  In this
      way, region_add can not fail because it does not need to attempt an
      allocation.
      
      To prepare for fallocate hole punch, create a "cache" of descriptors
      that can be used by region_add if necessary.  region_chg will ensure
      there are sufficient entries in the cache.  It will be necessary to
      track the number of in progress add operations to know a sufficient
      number of descriptors reside in the cache.  A new routine region_abort
      is added to adjust this in progress count when add operations are
      aborted.  vma_abort_reservation is also added for callers creating
      reservations with vma_needs_reservation/vma_commit_reservation.
      
      [akpm@linux-foundation.org: fix typo in comment, use more cols]
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5e911373
  17. 17 Jul, 2015 1 commit
  18. 15 Apr, 2015 3 commits
    • Naoya Horiguchi's avatar
      mm: hugetlb: cleanup using paeg_huge_active() · 7e1f049e
      Naoya Horiguchi authored
      
      
      Now we have an easy access to hugepages' activeness, so existing helpers to
      get the information can be cleaned up.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7e1f049e
    • Mike Kravetz's avatar
      hugetlbfs: accept subpool min_size mount option and setup accordingly · 7ca02d0a
      Mike Kravetz authored
      
      
      Make 'min_size=<value>' be an option when mounting a hugetlbfs.  This
      option takes the same value as the 'size' option.  min_size can be
      specified without specifying size.  If both are specified, min_size must
      be less that or equal to size else the mount will fail.  If min_size is
      specified, then at mount time an attempt is made to reserve min_size
      pages.  If the reservation fails, the mount fails.  At umount time, the
      reserved pages are released.
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ca02d0a
    • Mike Kravetz's avatar
      hugetlbfs: add minimum size tracking fields to subpool structure · c6a91820
      Mike Kravetz authored
      
      
      hugetlbfs allocates huge pages from the global pool as needed.  Even if
      the global pool contains a sufficient number pages for the filesystem size
      at mount time, those global pages could be grabbed for some other use.  As
      a result, filesystem huge page allocations may fail due to lack of pages.
      
      Applications such as a database want to use huge pages for performance
      reasons.  hugetlbfs filesystem semantics with ownership and modes work
      well to manage access to a pool of huge pages.  However, the application
      would like some reasonable assurance that allocations will not fail due to
      a lack of huge pages.  At application startup time, the application would
      like to configure itself to use a specific number of huge pages.  Before
      starting, the application can check to make sure that enough huge pages
      exist in the system global pools.  However, there are no guarantees that
      those pages will be available when needed by the application.  What the
      application wants is exclusive use of a subset of huge pages.
      
      Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
      specified number of pages will be available for use by the filesystem.  At
      mount time, this number of huge pages will be reserved for exclusive use
      of the filesystem.  If there is not a sufficient number of free pages, the
      mount will fail.  As pages are allocated to and freeed from the
      filesystem, the number of reserved pages is adjusted so that the specified
      minimum is maintained.
      
      This patch (of 4):
      
      Add a field to the subpool structure to indicate the minimimum number of
      huge pages to always be used by this subpool.  This minimum count includes
      allocated pages as well as reserved pages.  If the minimum number of pages
      for the subpool have not been allocated, pages are reserved up to this
      minimum.  An additional field (rsv_hpages) is used to track the number of
      pages reserved to meet this minimum size.  The hstate pointer in the
      subpool is convenient to have when reserving and unreserving the pages.
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6a91820