1. 17 Jun, 2009 40 commits
    • Wu Fengguang's avatar
      mm: introduce PageHuge() for testing huge/gigantic pages · 20a0307c
      Wu Fengguang authored
      
      
      A series of patches to enhance the /proc/pagemap interface and to add a
      userspace executable which can be used to present the pagemap data.
      
      Export 10 more flags to end users (and more for kernel developers):
      
              11. KPF_MMAP            (pseudo flag) memory mapped page
              12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
              13. KPF_SWAPCACHE       page is in swap cache
              14. KPF_SWAPBACKED      page is swap/RAM backed
              15. KPF_COMPOUND_HEAD   (*)
              16. KPF_COMPOUND_TAIL   (*)
              17. KPF_HUGE		hugeTLB pages
              18. KPF_UNEVICTABLE     page is in the unevictable LRU list
              19. KPF_HWPOISON        hardware detected corruption
              20. KPF_NOPAGE          (pseudo flag) no page frame at the address
      
              (*) For compound pages, exporting _both_ head/tail info enables
                  users to tell where a compound page starts/ends, and its order.
      
      a simple demo of the page-types tool
      
      # ./page-types -h
      page-types [options]
                  -r|--raw                  Raw mode, for kernel developers
                  -a|--addr    addr-spec    Walk a range of pages
                  -b|--bits    bits-spec    Walk pages with specified bits
                  -l|--list                 Show page details in ranges
                  -L|--list-each            Show page details one by one
                  -N|--no-summary           Don't show summay info
                  -h|--help                 Show this usage message
      addr-spec:
                  N                         one page at offset N (unit: pages)
                  N+M                       pages range from N to N+M-1
                  N,M                       pages range from N to M-1
                  N,                        pages range from N to end
                  ,M                        pages range from 0 to M
      bits-spec:
                  bit1,bit2                 (flags & (bit1|bit2)) != 0
                  bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
                  bit1,~bit2                (flags & (bit1|bit2)) == bit1
                  =bit1,bit2                flags == (bit1|bit2)
      bit-names:
                locked              error         referenced           uptodate
                 dirty                lru             active               slab
             writeback            reclaim              buddy               mmap
             anonymous          swapcache         swapbacked      compound_head
         compound_tail               huge        unevictable           hwpoison
                nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
               private(r)       private_2(r)   owner_private(r)            arch(r)
              uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
            slub_debug(o)
                                         (r) raw mode bits  (o) overloaded bits
      
      # ./page-types
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          487369     1903  _________________________________
      0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000000000024              34        0  __R__l___________________________  referenced,lru
      0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x0000000000000040            8344       32  ______A__________________________  active
      0x0000000000000060               1        0  _____lA__________________________  lru,active
      0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             503        1  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types -r
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          468002     1828  _________________________________
      0x0000000100000000           19102       74  _____________________r___________  reserved
      0x0000000000008000              41        0  _______________H_________________  compound_head
      0x0000000000010000             188        0  ________________T________________  compound_tail
      0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
      0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
      0x0000000000000020               1        0  _____l___________________________  lru
      0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
      0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
      0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
      0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
      0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
      0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
      0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
      0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
      0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
      0x0000000800000040            8124       31  ______A_________________P________  active,private
      0x0000000000000040             219        0  ______A__________________________  active
      0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
      0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
      0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
      0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
      0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
      0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
      0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
      0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
      0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
      0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
      0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
      0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
      0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
      0x0000000000000400             538        2  __________B______________________  buddy
      0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
      0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
      0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
      0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
      0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
      0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
      0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
      0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
      0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
      0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
      0x0000000000001000             492        1  ____________a____________________  anonymous
      0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
      0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
      0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
      0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total          513968     2007
      
      # ./page-types --raw --list --no-summary --bits reserved
      offset  count   flags
      0       15      _____________________r___________
      31      4       _____________________r___________
      159     97      _____________________r___________
      4096    2067    _____________________r___________
      6752    2390    _____________________r___________
      9355    3       _____________________r___________
      9728    14526   _____________________r___________
      
      This patch:
      
      Introduce PageHuge(), which identifies huge/gigantic pages by their
      dedicated compound destructor functions.
      
      Also move prep_compound_gigantic_page() to hugetlb.c and make
      __free_pages_ok() non-static.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20a0307c
    • Mel Gorman's avatar
      mm: use alloc_pages_exact() in alloc_large_system_hash() to avoid duplicated logic · a1dd268c
      Mel Gorman authored
      
      
      alloc_large_system_hash() has logic for freeing pages at the end of an
      excessively large power-of-two buffer that is a duplicate of what is in
      alloc_pages_exact().  This patch converts alloc_large_system_hash() to use
      alloc_pages_exact().
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a1dd268c
    • Mel Gorman's avatar
      page allocator: sanity check order in the page allocator slow path · 72807a74
      Mel Gorman authored
      
      
      Callers may speculatively call different allocators in order of preference
      trying to allocate a buffer of a given size.  The order needed to allocate
      this may be larger than what the page allocator can normally handle.
      While the allocator mostly does the right thing, it should not direct
      reclaim or wakeup kswapd with a bogus order.  This patch sanity checks the
      order in the slow path and returns NULL if it is too large.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72807a74
    • KOSAKI Motohiro's avatar
      page allocator: move free_page_mlock() to page_alloc.c · 092cead6
      KOSAKI Motohiro authored
      
      
      Currently, free_page_mlock() is only called from page_alloc.c.  Thus, we
      can move it to page_alloc.c.
      
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      092cead6
    • Mel Gorman's avatar
      page allocator: slab: use nr_online_nodes to check for a NUMA platform · b6e68bc1
      Mel Gorman authored
      
      
      SLAB currently avoids checking a bitmap repeatedly by checking once and
      storing a flag.  When the addition of nr_online_nodes as a cheaper version
      of num_online_nodes(), this check can be replaced by nr_online_nodes.
      
      (Christoph did a patch that this is lifted almost verbatim from)
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6e68bc1
    • Christoph Lameter's avatar
      page allocator: use a pre-calculated value instead of num_online_nodes() in fast paths · 62bc62a8
      Christoph Lameter authored
      
      
      num_online_nodes() is called in a number of places but most often by the
      page allocator when deciding whether the zonelist needs to be filtered
      based on cpusets or the zonelist cache.  This is actually a heavy function
      and touches a number of cache lines.
      
      This patch stores the number of online nodes at boot time and updates the
      value when nodes get onlined and offlined.  The value is then used in a
      number of important paths in place of num_online_nodes().
      
      [rientjes@google.com: do not override definition of node_set_online() with macro]
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62bc62a8
    • Mel Gorman's avatar
      page allocator: get the pageblock migratetype without disabling interrupts · 974709bd
      Mel Gorman authored
      
      
      Local interrupts are disabled when freeing pages to the PCP list.  Part of
      that free checks what the migratetype of the pageblock the page is in but
      it checks this with interrupts disabled and interupts should never be
      disabled longer than necessary.  This patch checks the pagetype with
      interrupts enabled with the impact that it is possible a page is freed to
      the wrong list when a pageblock changes type.  As that block is now
      already considered mixed from an anti-fragmentation perspective, it's not
      of vital importance.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      974709bd
    • Mel Gorman's avatar
      page allocator: update NR_FREE_PAGES only as necessary · f2260e6b
      Mel Gorman authored
      
      
      When pages are being freed to the buddy allocator, the zone NR_FREE_PAGES
      counter must be updated.  In the case of bulk per-cpu page freeing, it's
      updated once per page.  This retouches cache lines more than necessary.
      Update the counters one per per-cpu bulk free.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f2260e6b
    • Mel Gorman's avatar
      page allocator: use allocation flags as an index to the zone watermark · 41858966
      Mel Gorman authored
      
      
      ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
      pages_min, pages_low or pages_high is used as the zone watermark when
      allocating the pages.  Two branches in the allocator hotpath determine
      which watermark to use.
      
      This patch uses the flags as an array index into a watermark array that is
      indexed with WMARK_* defines accessed via helpers.  All call sites that
      use zone->pages_* are updated to use the helpers for accessing the values
      and the array offsets for setting.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      41858966
    • Nick Piggin's avatar
      page allocator: do not check for compound pages during the page allocator sanity checks · a3af9c38
      Nick Piggin authored
      
      
      A number of sanity checks are made on each page allocation and free
      including that the page count is zero.  page_count() checks for compound
      pages and checks the count of the head page if true.  However, in these
      paths, we do not care if the page is compound or not as the count of each
      tail page should also be zero.
      
      This patch makes two changes to the use of page_count() in the free path.
      It converts one check of page_count() to a VM_BUG_ON() as the count should
      have been unconditionally checked earlier in the free path.  It also
      avoids checking for compound pages.
      
      [mel@csn.ul.ie: Wrote changelog]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarNick Piggin <nickpiggin@yahoo.com.au>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3af9c38
    • Mel Gorman's avatar
      page allocator: do not setup zonelist cache when there is only one node · d395b734
      Mel Gorman authored
      
      
      There is a zonelist cache which is used to track zones that are not in the
      allowed cpuset or found to be recently full.  This is to reduce cache
      footprint on large machines.  On smaller machines, it just incurs cost for
      no gain.  This patch only uses the zonelist cache when there are NUMA
      nodes.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d395b734
    • Mel Gorman's avatar
      page allocator: do not disable interrupts in free_page_mlock() · da456f14
      Mel Gorman authored
      
      
      free_page_mlock() tests and clears PG_mlocked using locked versions of the
      bit operations.  If set, it disables interrupts to update counters and
      this happens on every page free even though interrupts are disabled very
      shortly afterwards a second time.  This is wasteful.
      
      This patch splits what free_page_mlock() does.  The bit check is still
      made.  However, the update of counters is delayed until the interrupts are
      disabled and the non-lock version for clearing the bit is used.  One
      potential weirdness with this split is that the counters do not get
      updated if the bad_page() check is triggered but a system showing bad
      pages is getting screwed already.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: default avatarLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      da456f14
    • Mel Gorman's avatar
      page allocator: do not call get_pageblock_migratetype() more than necessary · ed0ae21d
      Mel Gorman authored
      
      
      get_pageblock_migratetype() is potentially called twice for every page
      free.  Once, when being freed to the pcp lists and once when being freed
      back to buddy.  When freeing from the pcp lists, it is known what the
      pageblock type was at the time of free so use it rather than rechecking.
      In low memory situations under memory pressure, this might skew
      anti-fragmentation slightly but the interference is minimal and decisions
      that are fragmenting memory are being made anyway.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed0ae21d
    • Mel Gorman's avatar
      page allocator: inline __rmqueue_fallback() · 0ac3a409
      Mel Gorman authored
      
      
      __rmqueue_fallback() is in the slow path but has only one call site.
      Because there is only one call-site, this function can then be inlined
      without causing text bloat.  On an x86-based config, it made no difference
      as the savings were padded out by NOP instructions.  Milage varies but
      text will either decrease in size or remain static.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ac3a409
    • Mel Gorman's avatar
      page allocator: inline buffered_rmqueue() · 0a15c3e9
      Mel Gorman authored
      
      
      buffered_rmqueue() is in the fast path so inline it.  Because it only has
      one call site, this function can then be inlined without causing text
      bloat.  On an x86-based config, it made no difference as the savings were
      padded out by NOP instructions.  Milage varies but text will either
      decrease in size or remain static.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a15c3e9
    • Mel Gorman's avatar
      page allocator: inline __rmqueue_smallest() · 728ec980
      Mel Gorman authored
      
      
      Inline __rmqueue_smallest by altering flow very slightly so that there is
      only one call site.  Because there is only one call-site, this function
      can then be inlined without causing text bloat.  On an x86-based config,
      this patch reduces text by 16 bytes.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      728ec980
    • Mel Gorman's avatar
      page allocator: remove a branch by assuming __GFP_HIGH == ALLOC_HIGH · a56f57ff
      Mel Gorman authored
      
      
      Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag.  If these
      flags are equal to each other, we can eliminate a branch.
      
      [akpm@linux-foundation.org: Suggested the hack]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a56f57ff
    • Peter Zijlstra's avatar
      page allocator: calculate the alloc_flags for allocation only once · 341ce06f
      Peter Zijlstra authored
      
      
      Factor out the mapping between GFP and alloc_flags only once.  Once
      factored out, it only needs to be calculated once but some care must be
      taken.
      
      [neilb@suse.de says]
      As the test:
      
      -       if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
      -                       && !in_interrupt()) {
      -               if (!(gfp_mask & __GFP_NOMEMALLOC)) {
      
      has been replaced with a slightly weaker one:
      
      +       if (alloc_flags & ALLOC_NO_WATERMARKS) {
      
      Without care, this would allow recursion into the allocator via direct
      reclaim.  This patch ensures we do not recurse when PF_MEMALLOC is set but
      TF_MEMDIE callers are now allowed to directly reclaim where they would
      have been prevented in the past.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      341ce06f
    • Mel Gorman's avatar
      page allocator: calculate the migratetype for allocation only once · 3dd28266
      Mel Gorman authored
      
      
      GFP mask is converted into a migratetype when deciding which pagelist to
      take a page from.  However, it is happening multiple times per allocation,
      at least once per zone traversed.  Calculate it once.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3dd28266
    • Mel Gorman's avatar
      page allocator: calculate the preferred zone for allocation only once · 5117f45d
      Mel Gorman authored
      
      
      get_page_from_freelist() can be called multiple times for an allocation.
      Part of this calculates the preferred_zone which is the first usable zone
      in the zonelist but the zone depends on the GFP flags specified at the
      beginning of the allocation call.  This patch calculates preferred_zone
      once.  It's safe to do this because if preferred_zone is NULL at the start
      of the call, no amount of direct reclaim or other actions will change the
      fact the allocation will fail.
      
      [akpm@linux-foundation.org: remove (void) casts]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5117f45d
    • Mel Gorman's avatar
      page allocator: move check for disabled anti-fragmentation out of fastpath · 49255c61
      Mel Gorman authored
      
      
      On low-memory systems, anti-fragmentation gets disabled as there is
      nothing it can do and it would just incur overhead shuffling pages between
      lists constantly.  Currently the check is made in the free page fast path
      for every page.  This patch moves it to a slow path.  On machines with low
      memory, there will be small amount of additional overhead as pages get
      shuffled between lists but it should quickly settle.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49255c61
    • Mel Gorman's avatar
      page allocator: break up the allocator entry point into fast and slow paths · 11e33f6a
      Mel Gorman authored
      
      
      The core of the page allocator is one giant function which allocates
      memory on the stack and makes calculations that may not be needed for
      every allocation.  This patch breaks up the allocator path into fast and
      slow paths for clarity.  Note the slow paths are still inlined but the
      entry is marked unlikely.  If they were not inlined, it actally increases
      text size to generate the as there is only one call site.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11e33f6a
    • Mel Gorman's avatar
      page allocator: check only once if the zonelist is suitable for the allocation · 7f82af97
      Mel Gorman authored
      
      
      It is possible with __GFP_THISNODE that no zones are suitable.  This patch
      makes sure the check is only made once.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f82af97
    • Mel Gorman's avatar
      page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman authored
      
      
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
    • Mel Gorman's avatar
      page allocator: do not sanity check order in the fast path · b3c466ce
      Mel Gorman authored
      
      
      No user of the allocator API should be passing in an order >= MAX_ORDER
      but we check for it on each and every allocation.  Delete this check and
      make it a VM_BUG_ON check further down the call path.
      
      [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3c466ce
    • Mel Gorman's avatar
      page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask() · d239171e
      Mel Gorman authored
      
      
      The start of a large patch series to clean up and optimise the page
      allocator.
      
      The performance improvements are in a wide range depending on the exact
      machine but the results I've seen so fair are approximately;
      
      kernbench:	0	to	 0.12% (elapsed time)
      		0.49%	to	 3.20% (sys time)
      aim9:		-4%	to	30% (for page_test and brk_test)
      tbench:		-1%	to	 4%
      hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
      netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
      netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)
      
      I haven't sysbench figures at hand, but previously they were within the
      -0.5% to 2% range.
      
      On netperf, the client and server were bound to opposite number CPUs to
      maximise the problems with cache line bouncing of the struct pages so I
      expect different people to report different results for netperf depending
      on their exact machine and how they ran the test (different machines, same
      cpus client/server, shared cache but two threads client/server, different
      socket client/server etc).
      
      I also measured the vmlinux sizes for a single x86-based config with
      CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM.  The core of the
      .config is based on the Debian Lenny kernel config so I expect it to be
      reasonably typical.
      
      This patch:
      
      __alloc_pages_internal is the core page allocator function but essentially
      it is an alias of __alloc_pages_nodemask.  Naming a publicly available and
      exported function "internal" is also a big ugly.  This patch renames
      __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
      nodemask function.
      
      Warning - This patch renames an exported symbol.  No kernel driver is
      affected by external drivers calling __alloc_pages_internal() should
      change the call to __alloc_pages_nodemask() without any alteration of
      parameters.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d239171e
    • Hugh Dickins's avatar
      mm: alloc_large_system_hash check order · 6c0db466
      Hugh Dickins authored
      
      
      On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
      to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
      order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
      had better make its own check on the order.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c0db466
    • Miao Xie's avatar
      cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie authored
      
      
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the code of
      memory policy.  Because the memory policy is applied in the process's
      context originally.  After applying this patch, one task directly
      manipulates anothers mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we didn't use lock to protect them, because adding a
      lock may lead to performance regression.  But if we don't add a lock,the
      task might see no nodes when changing cpuset's mems_allowed to some
      non-overlapping set.  In order to avoid it, we set all new allowed nodes,
      then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mpolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58568d2a
    • Miao Xie's avatar
      cpusets: update tasks' page/slab spread flags in time · 950592f7
      Miao Xie authored
      
      
      Fix the bug that the kernel didn't spread page cache/slab object evenly
      over all the allowed nodes when spread flags were set by updating tasks'
      page/slab spread flags in time.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      950592f7
    • Miao Xie's avatar
      cpusets: restructure the function cpuset_update_task_memory_state() · f3b39d47
      Miao Xie authored
      
      
      The kernel still allocates the page caches on old node after modifying its
      cpuset's mems when 'memory_spread_page' was set, or it didn't spread the
      page cache evenly over all the nodes that faulting task is allowed to usr
      after memory_spread_page was set.  it is caused by the old mem_allowed and
      flags of the task, the current kernel doesn't updates them unless some
      function invokes cpuset_update_task_memory_state(), it is too late
      sometimes.We must update the mem_allowed and the flags of the tasks in
      time.
      
      Slab has the same problem.
      
      The following patches fix this bug by updating tasks' mem_allowed and
      spread flag after its cpuset's mems or spread flag is changed.
      
      This patch:
      
      Extract a function from cpuset_update_task_memory_state().  It will be
      used later for update tasks' page/slab spread flags after its cpuset's
      flag is set
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3b39d47
    • H Hartley Sweeten's avatar
      mm/page-writeback.c: dirty limit type should be unsigned long · dcf975d5
      H Hartley Sweeten authored
      
      
      get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
      with variable pbdi_dirty as one of the arguments.  This variable is an
      unsigned long * but both functions expect it to be a long *.  This causes
      the following sparse warnings:
      
        warning: incorrect type in argument 3 (different signedness)
           expected long *pbdi_dirty
           got unsigned long *pbdi_dirty
        warning: incorrect type in argument 2 (different signedness)
           expected long *pdirty
           got unsigned long *pbdi_dirty
      
      Fix the warnings by changing the long * to unsigned long * in both
      functions.
      Signed-off-by: default avatarH Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcf975d5
    • KOSAKI Motohiro's avatar
      vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC · 78dc583d
      KOSAKI Motohiro authored
      Commit 33c120ed ("more aggressively use
      lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
      both active and inactive pages for asynchronous lumpy reclaim on
      costly-high-order pages and for cheap-high-order when memory pressure is
      high.  However, if the system is under heavy pressure and there are dirty
      pages, asynchronous IO may not be sufficient to reclaim a suitable page in
      time.
      
      This patch causes the caller to enter synchronous lumpy reclaim for
      costly-high-order pages and for cheap-high-order pages when under memory
      pressure.
      
      Minchan.kim@gmail.com said:
      
      Andy added synchronous lumpy reclaim with
      c661b078.  At that time, lumpy reclaim is
      not agressive.  His intension is just for high-order users.(above
      PAGE_ALLOC_COSTLY_ORDER).
      
      After some time, Rik added aggressive lumpy reclaim with
      33c120ed
      
      .  His intention was to do lumpy
      reclaim when high-order users and trouble getting a small set of
      contiguous pages.
      
      So we also have to add synchronous pageout for small set of contiguous
      pages.
      
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <Minchan.kim@gmail.com>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78dc583d
    • Nick Piggin's avatar
      mm: clean up get_user_pages_fast() documentation · d2bf6be8
      Nick Piggin authored
      
      
      Move more documentation for get_user_pages_fast into the new kerneldoc comment.
      Add some comments for get_user_pages as well.
      
      Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
      there initially because it was once a static inline function.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2bf6be8
    • Wu Fengguang's avatar
      readahead: enforce full sync mmap readahead size · 7ffc59b4
      Wu Fengguang authored
      
      
      Now that we do readahead for sequential mmap reads, here is a simple
      evaluation of the impacts, and one further optimization.
      
      It's an NFS-root debian desktop system, readahead size = 60 pages.
      The numbers are grabbed after a fresh boot into console.
      
      approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
         A            383             31.6%           383             11
         B            225             32.4%           390             11
         C            224             32.6%           307             13
      
      case A: mmap sync/async readahead disabled
      case B: mmap sync/async readahead enabled, with enforced full async readahead size
      case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
      or:
      A = vanilla 2.6.30-rc1
      B = A plus mmap readahead
      C = B plus this patch
      
      The numbers show that
      - there are good possibilities for random mmap reads to trigger readahead
      - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
      - case C can further reduce IO count by 1/4
      - readahead miss ratios are not quite affected
      
      The theory is
      - readahead is _good_ for clustered random reads, and can perform
        _better_ than readaround because they could be _async_.
      - async readahead size is guaranteed to be larger than readaround
        size, and they are _async_, hence will mostly behave better
      However for B
      - sync readahead size could be smaller than readaround size, hence may
        make things worse by produce more smaller IOs
      which will be fixed by this patch.
      
      Final conclusion:
      - mmap readahead reduced major faults by 1/3 and no obvious overheads;
      - mmap io can be further reduced by 1/4 with this patch.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ffc59b4
    • Wu Fengguang's avatar
    • Wu Fengguang's avatar
      readahead: introduce context readahead algorithm · 10be0b37
      Wu Fengguang authored
      Introduce page cache context based readahead algorithm.
      This is to better support concurrent read streams in general.
      
      RATIONALE
      ---------
      The current readahead algorithm detects interleaved reads in a _passive_ way.
      Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
      By checking for (offset == prev_offset + 1), it will discover the sequentialness
      between 3,4 and between 1004,1005, and start doing sequential readahead for the
      individual streams since page 4 and page 1005.
      
      The context readahead algorithm guarantees to discover the sequentialness no
      matter how the streams are interleaved. For the above example, it will start
      sequential readahead since page 2 and 1002.
      
      The trick is to poke for page @offset-1 in the page cache when it has no other
      clues on the sequentialness of request @offset: if the current requenst belongs
      to a sequential stream, that stream must have accessed page @offset-1 recently,
      and the page will still be cached now. So if page @offset-1 is there, we can
      take request @offset as a sequential access.
      
      BENEFICIARIES
      -------------
      - strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
        the current readahead will take them as silly random reads;
        the context readahead will take them as two sequential streams.
      
      - cooperative IO processes   i.e. NFS and SCST
        They create a thread pool, farming off (sequential) IO requests to different
        threads which will be performing interleaved IO.
      
        It was not easy(or possible) to reliably tell from file->f_ra all those
        cooperative processes working on the same sequential stream, since they will
        have different file->f_ra instances. And NFSD's file->f_ra is particularly
        unusable, since their file objects are dynamically created for each request.
        The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
      
        The new scheme is to detect the sequential pattern via looking up the page
        cache, which provides one single and consistent view of the pages recently
        accessed. That makes sequential detection for cooperative processes possible.
      
      USER REPORT
      -----------
      Vladislav recommends the addition of context readahead as a result of his SCST
      benchmarks. It leads to 6%~40% performance gains in various cases and achieves
      equal performance in others.                http://lkml.org/lkml/2009/3/19/239
      
      
      
      OVERHEADS
      ---------
      In theory, it introduces one extra page cache lookup per random read.  However
      the below benchmark shows context readahead to be slightly faster, wondering..
      
      Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
      each block size. The average throughputs are:
      
                             	original ra	context ra	gain
       4K random reads:	 65.561MB/s	 65.648MB/s	+0.1%
      16K random reads:	124.767MB/s	124.951MB/s	+0.1%
      64K random reads: 	162.123MB/s	162.278MB/s	+0.1%
      
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: default avatarVladislav Bolkhovitin <vst@vlnb.net>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10be0b37
    • Wu Fengguang's avatar
      readahead: move the random read case to bottom · 045a2529
      Wu Fengguang authored
      
      
      Split all readahead cases, and move the random one to bottom.
      
      No behavior changes.
      
      This is to prepare for the introduction of context readahead, and make it
      easy for inserting accounting/tracing points for each case.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      045a2529
    • Wu Fengguang's avatar
      radix-tree: add radix_tree_prev_hole() · dc566127
      Wu Fengguang authored
      
      
      The counterpart of radix_tree_next_hole(). To be used by context readahead.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc566127
    • Wu Fengguang's avatar
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang authored
      
      
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dumped by the way.  Users will be
      pretty sensitive about the slow loading of executables.  So it's
      unfavorable to disabled mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • Wu Fengguang's avatar
      readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
      Wu Fengguang authored
      
      
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
      In general, the PG_readahead flag will also be hit in cases
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2fad6f5d