1. 20 Oct, 2008 6 commits
    • vmscan: fix pagecache reclaim referenced bit check · 7e9cd484
      Rik van Riel authored
      
      
      Moving referenced pages back to the head of the active list creates a huge
      scalability problem, because by the time a large memory system finally
      runs out of free memory, every single page in the system will have been
      referenced.
      
      Not only do we not have the time to scan every single page on the active
      list, but since they will all have the referenced bit set, that bit
      conveys no useful information.
      
      A more scalable solution is to just move every page that hits the end of
      the active list to the inactive list.
      
      We clear the referenced bit off of mapped pages, which need just one
      reference to be moved back onto the active list.
      
      Unmapped pages will be moved back to the active list after two references
      (see mark_page_accessed).  We preserve the PG_referenced flag on unmapped
      pages to preserve accesses that were made while the page was on the active
      list.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
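The deactivation rule described in commit 7e9cd484 can be sketched roughly as follows. This is a toy Python model, not the kernel code: the `Page` class, its `mapped` and `referenced` fields, and `deactivate_from_active_tail()` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for a page; the field names are illustrative assumptions."""
    mapped: bool              # True if the page is mapped into a process
    referenced: bool = False  # models the referenced bit / PG_referenced


def deactivate_from_active_tail(page, inactive_list):
    """Move a page from the tail of the active list to the inactive list.

    The page is moved regardless of its referenced state. Mapped pages lose
    the referenced bit, so a single new reference can reactivate them.
    Unmapped pages keep it, preserving accesses made while they were active.
    """
    if page.mapped:
        page.referenced = False
    inactive_list.append(page)
    return page


inactive = []
mapped_page = deactivate_from_active_tail(Page(mapped=True, referenced=True), inactive)
unmapped_page = deactivate_from_active_tail(Page(mapped=False, referenced=True), inactive)
print(mapped_page.referenced, unmapped_page.referenced)
```

The point of the sketch is that no page is rotated back to the head of the active list, so the work done at the tail no longer depends on how many pages have the bit set.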
    • vmscan: second chance replacement for anonymous pages · 556adecb
      Rik van Riel authored
      
      
      We avoid evicting and scanning anonymous pages for the most part, but
      under some workloads we can end up with most of memory filled with
      anonymous pages.  At that point, we suddenly need to clear the referenced
      bits on all of memory, which can take ages on very large memory systems.
      
      We can reduce the maximum number of pages that need to be scanned by not
      taking the referenced state into account when deactivating an anonymous
      page.  After all, every anonymous page starts out referenced, so why
      check?
      
      If an anonymous page gets referenced again before it reaches the end of
      the inactive list, we move it back to the active list.
      
      To keep the maximum amount of necessary work reasonable, we scale the
      active to inactive ratio with the size of memory, using the formula
      active:inactive ratio = sqrt(memory in GB * 10).
      
      Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
      instead of by the amount of memory present in the system.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
      [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
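To get a feel for the formula in commit 556adecb, active:inactive ratio = sqrt(memory in GB * 10), here is a small worked sketch. Rounding down to an integer and clamping the result to at least 1 are assumptions of this example; the commit message only states the formula. The function name `inactive_ratio()` is likewise illustrative.

```python
import math


def inactive_ratio(memory_gb):
    """Return the active:inactive ratio sqrt(memory in GB * 10).

    Integer rounding and the minimum of 1 are assumptions of this sketch.
    """
    return max(1, math.isqrt(10 * memory_gb))


for gb in (1, 10, 100, 1024):
    ratio = inactive_ratio(gb)
    # With an active:inactive ratio of r, about 1/(r + 1) of the anonymous
    # pages sit on the inactive list.
    print(f"{gb:5d} GB -> ratio {ratio}:1, inactive share {1 / (ratio + 1):.1%}")
```

Because the ratio grows with the square root of memory size, the inactive list (and with it the number of pages that must be scanned once swapping starts) grows much more slowly than total memory.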
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel authored
      
      
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
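The split described in commit 4f98a2fe can be illustrated with a toy model. The `lru_set_for()` helper, the `swap_backed` flag, and the string "pages" are assumptions made for this sketch, not kernel interfaces.

```python
from collections import deque


def lru_set_for(swap_backed):
    """Pick the LRU set for a page.

    Pages backed by memory and swap (including tmpfs) go to "anon";
    pages backed by a real file system go to "file".
    """
    return "anon" if swap_backed else "file"


# One queue per set; in this toy model each page is just a descriptive string.
lru = {"anon": deque(), "file": deque()}

pages = [
    ("heap page", True),
    ("tmpfs page", True),
    ("page cache page", False),
]
for name, swap_backed in pages:
    lru[lru_set_for(swap_backed)].append(name)

# Looking for page cache to evict now only touches the "file" set.
print(list(lru["file"]))
```

With separate sets, a search for evictable page cache never has to walk past anonymous pages.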
    • vmscan: free swap space on swap-in/activation · 68a22394
      Rik van Riel authored
      
      
      If vm_swap_full() is true (swap space is more than 50% full), the system
      will free swap space at swapin time.  With this patch, the system will also
      free the swap space in the pageout code, when we decide that the page is not
      a candidate for swapout (and is just wasting swap space).
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
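The "more than 50% full" condition from commit 68a22394 can be written as a one-line check. The sketch below is a Python model under stated assumptions: the parameter names and the `should_free_swap_slot()` helper are hypothetical and only mirror the wording of the commit message.

```python
def vm_swap_full(free_swap_pages, total_swap_pages):
    """Return True when more than half of the swap space is in use."""
    return free_swap_pages * 2 < total_swap_pages


def should_free_swap_slot(has_swap_slot, swapout_candidate,
                          free_swap_pages, total_swap_pages):
    """Model of the pageout-side decision described in the commit message:
    a page that holds a swap slot but is not going to be swapped out gives
    the slot back once swap is more than 50% full."""
    return (has_swap_slot
            and not swapout_candidate
            and vm_swap_full(free_swap_pages, total_swap_pages))


print(vm_swap_full(40, 100), vm_swap_full(60, 100))
```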
    • vmscan: Use an indexed array for LRU variables · b69408e8
      Christoph Lameter authored
      
      
      Currently we are defining explicit variables for the inactive and active
      list.  An indexed array can be more generic and avoid repeating similar
      code in several places in the reclaim code.
      
      We are saving a few bytes in terms of code size:
      
      Before:
      
         text    data     bss     dec     hex filename
      4097753  573120 4092484 8763357  85b7dd vmlinux
      
      After:
      
         text    data     bss     dec     hex filename
      4097729  573120 4092484 8763333  85b7c5 vmlinux
      
      Having an easy way to add new lru lists may ease future work on the
      reclaim code.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
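The idea behind commit b69408e8, replacing separate "active" and "inactive" variables with one array indexed by list type, might look like this in a toy Python model. The `LruList` enum, `lists`, and `add_page()` are illustrative names, not the kernel's definitions.

```python
from enum import IntEnum


class LruList(IntEnum):
    """Index of each LRU list; adding a new list means adding one member."""
    INACTIVE = 0
    ACTIVE = 1


NR_LRU_LISTS = len(LruList)

# One list and one counter per LRU type instead of explicit variables.
lists = [[] for _ in range(NR_LRU_LISTS)]
nr_pages = [0] * NR_LRU_LISTS


def add_page(lru, page):
    """Add a page to the LRU list selected by index."""
    lists[lru].append(page)
    nr_pages[lru] += 1


add_page(LruList.ACTIVE, "page A")
add_page(LruList.INACTIVE, "page B")
add_page(LruList.INACTIVE, "page C")

# Generic code can now loop over every list instead of repeating itself.
for lru in LruList:
    print(lru.name, nr_pages[lru])
```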
    • vmscan: move isolate_lru_page() to vmscan.c · 62695a84
      Nick Piggin authored
      
      
      On large memory systems, the VM can spend way too much time scanning
      through pages that it cannot (or should not) evict from memory.  Not only
      does it use up CPU time, but it also provokes lock contention and can
      leave large systems under memory pressure in a catatonic state.
      
      This patch series improves VM scalability by:
      
      1) putting filesystem backed, swap backed and unevictable pages
         onto their own LRUs, so the system only scans the pages that it
         can/should evict from memory
      
      2) switching to two handed clock replacement for the anonymous LRUs,
         so the number of pages that need to be scanned when the system
         starts swapping is bound to a reasonable number
      
      3) keeping unevictable pages off the LRU completely, so the
         VM does not waste CPU time scanning them. ramfs, ramdisk,
         SHM_LOCKED shared memory segments and mlock()ed VMA pages
         are kept on the unevictable list.
      
      This patch:
      
      isolate_lru_page logically belongs in vmscan.c rather than migrate.c.

      It is a tough call, because we don't need that function without memory
      migration, so there is a valid argument for keeping it in migrate.c.
      However, a subsequent patch needs to make use of it in the core mm, so we
      can happily move it to vmscan.c.
      
      Also, make the function a little more generic by not requiring that it
      adds an isolated page to a given list.  Callers can do that.
      
      	Note that we now have '__isolate_lru_page()', that does
      	something quite different, visible outside of vmscan.c
      	for use with memory controller.  Methinks we need to
      	rationalize these names/purposes.	--lts
      
      [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 05 Aug, 2008 1 commit
  3. 30 Jul, 2008 2 commits
  4. 26 Jul, 2008 2 commits
  5. 25 Jul, 2008 1 commit
    • per-task-delay-accounting: add memory reclaim delay · 873b4771
      Keika Kobayashi authored
      
      
      Application response times can degrade under heavy memory load, because
      applications spend time reclaiming memory.  Statistics on how long memory
      reclaim takes are useful when analyzing memory usage.

      This patch adds memory reclaim accounting to per-task-delay-accounting,
      measuring the time spent in do_try_to_free_pages().
      
      Example:

      - When the system is under low memory load,
        memory reclaim may not occur.
      
      $ free
                   total       used       free     shared    buffers     cached
      Mem:       8197800    1577300    6620500          0       4808    1516724
      -/+ buffers/cache:      55768    8142032
      Swap:     16386292          0   16386292
      
      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
       0  0      0 5069748  10612 3014060    0    0     0     0    3   26  0  0 100  0
       0  0      0 5069748  10612 3014060    0    0     0     0    4   22  0  0 100  0
       0  0      0 5069748  10612 3014060    0    0     0     0    3   18  0  0 100  0
      
      Measure the time taken by the tar command.
      
      $ ls -s test.dat
      1501472 test.dat
      
      $ time tar cvf test.tar test.dat
      real    0m13.388s
      user    0m0.116s
      sys     0m5.304s
      
      $ ./delayget -d -p <pid>
      CPU             count     real total  virtual total    delay total
                        428     5528345500     5477116080       62749891
      IO              count    delay total
                        338     8078977189
      SWAP            count    delay total
                          0              0
      RECLAIM         count    delay total
                          0              0
      
      - When the system is under heavy memory load,
        memory reclaim may occur.
      
      $ vmstat 1
      procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
       0  0 7159032  49724   1812   3012    0    0     0     0    3   24  0  0 100  0
       0  0 7159032  49724   1812   3012    0    0     0     0    4   24  0  0 100  0
       0  0 7159032  49848   1812   3012    0    0     0     0    3   22  0  0 100  0
      
      In this case, one process uses more than 8G of memory
      by calling malloc() and memset().
      
      $ time tar cvf test.tar test.dat
      real    1m38.563s        <-  increased by 85 sec
      user    0m0.140s
      sys     0m7.060s
      
      $ ./delayget -d -p <pid>
      CPU             count     real total  virtual total    delay total
                       9021     7140446250     7315277975      923201824
      IO              count    delay total
                       8965    90466349669
      SWAP            count    delay total
                          3       21036367
      RECLAIM         count    delay total
                        740    61011951153
      
      In the latter case, the RECLAIM value increases.
      So, taskstats can show how much memory reclaim influences turnaround time (TAT).
      Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
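The figures quoted in commit 873b4771 are easier to compare after a little arithmetic. The sketch below assumes the `delayget` delay totals are in nanoseconds, which the commit message does not state; the variable names are illustrative.

```python
NSEC_PER_SEC = 1_000_000_000  # assumption: delay totals are in nanoseconds

# Wall-clock times of the two tar runs, taken from the "real" lines.
light_load_s = 13.388
heavy_load_s = 1 * 60 + 38.563
slowdown_s = heavy_load_s - light_load_s

# RECLAIM line of the heavy-load run.
reclaim_count = 740
reclaim_delay_ns = 61011951153

reclaim_delay_s = reclaim_delay_ns / NSEC_PER_SEC
avg_reclaim_ms = reclaim_delay_ns / reclaim_count / 1_000_000

print(f"slowdown:            {slowdown_s:.3f} s")
print(f"total reclaim delay: {reclaim_delay_s:.3f} s")
print(f"average per reclaim: {avg_reclaim_ms:.1f} ms")
```

Under that assumption the task spent roughly 61 seconds in reclaim across 740 events. The delay categories may overlap (for example, reclaim can itself wait on IO), so the totals should not simply be added together.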
  6. 13 Jun, 2008 1 commit
  7. 30 Apr, 2008 1 commit
  8. 29 Apr, 2008 1 commit
    • page allocator: smarter retry of costly-order allocations · a41f24ea
      Nishanth Aravamudan authored
      
      
      Because of page order checks in __alloc_pages(), hugepage (and similarly
      large order) allocations will not retry unless explicitly marked
      __GFP_REPEAT. However, the current retry logic is nearly an infinite
      loop (or until reclaim makes no progress whatsoever). For these costly
      allocations, that seems like overkill and could potentially never
      terminate. Mel observed that allowing current __GFP_REPEAT semantics for
      hugepage allocations essentially killed the system. I believe this is
      because we may continue to reclaim small orders of pages all over, but
      never have enough to satisfy the hugepage allocation request. This is
      clearly only a problem for large order allocations, of which hugepages
      are the most obvious (to me).
      
      Modify try_to_free_pages() to indicate how many pages were reclaimed.
      Use that information in __alloc_pages() to eventually fail a large
      __GFP_REPEAT allocation when we've reclaimed an order of pages equal to
      or greater than the allocation's order. This relies on lumpy reclaim
      functioning as advertised. Due to fragmentation, lumpy reclaim may not
      be able to free up the order needed in one invocation, so multiple
      iterations may be required. In other words, the more fragmented memory
      is, the more retry attempts __GFP_REPEAT will make (particularly for
      higher order allocations).
      
      This changes the semantics of __GFP_REPEAT subtly, but *only* for
      allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size
      allocations, we will try up to some point (at least 1<<order reclaimed
      pages), rather than forever (which is the case for allocations <=
      PAGE_ALLOC_COSTLY_ORDER).
      
      This change improves the /proc/sys/vm/nr_hugepages interface with a
      follow-on patch that makes pool allocations use __GFP_REPEAT. Rather
      than administrators repeatedly echo'ing a particular value into the
      sysctl, and forcing reclaim into action manually, this change allows for
      the sysctl to attempt a reasonable effort itself. Similarly, dynamic
      pool growth should be more successful under load, as lumpy reclaim can
      try to free up pages, rather than failing right away.
      
      Choosing to reclaim only up to the order of the requested allocation
      strikes a balance between not failing hugepage allocations and returning
      to the caller when it's unlikely to ever succeed. Because of lumpy
      reclaim, if we have freed the order requested, hopefully it has been in
      big chunks and those chunks will allow our allocation to succeed. If
      that isn't the case after freeing up the current order, I don't think it
      is likely to succeed in the future, although it is possible given a
      particular fragmentation pattern.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
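The retry rule described in commit a41f24ea can be modelled as a small predicate. This is a simplified sketch, not the allocator code: it ignores every other GFP flag, the function and parameter names are hypothetical, and the value 3 for `PAGE_ALLOC_COSTLY_ORDER` is an assumption not stated in the commit message.

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # assumed value for this sketch


def should_retry(order, gfp_repeat, pages_reclaimed):
    """Decide whether a failed allocation attempt should be retried.

    order           -- allocation order (the request is for 1 << order pages)
    gfp_repeat      -- True if the request was marked __GFP_REPEAT
    pages_reclaimed -- pages reclaimed so far on behalf of this request
    """
    if order <= PAGE_ALLOC_COSTLY_ORDER:
        return True  # small orders keep retrying, as before
    if gfp_repeat:
        # Costly orders give up once at least 1 << order pages were reclaimed.
        return pages_reclaimed < (1 << order)
    return False  # costly orders without __GFP_REPEAT do not retry


# An order-9 request needs 512 pages.
print(should_retry(9, True, 511), should_retry(9, True, 512))
```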
  9. 28 Apr, 2008 3 commits
    • mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman authored
      
      
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a subtraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
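The data structure introduced by commit dd1a239f pairs a zone pointer with a cached zone index. A rough Python analogue follows; the class and helper names are illustrative and the fields of `Zone` are assumptions for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    node: int  # NUMA node the zone belongs to


@dataclass
class ZoneRef:
    """One zonelist entry: the zone plus its index, stored side by side so
    that filtering does not need an extra lookup and subtraction."""
    zone: Zone
    zone_idx: int


def zoneref_zone_idx(zref):
    """Helper returning the cached zone index of a zonelist entry."""
    return zref.zone_idx


def zoneref_node_idx(zref):
    """Helper returning the node index, read through the zone."""
    return zref.zone.node


zonelist = [
    ZoneRef(Zone("Normal", node=0), zone_idx=1),
    ZoneRef(Zone("DMA", node=0), zone_idx=0),
    ZoneRef(Zone("Normal", node=1), zone_idx=1),
]
print([(zoneref_node_idx(z), zoneref_zone_idx(z)) for z in zonelist])
```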
    • mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman authored
      
      
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro called for_each_zone_zonelist() is introduced that iterates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
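Filtering a single zonelist by what the GFP flags allow, as commit 54a6eb5c describes, can be sketched with a generator. The zone index values, the tuple layout, and the "index at or below a limit" test are assumptions of this example; only the name `for_each_zone_zonelist` comes from the commit message.

```python
# Zone indexes assumed for this sketch, lowest to highest.
ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM = 0, 1, 2


def for_each_zone_zonelist(zonelist, highest_zoneidx):
    """Yield the zones of a zonelist whose index is allowed.

    zonelist        -- list of (zone name, zone index), ordered by distance
    highest_zoneidx -- highest zone index the GFP flags permit
    """
    for zone_name, zone_idx in zonelist:
        if zone_idx <= highest_zoneidx:
            yield zone_name


# A fallback zonelist for node 0: local zones first, then the remote node.
node0_zonelist = [
    ("node0/HighMem", ZONE_HIGHMEM),
    ("node0/Normal", ZONE_NORMAL),
    ("node0/DMA", ZONE_DMA),
    ("node1/HighMem", ZONE_HIGHMEM),
    ("node1/Normal", ZONE_NORMAL),
]

normal_or_lower = list(for_each_zone_zonelist(node0_zonelist, ZONE_NORMAL))
print(normal_or_lower)
```

One list per node serves every allocation type here; the filter replaces the separate per-zone-type lists.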
    • mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman authored
      
      
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The fourth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 19 Apr, 2008 1 commit
    • nodemask: use new node_to_cpumask_ptr function · c5f59f08
      Mike Travis authored
      
      
        * Use new node_to_cpumask_ptr.  This creates a pointer to the
          cpumask for a given node.  This definition is in mm patch:
      
      	asm-generic-add-node_to_cpumask_ptr-macro.patch
      
        * Use new set_cpus_allowed_ptr function.
      
      Depends on:
      	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
      	[sched-devel]: sched: add new set_cpus_allowed_ptr function
      	[x86/latest]: x86: add cpus_scnprintf function
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Greg Banks <gnb@melbourne.sgi.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 25 Mar, 2008 1 commit
  12. 05 Mar, 2008 2 commits
  13. 07 Feb, 2008 7 commits
    • per-zone and reclaim enhancements for memory controller: modifies vmscan.c for isolate global/cgroup lru activity · 1cfb419b
      KAMEZAWA Hiroyuki authored
      
      When using memory controller, there are 2 levels of memory reclaim.
       1. zone memory reclaim because of system/zone memory shortage.
       2. memory cgroup memory reclaim because of hitting limit.
      
      These two can be distinguished by sc->mem_cgroup parameter.
      (scan_global_lru() macro)
      
      This patch tries to keep the memory cgroup reclaim routine from affecting
      system/zone memory reclaim.  It inserts if (scan_global_lru()) checks and
      hooks into the memory cgroup reclaim support functions.

      This patch helps isolate system lru activity from group lru activity and
      shows what additional functions are necessary.
      
       * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup.
       * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in
                                              cgroup.
       * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to
                                      be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages
                                      to be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_all_unreclaimable() .. check whether all of the cgroup's pages
                                           are unreclaimable.
       * mem_cgroup_get_reclaim_priority() ...
       * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporary)
       * mem_cgroup_remember_reclaim_priority()
                                   .... record reclaim priority as
                                        zone->prev_priority.
                                        This value is used to calculate reclaim_mapped.
      
      [akpm@linux-foundation.org: fix unused var warning]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • per-zone and reclaim enhancements for memory controller: add scan_global_lru macro · 91a45470
      KAMEZAWA Hiroyuki authored
      
      
      This macro is used to detect whether a scan_control scans the global lru or
      a mem_cgroup lru.  It compiles to a static value (1) when the memory
      controller is not configured.  This may make the meaning obvious.
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
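The check added by commit 91a45470 distinguishes the two reclaim levels by whether the scan control carries a memory cgroup. A minimal Python model, assuming a `ScanControl` class with a single `mem_cgroup` field (both hypothetical stand-ins for the kernel's scan_control):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanControl:
    """Toy scan control; only the field relevant to this check is modelled."""
    mem_cgroup: Optional[str] = None  # None means "no cgroup attached"


def scan_global_lru(sc):
    """Return True for system/zone reclaim, False for cgroup-limit reclaim."""
    return sc.mem_cgroup is None


zone_reclaim = ScanControl()
cgroup_reclaim = ScanControl(mem_cgroup="group-a")
print(scan_global_lru(zone_reclaim), scan_global_lru(cgroup_reclaim))
```

When the memory controller is compiled out there is no cgroup to attach, which corresponds to the check always evaluating to true.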
    • memory cgroup enhancements: fix zone handling in try_to_free_mem_cgroup_page · 417eead3
      KAMEZAWA Hiroyuki authored
      
      
      Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all
      necessary zones, it is not necessary to use for_each_online_node.
      
      Moreover, for_each_online_node() makes the reclaim routine always start
      from node 0, which is not good.  This patch makes reclaim start from the
      caller's node and use the usual (default) zonelist order.
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kswapd should only wait on IO if there is IO · f1a9ee75
      Rik van Riel authored
      
      
      The current kswapd (and try_to_free_pages) code has an oddity where the
      code will wait on IO, even if there is no IO in flight.  This problem is
      notable especially when the system scans through many unfreeable pages,
      causing unnecessary stalls in the VM.
      
      Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
      will sleep if a significant number of pages are encountered that should be
      written out.  This gives kswapd a chance to write out those pages, while
      the direct reclaim task sleeps.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh authored
      
      
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over its limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1a1cd59
    • Balbir Singh's avatar
      Memory controller: make page_referenced() cgroup aware · bed7161a
      Balbir Singh authored
      
      
      Make page_referenced() cgroup aware.  Without this patch, page_referenced()
      can cause a page to be skipped while reclaiming pages.  This patch ensures
      that other cgroups do not hold pages in a particular cgroup hostage.  It
      is required to ensure that shared pages are freed from a cgroup when they
      are not actively referenced from the cgroup that brought them in.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bed7161a
    • Balbir Singh's avatar
      Memory controller: add per cgroup LRU and reclaim · 66e1707b
      Balbir Singh authored
      
      
      Add the page_cgroup to the per cgroup LRU.  The reclaim algorithm has
      been modified to make isolate_lru_pages() a pluggable component.  The
      scan_control data structure now accepts the cgroup on behalf of which
      reclaims are carried out.  try_to_free_pages() has been extended to become
      cgroup aware.
      
      [akpm@linux-foundation.org: fix warning]
      [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member]
      [bunk@kernel.org: make do_try_to_free_pages() static]
      [hugh@veritas.com: memcgroup: fix try_to_free order]
      [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66e1707b
  14. 19 Oct, 2007 1 commit
  15. 18 Oct, 2007 1 commit
  16. 17 Oct, 2007 2 commits
    • David Rientjes's avatar
      mm: test and set zone reclaim lock before starting reclaim · d773ed6b
      David Rientjes authored
      
      
      Introduces new zone flag interface for testing and setting flags:
      
      	int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
      
      Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
      called, this flag is test and set before starting zone reclaim.  Zone reclaim
      starts in __alloc_pages() when a zone's watermark fails and the system is in
      zone_reclaim_mode.  If it's already in reclaim, there's no need to start again
      so it is simply considered full for that allocation attempt.
      
      There is a change of behavior with regard to concurrent zone shrinking.  It is
      now possible for try_to_free_pages() or kswapd to already be shrinking a
      particular zone when __alloc_pages() starts zone reclaim.  In this case, it is
      possible for two concurrent threads to invoke shrink_zone() for a single zone.
      
      This change forbids a zone to be in zone reclaim twice, which was always the
      behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
      when starting zone reclaim.
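      
      As a rough illustration of the test-and-set pattern described above, the
      following userspace sketch uses C11 atomics in place of the kernel's bit
      operations.  The struct layout, the reclaim_runs counter and the
      zone_reclaim_sketch() caller are assumptions made for this example, not
      the code from the patch.

```c
#include <stdatomic.h>

typedef enum {
	ZONE_ALL_UNRECLAIMABLE,
	ZONE_RECLAIM_LOCKED,
} zone_flags_t;

/* Hypothetical, cut-down stand-in for the kernel's struct zone. */
struct zone {
	atomic_ulong flags;
	unsigned long reclaim_runs;	/* illustration only: counts reclaim passes */
};

/* Atomically set the flag and return non-zero if it was already set. */
static int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)
{
	unsigned long mask = 1UL << flag;

	return (atomic_fetch_or(&zone->flags, mask) & mask) != 0;
}

static void zone_clear_flag(struct zone *zone, zone_flags_t flag)
{
	atomic_fetch_and(&zone->flags, ~(1UL << flag));
}

/*
 * Hypothetical caller: returns 1 if this thread ran zone reclaim, 0 if
 * the zone was already in zone reclaim and is therefore treated as full
 * for this allocation attempt.
 */
static int zone_reclaim_sketch(struct zone *zone)
{
	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
		return 0;

	zone->reclaim_runs++;	/* stands in for the actual reclaim work */

	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
	return 1;
}
```

      Because the flag is taken with a single atomic operation, two threads
      cannot both observe it clear and both enter zone reclaim; the loser
      simply falls through to its next allocation option.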
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d773ed6b
    • David Rientjes's avatar
      oom: change all_unreclaimable zone member to flags · e815af95
      David Rientjes authored
      
      
      Convert the int all_unreclaimable member of struct zone to unsigned long
      flags.  This can now be used to specify several different zone flags such as
      all_unreclaimable and reclaim_in_progress, which can now be removed and
      converted to a per-zone flag.
      
      Flags are set and cleared as follows:
      
      	zone_set_flag(struct zone *zone, zone_flags_t flag)
      	zone_clear_flag(struct zone *zone, zone_flags_t flag)
      
      Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
      which have the same semantics as the old zone->all_unreclaimable and
      zone->reclaim_in_progress, respectively.  Also converts all current users that
      set or clear either flag to use the new interface.
      
      Helper functions are defined to test the flags:
      
      	int zone_is_all_unreclaimable(const struct zone *zone)
      	int zone_is_reclaim_locked(const struct zone *zone)
      
      All flag operators are of the atomic variety because some current readers
      do not take zone->lock.
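      
      A minimal userspace sketch of how such a flags word and its helpers could
      look, assuming C11 atomics as a stand-in for the kernel's atomic bit
      operations.  The cut-down struct zone and the helper bodies are
      assumptions for illustration; only the function names and signatures
      come from the description above.

```c
#include <stdatomic.h>

typedef enum {
	ZONE_ALL_UNRECLAIMABLE,	/* replaces zone->all_unreclaimable */
	ZONE_RECLAIM_LOCKED,	/* replaces zone->reclaim_in_progress */
} zone_flags_t;

/* Hypothetical, cut-down stand-in for the kernel's struct zone. */
struct zone {
	atomic_ulong flags;
};

static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
{
	atomic_fetch_or(&zone->flags, 1UL << flag);
}

static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
{
	atomic_fetch_and(&zone->flags, ~(1UL << flag));
}

/* Read one bit without taking any lock; the load itself is atomic. */
static inline int zone_flag_is_set(const struct zone *zone, zone_flags_t flag)
{
	/* The cast drops const only so that atomic_load() accepts the pointer. */
	unsigned long v = atomic_load((atomic_ulong *)&zone->flags);

	return (int)((v >> flag) & 1UL);
}

static inline int zone_is_all_unreclaimable(const struct zone *zone)
{
	return zone_flag_is_set(zone, ZONE_ALL_UNRECLAIMABLE);
}

static inline int zone_is_reclaim_locked(const struct zone *zone)
{
	return zone_flag_is_set(zone, ZONE_RECLAIM_LOCKED);
}
```

      Packing both states into one word means further zone flags can be added
      as new enum values without growing struct zone.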
      
      [akpm@linux-foundation.org: add needed include]
      Cc: Andrea Arcangeli <andrea@suse.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e815af95
  17. 16 Oct, 2007 5 commits
    • Andrea Arcangeli's avatar
      make swappiness safer to use · 4106f83a
      Andrea Arcangeli authored
      
      
      Swappiness isn't a safe sysctl.  Setting it to 0, for example, can hang a
      system.  That's a corner case, but even setting it to 10 or lower can waste
      enormous amounts of cpu without making much progress.  We have customers who
      want to use swappiness but can't because of the current implementation (if
      you change it so the system stops swapping, it really stops swapping, and
      nothing works sanely anymore if you really had to swap something to make
      progress).
      
      This patch from Kurt Garloff makes swappiness safer to use (no more huge
      cpu usage or hangs with low swappiness values).
      
      I think prev_priority can also be nuked since it wastes 4 bytes per
      zone (that would be an incremental patch, but I'm waiting for
      nr_scan_[in]active to be nuked first for similar reasons).  Clearly
      somebody at some point noticed how broken that thing was and had to add
      min(priority, prev_priority) to give it some reliability, but they didn't
      go the last mile and nuke prev_priority too.  Calculating distress only as
      a function of the non-racy priority is correct and surely more than enough
      without having to add randomness into the equation.
      
      Patch is tested on older kernels but it compiles and it's quite simple
      so...
      
      Overall I'm not very satisfied with the swappiness tweak, since it doesn't
      really do anything with the dirty pagecache that may be inactive.  We need
      another kind of tweak that controls the inactive scan and tunes the
      can_writepage feature (not yet in mainline despite having been submitted a
      few times), not only the active one.  That new tweak will tell the kernel
      how hard to scan the inactive list for pure clean pagecache (something the
      mainline kernel isn't capable of yet).  We already have that feature
      working in all our enterprise kernels with a reasonable default tuning;
      without it they can't even run a read-only backup with tar without
      triggering huge write I/O.  I think it should be available in mainline
      later too.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Kurt Garloff <garloff@suse.de>
      Signed-off-by: Andrea Arcangeli <andrea@suse.de>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4106f83a
    • Yasunori Goto's avatar
      Fix panic of cpu online with memory less node · 58c0a4a7
      Yasunori Goto authored
      
      
      When a cpu is onlined on a box with a memory-less node, the kernel panics
      because it dereferences the NULL pgdat->kswapd pointer.  Currently kswapd
      runs only on nodes which have memory, so calling set_cpus_allowed() is not
      necessary for a memory-less node.
      
      This is the fix for it.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58c0a4a7
    • Christoph Lameter's avatar
      Memoryless nodes: Add N_CPU node state · 37c0708d
      Christoph Lameter authored
      
      
      We need the check for a node with cpu in zone reclaim.  Zone reclaim will not
      allow remote zone reclaim if a node has a cpu.
      
      [Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37c0708d
    • Christoph Lameter's avatar
      Memoryless nodes: No need for kswapd · 9422ffba
      Christoph Lameter authored
      
      
      A node without memory does not need a kswapd.  So use the memory map instead
      of the online map when starting kswapd.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9422ffba
    • Rik van Riel's avatar
      mm: prevent kswapd from freeing excessive amounts of lowmem · 32a4330d
      Rik van Riel authored
      
      
      The current VM can get itself into trouble fairly easily on systems with a
      small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory.
      
      On one side, page_alloc() will allocate down to zone->pages_low, while on
      the other side, kswapd() and balance_pgdat() will try to free memory from
      every zone, until every zone has more free pages than zone->pages_high.
      
      Highmem can be filled up to zone->pages_low with page tables, ramfs,
      vmalloc allocations and other unswappable things quite easily and without
      many bad side effects, since we still have a huge ZONE_NORMAL to do future
      allocations from.
      
      However, as long as the number of free pages in the highmem zone is below
      zone->pages_high, kswapd will continue swapping things out from
      ZONE_NORMAL, too!
      
      Sami Farin managed to get his system into a stage where kswapd had freed
      about 700MB of low memory and was still "going strong".
      
      The attached patch will make kswapd stop paging out data from zones when
      there is more than enough memory free.  We do go above zone->pages_high in
      order to keep pressure between zones equal in normal circumstances, but the
      patch should prevent the kind of excesses that made Sami's computer totally
      unusable.
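      
      The decision described above might be sketched roughly as follows.  This
      is a simplified userspace model, not the patch itself: struct
      zone_sketch, zones_to_shrink() and the EXCESS_FACTOR multiplier are
      assumptions chosen for illustration, since the message does not state
      the exact threshold used.

```c
#include <stdbool.h>

/* Hypothetical, cut-down view of the per-zone counters involved. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long pages_high;
};

/* Assumed multiple of pages_high that counts as "more than enough" free. */
#define EXCESS_FACTOR 8UL

static bool zone_balanced(const struct zone_sketch *z)
{
	return z->free_pages > z->pages_high;
}

static bool zone_has_excess(const struct zone_sketch *z)
{
	return z->free_pages > EXCESS_FACTOR * z->pages_high;
}

/*
 * Mark which zones one kswapd pass would shrink and return how many.
 * If any zone is below pages_high, every zone is put under pressure
 * (so zones may end up above pages_high), except zones that already
 * have far more free memory than they need.
 */
static int zones_to_shrink(const struct zone_sketch *zones, int nr,
			   bool *shrink)
{
	bool pressure = false;
	int i, count = 0;

	for (i = 0; i < nr; i++)
		if (!zone_balanced(&zones[i]))
			pressure = true;

	for (i = 0; i < nr; i++) {
		shrink[i] = pressure && !zone_has_excess(&zones[i]);
		if (shrink[i])
			count++;
	}
	return count;
}
```

      With a nearly full highmem zone and a lowmem zone that already has a
      large surplus, only the highmem zone is selected, which is the behaviour
      the patch is aiming for.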
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32a4330d
  18. 23 Aug, 2007 2 commits
    • Andy Whitcroft's avatar
      synchronous lumpy reclaim: wait for page writeback when directly reclaiming contiguous areas · c661b078
      Andy Whitcroft authored
      
      
      Lumpy reclaim works by selecting a lead page from the LRU list and then
      selecting pages for reclaim from the order-aligned area of pages.  In the
      situation where all pages in that region are inactive and not referenced by
      any process over time, it works well.
      
      In the situation where there is even light load on the system, the pages may
      not free quickly.  Out of an area of 1024 pages, maybe only 950 of them are
      freed when the allocation attempt occurs because lumpy reclaim returned early.
      This patch alters the behaviour of direct reclaim for large contiguous
      blocks.
      
      The first attempt to call shrink_page_list() is asynchronous but if it fails,
      the pages are submitted a second time and the calling process waits for the IO
      to complete.  This may stall allocators waiting for contiguous memory but that
      should be expected behaviour for high-order users.  It is preferable behaviour
      to potentially queueing unnecessary areas for IO.  Note that kswapd will not
      stall in this fashion.
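      
      The two-pass behaviour described above can be modelled in userspace
      roughly as below.  Everything here (struct page_sketch, enum io_mode,
      shrink_list_sketch() and reclaim_area_sketch()) is a hypothetical
      simplification; in particular, "waiting for IO" is modelled by simply
      clearing a flag, and the restriction to large contiguous requests is
      left out.

```c
#include <stdbool.h>

enum io_mode { IO_ASYNC, IO_SYNC };

/* Hypothetical, cut-down page state for this model. */
struct page_sketch {
	bool under_writeback;
	bool freed;
};

/* Try to free every page in the area; return how many this pass freed. */
static unsigned long shrink_list_sketch(struct page_sketch *pages,
					unsigned long nr, enum io_mode mode)
{
	unsigned long i, freed = 0;

	for (i = 0; i < nr; i++) {
		if (pages[i].freed)
			continue;
		if (pages[i].under_writeback) {
			if (mode == IO_ASYNC)
				continue;	/* do not stall on the first pass */
			/* Stands in for waiting until writeback completes. */
			pages[i].under_writeback = false;
		}
		pages[i].freed = true;
		freed++;
	}
	return freed;
}

/*
 * First pass is asynchronous.  If it did not free the whole area and
 * the caller is a direct reclaimer (not kswapd), resubmit the pages
 * and wait for the IO.
 */
static unsigned long reclaim_area_sketch(struct page_sketch *pages,
					 unsigned long nr, bool is_kswapd)
{
	unsigned long freed = shrink_list_sketch(pages, nr, IO_ASYNC);

	if (freed < nr && !is_kswapd)
		freed += shrink_list_sketch(pages, nr, IO_SYNC);
	return freed;
}
```

      In this model a direct reclaimer ends up with the whole area free at the
      cost of a stall, while kswapd returns after the asynchronous pass.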
      
      [apw@shadowen.org: update to version 2]
      [apw@shadowen.org: update to version 3]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c661b078
    • Andy Whitcroft's avatar
      synchronous lumpy reclaim: ensure we count pages transitioning inactive via clear_active_flags · e9187bdc
      Andy Whitcroft authored
      
      
      As pointed out by Mel, when reclaim is applied at higher orders a significant
      amount of IO may be started.  As this takes finite time to drain, reclaim will
      consider more areas than ultimately needed to satisfy the request.  This leads
      to more reclaim than strictly required and reduced success rates.
      
      I was able to confirm Mel's test results on systems locally.  These show that
      even under light load the success rates drop off far more than expected.
      Testing with a modified version of his patch (which follows) I was able to
      allocate almost all of ZONE_MOVABLE with a near idle system.  I ran 5 test
      passes sequentially following system boot (the system has 29 hugepages in
      ZONE_MOVABLE):
      
        2.6.23-rc1              11  8  6  7  7
        sync_lumpy              28 28 29 29 26
      
      These show that, although hugely better than the near 0% success normally
      expected, we can only allocate about 1/4 of the zone.  Using synchronous
      reclaim for these allocations we get close to 100% as expected.
      
      I have also run our standard high order tests and these show no regressions in
      allocation success rates at rest, and some significant improvements under
      load.
      
      This patch:
      
      We are transitioning pages from active to inactive in clear_active_flags;
      those need counting as PGDEACTIVATE vm events.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9187bdc