1. 23 Mar, 2011 1 commit
    • Andi Kleen's avatar
      mm: add __GFP_OTHER_NODE flag · 78afd561
      Andi Kleen authored
      Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
      zone_statistics() that an allocation is on behalf of another thread.  This
      way the local and remote counters can be still correct, even when
      background daemons like khugepaged are changing memory mappings.
      
      This only affects the accounting, but I think it's worth doing that right
      to avoid confusing users.
      
      I first tried to just pass down the right node, but this required a lot of
      changes to pass down this parameter and at least one addition of a 10th
      argument to a 9 argument function.  Using the flag is a lot less
      intrusive.
      
      Open: should be also used for migration?
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      78afd561
  2. 14 Jan, 2011 3 commits
    • Andrea Arcangeli's avatar
      thp: transparent hugepage vmstat · 79134171
      Andrea Arcangeli authored
      Add hugepage stat information to /proc/vmstat and /proc/meminfo.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      79134171
    • Mel Gorman's avatar
      mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman authored
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b44129b3
    • Mel Gorman's avatar
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman authored
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reported-by: default avatarShaohua Li <shaohua.li@intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Tested-by: default avatarNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
  3. 18 Dec, 2010 1 commit
    • Christoph Lameter's avatar
      vmstat: User per cpu atomics to avoid interrupt disable / enable · 7c839120
      Christoph Lameter authored
      Currently the operations to increment vm counters must disable interrupts
      in order to not mess up their housekeeping of counters.
      
      So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
      count on preremption being disabled we still have some minor issues.
      The fetching of the counter thresholds is racy.
      A threshold from another cpu may be applied if we happen to be
      rescheduled on another cpu.  However, the following vmstat operation
      will then bring the counter again under the threshold limit.
      
      The operations for __xxx_zone_state are not changed since the caller
      has taken care of the synchronization needs (and therefore the cycle
      count is even less than the optimized version for the irq disable case
      provided here).
      
      The optimization using this_cpu_cmpxchg will only be used if the arch
      supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)
      
      The use of this_cpu_cmpxchg reduces the cycle count for the counter
      operations by %80 (inc_zone_page_state goes from 170 cycles to 32).
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      7c839120
  4. 17 Dec, 2010 2 commits
  5. 15 Dec, 2010 1 commit
    • Tejun Heo's avatar
      workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Tejun Heo authored
      cancel_rearming_delayed_work[queue]() has been superceded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatar"David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@suse.de>
      Acked-by: default avatarEvgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      afe2c511
  6. 02 Dec, 2010 1 commit
  7. 03 Nov, 2010 1 commit
    • Wu Fengguang's avatar
      vmstat: fix offset calculation on void* · ff8b16d7
      Wu Fengguang authored
      Fix regression introduced by commit 79da826a ("writeback: report
      dirty thresholds in /proc/vmstat").
      
      The incorrect pointer arithmetic can result in problems like this:
      
        BUG: unable to handle kernel paging request at 07c06d16
        IP: [<c050c336>] strnlen+0x6/0x20
        Call Trace:
         [<c050a249>] ? string+0x39/0xe0
         [<c042be6b>] ? __wake_up_common+0x4b/0x80
         [<c050afcc>] ? vsnprintf+0x1ec/0x380
         [<c04b380e>] ? seq_printf+0x2e/0x60
         [<c04829a6>] ? vmstat_show+0x26/0x30
         [<c04b3bb6>] ? seq_read+0xa6/0x380
         [<c04b3b10>] ? seq_read+0x0/0x380
         [<c04d5d2f>] ? proc_reg_read+0x5f/0x90
         [<c049c4a1>] ? vfs_read+0xa1/0x140
         [<c04d5cd0>] ? proc_reg_read+0x0/0x90
         [<c049c981>] ? sys_read+0x41/0x70
         [<c0402bd0>] ? sysenter_do_call+0x12/0x26
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff8b16d7
  8. 26 Oct, 2010 3 commits
  9. 10 Sep, 2010 2 commits
  10. 10 Aug, 2010 2 commits
    • KOSAKI Motohiro's avatar
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro authored
      Since 2.6.28 zone->prev_priority is unused. Then it can be removed
      safely. It reduce stack usage slightly.
      
      Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
      can be integrate again, it's useful. but four (or more) times trying
      haven't got good performance number. Thus I give up such approach.
      
      The rest of this changelog is notes on prev_priority and why it existed in
      the first place and why it might be not necessary any more. This information
      is based heavily on discussions between Andrew Morton, Rik van Riel and
      Kosaki Motohiro who is heavily quotes from.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the liklihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority prev_priority wouldn't works
      nowadays. First of all, current vmscan still a lot of UP centric code. it
      expose some weakness on some dozens CPUs machine. I think we need more and
      more improvement.
      
      The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
      and per-task-pressure a bit. example, prev_priority try to boost priority to
      other concurrent priority. but if the another task have mempolicy restriction,
      it is unnecessary, but also makes wrong big latency and exceeding reclaim.
      per-task based priority + prev_priority adjustment make the emulation of
      per-system pressure. but it have two issue 1) too rough and brutal emulation
      2) we need per-zone pressure, not per-system.
      
      Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
      2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
      but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
      system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
      prev_priority can't solve such multithreads workload issue. In other word,
      prev_priority concept assume the sysmtem don't have lots threads."
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25edde03
    • Minchan Kim's avatar
      mm: use for_each_online_cpu() in vmstat · 31f961a8
      Minchan Kim authored
      The sum_vm_events passes cpumask for for_each_cpu().  But it's useless
      since we have for_each_online_cpu.  Althougth it's tirival overhead, it's
      not good about coding consistency.
      
      Let's use for_each_online_cpu instead of for_each_cpu with cpumask
      argument.
      Signed-off-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31f961a8
  11. 25 May, 2010 4 commits
    • Mel Gorman's avatar
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman authored
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined if an allocation failed due to external fragmentation instead
      of low memory and if so, the calling process will compact until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  As each block is
      compacted, it is checked if a suitable page has been freed and if so, it
      returns.
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • Mel Gorman's avatar
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • Mel Gorman's avatar
      mm: export fragmentation index via debugfs · f1a5ab12
      Mel Gorman authored
      The fragmentation fragmentation index, is only meaningful if an allocation
      would fail and indicates what the failure is due to.  A value of -1 such
      as in many of the examples above states that the allocation would succeed.
       If it would fail, the value is between 0 and 1.  A value tending towards
      0 implies the allocation failed due to a lack of memory.  A value tending
      towards 1 implies it failed due to external fragmentation.
      
      For the most part, the huge page size will be the size of interest but not
      necessarily so it is exported on a per-order and per-zo basis via
      /sys/kernel/debug/extfrag/extfrag_index
      
      > cat /sys/kernel/debug/extfrag/extfrag_index
      Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
      Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1a5ab12
    • Mel Gorman's avatar
      mm: export unusable free space index via debugfs · d7a5752c
      Mel Gorman authored
      The unusable free space index measures how much of the available free
      memory cannot be used to satisfy an allocation of a given size and is a
      value between 0 and 1.  The higher the value, the more of free memory is
      unusable and by implication, the worse the external fragmentation is.  For
      the most part, the huge page size will be the size of interest but not
      necessarily so it is exported on a per-order and per-zone basis via
      /sys/kernel/debug/extfrag/unusable_index.
      
      > cat /sys/kernel/debug/extfrag/unusable_index
      Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
      Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
      
      [akpm@linux-foundation.org: Fix allnoconfig]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7a5752c
  12. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Guess-its-ok-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  13. 06 Mar, 2010 1 commit
  14. 05 Jan, 2010 2 commits
    • Christoph Lameter's avatar
      this_cpu: Remove pageset_notifier · ad596925
      Christoph Lameter authored
      Remove the pageset notifier since it only marks that a processor
      exists on a specific node. Move that code into the vmstat notifier.
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      ad596925
    • Christoph Lameter's avatar
      this_cpu: Page allocator conversion · 99dcc3e5
      Christoph Lameter authored
      Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
      
      This drastically reduces the size of struct zone for systems with large
      amounts of processors and allows placement of critical variables of struct
      zone in one cacheline even on very large systems.
      
      Another effect is that the pagesets of one processor are placed near one
      another. If multiple pagesets from different zones fit into one cacheline
      then additional cacheline fetches can be avoided on the hot paths when
      allocating memory from multiple zones.
      
      Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
      are reduced and we can drop the zone_pcp macro.
      
      Hotplug handling is also simplified since cpu alloc can bring up and
      shut down cpu areas for a specific cpu as a whole. So there is no need to
      allocate or free individual pagesets.
      
      V7-V8:
      - Explain chicken egg dilemmna with percpu allocator.
      
      V4-V5:
      - Fix up cases where per_cpu_ptr is called before irq disable
      - Integrate the bootstrap logic that was separate before.
      
      tj: Build failure in pageset_cpuup_callback() due to missing ret
          variable fixed.
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      99dcc3e5
  15. 15 Dec, 2009 2 commits
    • KOSAKI Motohiro's avatar
      vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596
      KOSAKI Motohiro authored
      If reclaim fails to make sufficient progress, the priority is raised.
      Once the priority is higher, kswapd starts waiting on congestion.
      However, if the zone is below the min watermark then kswapd needs to
      continue working without delay as there is a danger of an increased rate
      of GFP_ATOMIC allocation failure.
      
      This patch changes the conditions under which kswapd waits on congestion
      by only going to sleep if the min watermarks are being met.
      
      [mel@csn.ul.ie: add stats to track how relevant the logic is]
      [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb3ab596
    • Mel Gorman's avatar
      vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3
      Mel Gorman authored
      After kswapd balances all zones in a pgdat, it goes to sleep.  In the
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there are a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for sufficient length time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f50de2d3
  16. 29 Oct, 2009 1 commit
    • Tejun Heo's avatar
      percpu: make percpu symbols under kernel/ and mm/ unique · 1871e52c
      Tejun Heo authored
      This patch updates percpu related symbols under kernel/ and mm/ such
      that percpu symbols are unique and don't clash with local symbols.
      This serves two purposes of decreasing the possibility of global
      percpu symbol collision and allowing dropping per_cpu__ prefix from
      percpu symbols.
      
      * kernel/lockdep.c: s/lock_stats/cpu_lock_stats/
      
      * kernel/sched.c: s/init_rq_rt/init_rt_rq_var/	(any better idea?)
        		  s/sched_group_cpus/sched_groups/
      
      * kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a
      
      * kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
        		       s/watchdog_task/softlockup_watchdog/
      		       s/timestamp/ts/ for local variables
      
      * kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/
      
      * mm/slab.c: s/reap_work/slab_reap_work/
        	     s/reap_node/slab_reap_node/
      
      * mm/vmstat.c: local variable changed to avoid collision with vmstat_work
      
      Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
      which cause name clashes" patch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatar(slab/vmstat) Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      1871e52c
  17. 22 Sep, 2009 3 commits
  18. 17 Jun, 2009 5 commits
  19. 18 May, 2009 1 commit
    • Mel Gorman's avatar
      [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 · eb33575c
      Mel Gorman authored
      pfn_valid() is meant to be able to tell if a given PFN has valid memmap
      associated with it or not. In FLATMEM, it is expected that holes always
      have valid memmap as long as there is valid PFNs either side of the hole.
      In SPARSEMEM, it is assumed that a valid section has a memmap for the
      entire section.
      
      However, ARM and maybe other embedded architectures in the future free
      memmap backing holes to save memory on the assumption the memmap is never
      used. The page_zone linkages are then broken even though pfn_valid()
      returns true. A walker of the full memmap must then do this additional
      check to ensure the memmap they are looking at is sane by making sure the
      zone and PFN linkages are still valid. This is expensive, but walkers of
      the full memmap are extremely rare.
      
      This was caught before for FLATMEM and hacked around but it hits again for
      SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
      are totally screwed. This looks like a hatchet job but the reality is that
      any clean solution would end up consumning all the memory saved by punching
      these unexpected holes in the memmap. For example, we tried marking the
      memmap within the section invalid but the section size exceeds the size of
      the hole in most cases so pfn_valid() starts returning false where valid
      memmap exists. Shrinking the size of the section would increase memory
      consumption offsetting the gains.
      
      This patch identifies when an architecture is punching unexpected holes
      in the memmap that the memory model cannot automatically detect and sets
      ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
      which is the model sub-architecture this has been reported on but may expand
      later. When set, walkers of the full memmap must call memmap_valid_within()
      for each PFN and passing in what it expects the page and zone to be for
      that PFN. If it finds the linkages to be broken, it assumes the memmap is
      invalid for that PFN.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarRussell King <rmk+kernel@arm.linux.org.uk>
      eb33575c
  20. 03 Apr, 2009 1 commit
  21. 01 Apr, 2009 1 commit
  22. 30 Mar, 2009 1 commit