1. 11 Dec, 2014 8 commits
    • mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      
      
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
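
The speculative charge described in the last item can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions: `toy_counter`, `toy_try_charge()` and `toy_charge()` are hypothetical names, the counter has a single level (no parent hierarchy), and C11 atomics stand in for the kernel's atomic primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-level counter; no hierarchy walk is modelled here. */
struct toy_counter {
    atomic_long count;  /* pages currently charged */
    long limit;         /* hard limit, in pages */
};

/*
 * Charge speculatively, then roll back if the result exceeds the limit.
 * Between the add and the rollback, a concurrent smaller charge may see
 * the inflated count and fail even though it would have fit.
 */
static bool toy_try_charge(struct toy_counter *c, long nr_pages)
{
    long new_count = atomic_fetch_add(&c->count, nr_pages) + nr_pages;

    if (new_count > c->limit) {
        atomic_fetch_sub(&c->count, nr_pages);  /* undo the speculative charge */
        return false;
    }
    return true;
}

/* Unconditional variant, following the try_do()/do() naming split. */
static void toy_charge(struct toy_counter *c, long nr_pages)
{
    atomic_fetch_add(&c->count, nr_pages);
}
```

The design trades a small window of over-accounting for a charge path that needs no lock, which is where the bounded error mentioned above comes from.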
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: replace smp_read_barrier_depends() with lockless_dereference() · 8df0c2dc
      Pranith Kumar authored
      
      
      Recently lockless_dereference() was added which can be used in place of
      hard-coding smp_read_barrier_depends().  The following PATCH makes the
      change.
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: improve checking for invalid gfp_flags · c871ac4e
      Andrew Morton authored
      
      
      The code goes BUG, but doesn't tell us which bits were unexpectedly set.
      Print that out.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slub: fix format mismatches in slab_err() callers · f6edde9c
      Andrey Ryabinin authored
      
      
      Adding __printf(3, 4) to slab_err exposed the following warnings:
      
        mm/slub.c: In function `check_slab':
        mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->objects, maxobj);
            ^
        mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
        mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->inuse, page->objects);
            ^
        mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
      
        mm/slub.c: In function `on_freelist':
        mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
            "should be %d", page->objects, max_objects);
      
      Fix the first two warnings by removing the redundant s->name.
      Fix the last by changing the type of max_objects from unsigned long to int.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: reverse iteration on find_mergeable() · 54362057
      Joonsoo Kim authored
      Unlike SLUB, SLAB sometimes does not place the first object at the
      beginning of the slab.  This has caused an alignment problem since slab
      merging was introduced by commit 12220dea ("mm/slab: support slab merge").
      An alignment mismatch check was then added ("mm/slab: fix unalignment
      problem on Malta with EVA due to slab merge") to prevent merging in this
      case.

      This has the undesirable result that merging happens between infrequently
      used kmem_caches.  For example, kmem_caches with the same size of 256 bytes
      are merged into pool_workqueue rather than kmalloc-256, because the
      kmem_caches for kmalloc are at the tail of the list.

      To prevent this situation, this patch reverses the iteration order in
      find_mergeable() so that frequently used kmem_caches are found first.
      This change helps to merge a kmem_cache into frequently used kmem_caches,
      such as the kmalloc kmem_caches.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: print slabinfo header in seq show · 1df3b26f
      Vladimir Davydov authored
      
      
      Currently we print the slabinfo header in the seq start method, which
      makes it unusable for showing leaks, so we have leaks_show, which does
      practically the same as s_show except it doesn't show the header.
      
      However, we can print the header in the seq show method - we only need
      to check if the current element is the first on the list.  This will
      allow us to use the same set of seq iterators for both leaks and
      slabinfo reporting, which is nice.
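
The idea of emitting the header from the show step can be sketched in plain C as below. This is a userspace illustration only: `toy_cache`, `toy_show()` and `toy_dump()` are hypothetical names, and a singly linked list with `snprintf()` stands in for the kernel's list and seq_file machinery.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical list element standing in for a kmem_cache on the cache list. */
struct toy_cache {
    const char *name;
    struct toy_cache *next;
};

/*
 * "show" step: emit one record, preceded by the header only when the
 * current element is the first on the list.  Returns what snprintf() returns.
 */
static int toy_show(char *buf, size_t len, const struct toy_cache *head,
                    const struct toy_cache *elem)
{
    return snprintf(buf, len, "%s%s\n",
                    elem == head ? "# name\n" : "", elem->name);
}

/* Walk the list and concatenate the records; len must be at least 1. */
static void toy_dump(char *buf, size_t len, const struct toy_cache *head)
{
    size_t off = 0;

    buf[0] = '\0';
    for (const struct toy_cache *c = head; c && off < len; c = c->next)
        off += (size_t)toy_show(buf + off, len - off, head, c);
}
```

Because the header decision lives in the per-element step, a second reporter that does not want a header can reuse the same iteration and simply skip that check.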
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slab/slub: coding style: whitespaces and tabs mixture · b455def2
      LQYMGT authored
      
      
      Some code in mm/slab.c and mm/slub.c uses spaces for indentation.
      Clean it up.
      Signed-off-by: LQYMGT <lqymgt@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/CMA: fix boot regression due to physical address of high_memory · 6b101e2a
      Joonsoo Kim authored
      
      
      high_memory isn't direct mapped memory, so retrieving its physical address
      isn't appropriate.  But it is useful to check the physical address of the
      highmem boundary, so it's justifiable to get the physical address from it.
      On x86, there is a validation check under CONFIG_DEBUG_VIRTUAL, and it
      triggers the following boot failure reported by Ingo.
      
        ...
        BUG: Int 6: CR2 00f06f53
        ...
        Call Trace:
          dump_stack+0x41/0x52
          early_idt_handler+0x6b/0x6b
          cma_declare_contiguous+0x33/0x212
          dma_contiguous_reserve_area+0x31/0x4e
          dma_contiguous_reserve+0x11d/0x125
          setup_arch+0x7b5/0xb63
          start_kernel+0xb8/0x3e6
          i386_start_kernel+0x79/0x7d
      
      To fix the boot regression, this patch implements a workaround that avoids
      the validation check on x86 when retrieving the physical address of
      high_memory.  __pa_nodebug(), used by this patch, is implemented only on
      x86, so there is no choice but to use a dirty #ifdef.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Ingo Molnar <mingo@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 Dec, 2014 5 commits
    • slab: fix nodeid bounds check for non-contiguous node IDs · 7c3fbbdd
      Paul Mackerras authored
      The bounds check for nodeid in ____cache_alloc_node gives false
      positives on machines where the node IDs are not contiguous, leading to
      a panic at boot time.  For example, on a POWER8 machine the node IDs are
      typically 0, 1, 16 and 17.  This means that num_online_nodes() returns
      4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON
      triggers, like this:
      
        kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
        Call Trace:
          .____cache_alloc_node+0x5c/0x270 (unreliable)
          .kmem_cache_alloc_node_trace+0xdc/0x360
          .init_list+0x3c/0x128
          .kmem_cache_init+0x1dc/0x258
          .start_kernel+0x2a0/0x568
          start_here_common+0x20/0xa8
      
      To fix this, we instead compare the nodeid with MAX_NUMNODES, and
      additionally make sure it isn't negative (since nodeid is an int).  The
      check is there mainly to protect the array dereference in the get_node()
      call in the next line, and the array being dereferenced is of size
      MAX_NUMNODES.  If the nodeid is in range but invalid (for example if the
      node is off-line), the BUG_ON in the next line will catch that.
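
The difference between the two bounds checks can be sketched as follows. This is an illustration with hypothetical names (`toy_nodeid_ok_old()`, `toy_nodeid_ok_new()`, `TOY_MAX_NUMNODES`); in particular, the exact comparison operator used by the old check is an assumption, since the message above only says the node ID was compared against num_online_nodes().

```c
#include <stdbool.h>

#define TOY_MAX_NUMNODES 32  /* hypothetical stand-in for MAX_NUMNODES */

/*
 * Old-style check (comparison operator assumed): valid only if the ID
 * does not exceed the number of online nodes.  This breaks when node
 * IDs are sparse, e.g. 0, 1, 16 and 17 with four nodes online.
 */
static bool toy_nodeid_ok_old(int nodeid, int nr_online_nodes)
{
    return nodeid <= nr_online_nodes;
}

/*
 * New-style check: the ID only has to be a valid index into an array
 * of TOY_MAX_NUMNODES entries, and must not be negative.
 */
static bool toy_nodeid_ok_new(int nodeid)
{
    return nodeid >= 0 && nodeid < TOY_MAX_NUMNODES;
}
```

With four online nodes, node ID 16 fails the old check but passes the new one, which matches the POWER8 example above.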
      
      Fixes: 14e50c6a ("mm: slab: Verify the nodeid passed to ____cache_alloc_node")
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix anon_vma_clone() error treatment · c4ea95d7
      Daniel Forrest authored
      Andrew Morton noticed that the error return from anon_vma_clone() was
      being dropped and replaced with -ENOMEM (which is not itself a bug
      because the only error return value from anon_vma_clone() is -ENOMEM).
      
      I did an audit of callers of anon_vma_clone() and discovered an actual
      bug where the error return was being lost.  In __split_vma(), between
      Linux 3.11 and 3.12 the code was changed so the err variable is used
      before the call to anon_vma_clone() and the default initial value of
      -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
      success since err at this point is now zero.
      
      Below is a patch which fixes this bug and also propagates the error
      return value from anon_vma_clone() in all cases.
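
The lost-error pattern described above can be reproduced in a few lines of standalone C. This is a simplified sketch, not the kernel code: `toy_clone()`, `toy_dup_policy()`, `toy_split_buggy()` and `toy_split_fixed()` are hypothetical stand-ins for the real call sequence in __split_vma().

```c
#include <errno.h>

/* Hypothetical stand-ins: a clone step that may fail and an earlier step that succeeds. */
static int toy_clone(int fail) { return fail ? -ENOMEM : 0; }
static int toy_dup_policy(void) { return 0; }

/* Buggy shape: err is reused before the clone, so its -ENOMEM default is gone. */
static int toy_split_buggy(int clone_fails)
{
    int err = -ENOMEM;

    err = toy_dup_policy();     /* overwrites the default with 0 */
    if (err)
        return err;
    if (toy_clone(clone_fails))
        return err;             /* bug: err is 0 here, so failure looks like success */
    return 0;
}

/* Fixed shape: capture and propagate the clone's own return value. */
static int toy_split_fixed(int clone_fails)
{
    int err;

    err = toy_dup_policy();
    if (err)
        return err;
    err = toy_clone(clone_fails);
    if (err)
        return err;
    return 0;
}
```

Propagating the callee's return value, rather than relying on a preset default, keeps the error path correct even if another use of `err` is added earlier in the function.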
      
      Fixes: ef0855d3 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
      Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Hartrick <tim@edgecast.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix swapoff hang after page migration and fork · 2022b4d1
      Hugh Dickins authored
      
      
      I've been seeing swapoff hangs in recent testing: it's cycling around
      trying unsuccessfully to find an mm for some remaining pages of swap.
      
      I have been exercising swap and page migration more heavily recently,
      and now notice a long-standing error in copy_one_pte(): it's trying to
      add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
      so even when it's a migration entry or an hwpoison entry.
      
      Which wouldn't matter much, except it adds dst_mm next to src_mm,
      assuming src_mm is already on the mmlist: which may not be so.  Then if
      pages are later swapped out from dst_mm, swapoff won't be able to find
      where to replace them.
      
      There's already a !non_swap_entry() test for stats: move that up before
      the swap_duplicate() and the addition to mmlist.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Kelley Nielsen <kelleynnn@gmail.com>
      Cc: <stable@vger.kernel.org>	[2.6.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmpressure.c: fix race in vmpressure_work_fn() · 91b57191
      Andrew Morton authored
      On some Android devices, there will be a "divide by zero" exception.
      vmpr->scanned could be zero before spin_lock(&vmpr->sr_lock).

      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051

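
One way to avoid this kind of race is to snapshot the counters while holding the lock and test the snapshot for zero before dividing. The sketch below illustrates that pattern in userspace C11; it is not the kernel fix itself. `toy_vmpr`, `toy_lock()`, `toy_unlock()` and `toy_pressure()` are hypothetical, and an `atomic_flag` spinlock stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical state: counters protected by a simple spinlock. */
struct toy_vmpr {
    atomic_flag lock;
    unsigned long scanned;
    unsigned long reclaimed;
};

static void toy_lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) { } }
static void toy_unlock(atomic_flag *l) { atomic_flag_clear(l); }

/*
 * Take a private snapshot under the lock, reset the shared counters,
 * and only divide by the snapshot once it is known to be non-zero.
 * Returns a percentage, or -1 when nothing was scanned.
 */
static long toy_pressure(struct toy_vmpr *v)
{
    unsigned long scanned, reclaimed;

    toy_lock(&v->lock);
    scanned = v->scanned;
    reclaimed = v->reclaimed;
    v->scanned = 0;
    v->reclaimed = 0;
    toy_unlock(&v->lock);

    if (!scanned)
        return -1;  /* nothing to compute; avoids the divide by zero */
    return (long)(100 * (scanned - reclaimed) / scanned);
}
```

Checking the shared field before taking the lock and dividing by it afterwards leaves a window in which another context can reset it to zero; working from the locked snapshot closes that window.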
      [akpm@linux-foundation.org: neaten]
      Reported-by: ji_ang <ji_ang@163.com>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: frontswap: invalidate expired data on a dup-store failure · fb993fa1
      Weijie Yang authored
      
      
      If a frontswap dup-store fails, it should invalidate the expired page
      in the backend, or it could trigger data corruption.  For example:
       1. use zswap as the frontswap backend with the writeback feature
       2. store a swap page (version_1) to entry A: success
       3. dup-store a newer page (version_2) to the same entry A: fail
       4. use __swap_writepage() to write the version_2 page to the swapfile: success
       5. zswap shrinks and writes the version_1 page back to the swapfile
       6. the version_2 page is overwritten by version_1: data corruption

      This patch fixes the issue by invalidating expired data immediately
      when a dup-store failure occurs.
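
The rule "invalidate the old copy when a dup-store fails" can be sketched with a toy in-memory backend. All names here (`toy_backend`, `toy_backend_store()`, `toy_backend_invalidate()`, `toy_frontswap_store()`) are hypothetical, and the `backend_fails` flag merely simulates a failing store; this is not the frontswap API.

```c
#include <stdbool.h>

#define TOY_SLOTS 8

/* Hypothetical backend: one optional stored version per swap slot. */
struct toy_backend {
    bool present[TOY_SLOTS];
    int  version[TOY_SLOTS];
};

/* Backend store; `backend_fails` simulates e.g. an allocation failure. */
static int toy_backend_store(struct toy_backend *b, int slot, int version,
                             bool backend_fails)
{
    if (backend_fails)
        return -1;
    b->present[slot] = true;
    b->version[slot] = version;
    return 0;
}

static void toy_backend_invalidate(struct toy_backend *b, int slot)
{
    b->present[slot] = false;
}

/*
 * Front-end store: if the slot already held data (a dup-store) and the
 * new store fails, drop the stale copy so it can never be written back
 * over the newer page that goes to the swapfile instead.
 */
static int toy_frontswap_store(struct toy_backend *b, int slot, int version,
                               bool backend_fails)
{
    bool dup = b->present[slot];
    int ret = toy_backend_store(b, slot, version, backend_fails);

    if (ret && dup)
        toy_backend_invalidate(b, slot);
    return ret;
}
```

After a failed dup-store the slot is empty in the backend, so a later writeback pass has no version_1 copy left to write over version_2.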
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 17 Nov, 2014 2 commits
    • x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen authored
      
      
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace uses the prctl() introduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an entire bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happens under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • mmu_gather: move minimal range calculations into generic code · fb7332a9
      Will Deacon authored
      
      
      On architectures with hardware broadcasting of TLB invalidation messages,
      it makes sense to reduce the range of the mmu_gather structure when
      unmapping page ranges based on the dirty address information passed to
      tlb_remove_tlb_entry.
      
      arm64 already does this by directly manipulating the start/end fields
      of the gather structure, but this confuses the generic code which
      does not expect these fields to change and can end up calculating
      invalid, negative ranges when forcing a flush in zap_pte_range.
      
      This patch moves the minimal range calculation out of the arm64 code
      and into the generic implementation, simplifying zap_pte_range in the
      process (which no longer needs to care about start/end, since they will
      point to the appropriate ranges already). With the range being tracked
      by core code, the need_flush flag is dropped in favour of checking that
      the end of the range has actually been set.
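
The range tracking described above, including the use of an unset end in place of a separate need_flush flag, can be sketched as follows. The names (`toy_gather`, `toy_reset_range()`, `toy_adjust_range()`, `toy_needs_flush()`) and the constants are hypothetical and only illustrate the idea.

```c
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL
#define TOY_TASK_SIZE (~0UL)  /* hypothetical upper bound used as the "empty" start */

/* Hypothetical gather structure tracking the minimal range to flush. */
struct toy_gather {
    unsigned long start;
    unsigned long end;
};

/* Empty range: start at the maximum, end at zero. */
static void toy_reset_range(struct toy_gather *tlb)
{
    tlb->start = TOY_TASK_SIZE;
    tlb->end = 0;
}

/* Grow the range to cover the page at addr. */
static void toy_adjust_range(struct toy_gather *tlb, unsigned long addr)
{
    if (addr < tlb->start)
        tlb->start = addr;
    if (addr + TOY_PAGE_SIZE > tlb->end)
        tlb->end = addr + TOY_PAGE_SIZE;
}

/* No separate flag: a flush is needed only if the end has been set. */
static bool toy_needs_flush(const struct toy_gather *tlb)
{
    return tlb->end != 0;
}
```

Since every update only widens the range with min/max operations, the tracked range cannot become inverted the way it could when architecture code rewrote start and end behind the generic code's back.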
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
  4. 14 Nov, 2014 11 commits
    • mem-hotplug: reset node present pages when hot-adding a new pgdat · 0bd85420
      Tang Chen authored
      
      
      When memory is hot-added, all of it is in the offline state.  So clear
      all zones' present_pages, because they will be updated in online_pages()
      and offline_pages().  Otherwise, /proc/zoneinfo reports corrupted values:
      
      When the memory of node2 is offline:
      
        # cat /proc/zoneinfo
        ......
        Node 2, zone   Movable
        ......
              spanned  8388608
              present  8388608
              managed  0
      
      When we online memory on node2:
      
        # cat /proc/zoneinfo
        ......
        Node 2, zone   Movable
        ......
              spanned  8388608
              present  16777216
              managed  8388608
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mem-hotplug: reset node managed pages when hot-adding a new pgdat · f784a3f1
      Tang Chen authored
      
      
      In free_area_init_core(), zone->managed_pages is set to an approximate
      value for lowmem, and will be adjusted when the bootmem allocator frees
      pages into the buddy system.
      
      But free_area_init_core() is also called by hotadd_new_pgdat() when
      hot-adding memory.  As a result, zone->managed_pages of the newly added
      node's pgdat is set to an approximate value in the very beginning.
      
      Even if the memory on that node has not been onlined,
      /sys/devices/system/node/nodeXXX/meminfo shows a wrong value:

        hot-add node2 (memory not onlined)
        cat /sys/devices/system/node/node2/meminfo
        Node 2 MemTotal:       33554432 kB
        Node 2 MemFree:               0 kB
        Node 2 MemUsed:        33554432 kB
        Node 2 Active:                0 kB
      
      This patch fixes the problem by resetting the node's managed pages to 0
      after hot-adding a new node.
      
      1. Move reset_managed_pages_done from reset_node_managed_pages() to
         reset_all_zones_managed_pages()
      2. Make reset_node_managed_pages() non-static
      3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
         is initialized
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/debug-pagealloc: correct freepage accounting and order resetting · 57cbc87e
      Joonsoo Kim authored
      
      
      One thing this patch does is fix freepage accounting.  If we clear a
      guard page and link it onto the isolate buddy list, we should not
      increase the freepage count.  This patch adds a conditional branch to
      skip counting in this case.  Without this patch, overcounting happens
      frequently if the guard order is set and CMA is used.

      The other fix in this patch is the target of the order reset.  In
      __free_one_page(), we check whether the buddy page is a guard page.
      If so, we should clear the guard attribute on the buddy page and reset
      its order to 0.  But the current code resets the original page's order
      rather than the buddy's.  This may not cause any problem, because the
      whole merged page's order will be re-assigned soon, but it is better
      to correct the code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: prevent infinite loop in compact_zone · 1d5bfe1f
      Vlastimil Babka authored
      Several people have reported occasionally seeing processes stuck in
      compact_zone(), even triggering soft lockups, in 3.18-rc2+.
      
      Testing a revert of commit e14c720e ("mm, compaction: remember
      position within pageblock in free pages scanner") fixed the issue,
      although the stuck processes do not appear to involve the free scanner.
      
      Finally, by code inspection, the bug was found in isolate_migratepages()
      which uses a slightly different condition to detect if the migration and
      free scanners have met, than compact_finished().  That has not been a
      problem until commit e14c720e allowed the free scanner position
      between individual invocations to be in the middle of a pageblock.
      
      In a relatively rare case, the migration scanner position can end up at
      the beginning of a pageblock, with the free scanner position in the
      middle of the same pageblock.  If it's the migration scanner's turn,
      isolate_migratepages() exits immediately (without updating the
      position), while compact_finished() decides to continue compaction,
      resulting in a potentially infinite loop.  The system can recover only
      if another process creates enough high-order pages to make the watermark
      checks in compact_finished() pass.
      
      This patch fixes the immediate problem by bumping the migration
      scanner's position to meet the free scanner in isolate_migratepages(),
      when both are within the same pageblock.  This causes compact_finished()
      to terminate properly.  A more robust check in compact_finished() is
      planned as a cleanup for better future maintainability.
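
The mismatch between the two "scanners have met" conditions can be illustrated with a small model. This is a simplified sketch under assumptions: `toy_cc`, `toy_scanners_met()`, `toy_isolate()` and the pageblock size of 512 pages are hypothetical, and the migration scanner is modelled as advancing one whole pageblock per turn.

```c
#include <stdbool.h>

#define TOY_PAGEBLOCK 512UL  /* hypothetical number of pages per pageblock */

/* Hypothetical scanner positions, as page frame numbers. */
struct toy_cc {
    unsigned long migrate_pfn;
    unsigned long free_pfn;
};

/* Termination test: the scanners have met once free_pfn <= migrate_pfn. */
static bool toy_scanners_met(const struct toy_cc *cc)
{
    return cc->free_pfn <= cc->migrate_pfn;
}

/*
 * One turn of the migration scanner.  It only scans a pageblock whose end
 * does not pass the free scanner.  Without the fix, it leaves its position
 * untouched when the free scanner sits inside the current pageblock; with
 * the fix, it bumps its position to meet the free scanner.
 */
static void toy_isolate(struct toy_cc *cc, bool fixed)
{
    unsigned long end_pfn =
        (cc->migrate_pfn / TOY_PAGEBLOCK + 1) * TOY_PAGEBLOCK;

    if (end_pfn <= cc->free_pfn) {
        cc->migrate_pfn = end_pfn;       /* scanned one whole pageblock */
        return;
    }
    if (fixed)
        cc->migrate_pfn = cc->free_pfn;  /* meet the free scanner */
}
```

With the migration scanner at the start of a pageblock and the free scanner in the middle of it, the unfixed turn makes no progress while the termination test still reports that the scanners have not met, so the outer loop would spin.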
      
      Fixes: e14c720e ("mm, compaction: remember position within pageblock in free pages scanner")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: P. Christeas <xrg@linux.gr>
      Tested-by: P. Christeas <xrg@linux.gr>
      Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2
      Reported-by: Norbert Preining <preining@logic.at>
      Tested-by: Norbert Preining <preining@logic.at>
      Link: https://lkml.org/lkml/2014/11/4/904
      Reported-by: Pavel Machek <pavel@ucw.cz>
      Link: https://lkml.org/lkml/2014/11/7/164
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: alloc_contig_range: demote pages busy message from warn to info · dae803e1
      Michal Nazarewicz authored
      
      
      Having test_pages_isolated failure message as a warning confuses users
      into thinking that it is more serious than it really is.  In reality, if
      called via CMA, allocation will be retried so a single
      test_pages_isolated failure does not prevent allocation from succeeding.
      
      Demote the warning message to an info message and reformat it such that
      the text "failed" does not appear and instead a less worrying "PFNS
      busy" is used.
      
      This message is trivially reproducible on a 10GB x86 machine on 3.16.y
      kernels configured with CONFIG_DMA_CMA.
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dae803e1
    • Joonsoo Kim's avatar
      mm/slab: fix unalignment problem on Malta with EVA due to slab merge · 95069ac8
      Joonsoo Kim authored

      Unlike in SLUB, objects in SLAB do not always start at the beginning of
      the slab.  This causes an alignment problem now that slab merging is
      supported by commit 12220dea ("mm/slab: support slab merge").

      The following is the report from Markos of a failure to boot on Malta
      with EVA.
      
          Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
          pid_max: default: 32768 minimum: 301
          Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
          Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
          Kernel bug detected[#1]:
          CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea #1631
          task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
          epc   : 80141190 alloc_unbound_pwq+0x234/0x304
              Not tainted
          ra    : 80141184 alloc_unbound_pwq+0x228/0x304
          Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
          Call Trace:
            alloc_unbound_pwq+0x234/0x304
            apply_workqueue_attrs+0x11c/0x294
            __alloc_workqueue_key+0x23c/0x470
            init_workqueues+0x320/0x400
            do_one_initcall+0xe8/0x23c
            kernel_init_freeable+0x9c/0x224
            kernel_init+0x10/0x100
            ret_from_kernel_thread+0x14/0x1c
          [ end trace cb88537fdc8fa200 ]
          Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      
      alloc_unbound_pwq() allocates a slab object from the pool_workqueue
      cache.  This kmem_cache requires 256-byte alignment, but the current
      merging code doesn't honor that and merges it with kmalloc-256.
      kmalloc-256 requires only cacheline-size alignment, so the above failure
      occurs.  On x86, kmalloc-256 happens to be aligned to 256 bytes, so the
      problem didn't show up there.

      To fix this problem, this patch introduces an alignment mismatch check
      in find_mergeable().
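
      A minimal sketch of what such an alignment mismatch check could look
      like is given below.  The function name and parameters are hypothetical
      and the real check in find_mergeable() may differ in detail; the idea is
      only that a merge candidate must guarantee at least the alignment the
      new cache asks for.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel function.
 * required_align: alignment the new cache asks for (0 = no requirement).
 * existing_align: alignment the merge candidate guarantees.
 */
bool align_compatible(size_t required_align, size_t existing_align)
{
    if (required_align == 0)
        return true;
    if (existing_align == 0)
        return false;           /* also avoids a modulo by zero below */

    /*
     * The candidate must be at least as strictly aligned and its
     * alignment must be a multiple of the requested one, otherwise
     * objects could end up misaligned for the new user.
     */
    return existing_align >= required_align &&
           existing_align % required_align == 0;
}
```

      Assuming a 32-byte cacheline, align_compatible(256, 32) is false, so a
      cache that needs 256-byte alignment would not be merged with a cache
      that only guarantees cacheline alignment.
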
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Markos Chandras <Markos.Chandras@imgtec.com>
      Tested-by: Markos Chandras <Markos.Chandras@imgtec.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95069ac8
    • Joonsoo Kim's avatar
      mm/page_alloc: restrict max order of merging on isolated pageblock · 3c605096
      Joonsoo Kim authored
      
      
      The current pageblock isolation logic can isolate each pageblock
      individually.  This causes a freepage accounting problem if a freepage
      of pageblock order on an isolated pageblock is merged with another
      freepage on a normal pageblock.  We can prevent the merge by restricting
      the maximum merge order to pageblock order if the freepage is on an
      isolated pageblock.

      A side effect of this change is that non-merged buddy freepages can
      remain even after pageblock isolation finishes, because undoing
      pageblock isolation just moves freepages from the isolate buddy list to
      the normal buddy list without considering merging.  So the patch also
      makes undoing pageblock isolation consider freepage merging.  On
      un-isolation, a freepage with more than pageblock order and its buddy
      are checked.  If they are on a normal pageblock, instead of just moving
      the freepage, we isolate it and free it so that it gets merged.
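
      The order restriction can be sketched as follows.  This is a simplified
      model under stated assumptions: DEMO_MAX_ORDER, DEMO_PAGEBLOCK_ORDER and
      both function names are hypothetical, and the number of free buddy
      levels is passed in directly instead of being read from a buddy list.

```c
#include <stdbool.h>

#define DEMO_MAX_ORDER        11U   /* assumed allocator limit */
#define DEMO_PAGEBLOCK_ORDER   9U   /* assumed pageblock order */

/* Upper bound (exclusive) on the order a merge may reach. */
unsigned int merge_order_limit(bool on_isolated_pageblock)
{
    unsigned int max_order = DEMO_MAX_ORDER;

    /* On an isolated pageblock, never merge beyond pageblock order. */
    if (on_isolated_pageblock && DEMO_PAGEBLOCK_ORDER + 1 < max_order)
        max_order = DEMO_PAGEBLOCK_ORDER + 1;
    return max_order;
}

/*
 * Order of the freepage after merging.  free_buddy_levels is how many
 * consecutive higher-order buddies are free and could be merged with.
 */
unsigned int merged_order(unsigned int order, unsigned int free_buddy_levels,
                          bool on_isolated_pageblock)
{
    unsigned int max_order = merge_order_limit(on_isolated_pageblock);

    while (order < max_order - 1 && free_buddy_levels > 0) {
        order++;
        free_buddy_levels--;
    }
    return order;
}
```

      With these assumed values, a page on an isolated pageblock stops merging
      at order 9 even if every buddy is free, while a page on a normal
      pageblock can still reach order 10.
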
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c605096
    • Joonsoo Kim's avatar
      mm/page_alloc: move freepage counting logic to __free_one_page() · 8f82b55d
      Joonsoo Kim authored
      
      
      All callers of __free_one_page() have similar freepage counting logic,
      so we can move it into __free_one_page().  This reduces the lines of
      code and helps future maintenance.

      This is also a preparation step for "mm/page_alloc: restrict max order
      of merging on isolated pageblock", which fixes the freepage counting
      problem on freepages with more than pageblock order.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f82b55d
    • Joonsoo Kim's avatar
      mm/page_alloc: add freepage on isolate pageblock to correct buddy list · 51bb1a40
      Joonsoo Kim authored
      
      
      In free_pcppages_bulk(), we use the cached migratetype of a freepage to
      determine the type of buddy list the freepage will be added to.  This
      information is stored when the freepage is added to the pcp list, so if
      isolation of this freepage's pageblock begins after it is stored, the
      cached information could be stale.  In other words, it holds the
      original migratetype rather than MIGRATE_ISOLATE.

      There are two problems caused by this stale information.

      One is that we can't keep these freepages from being allocated.
      Although the pageblock is isolated, the freepage will be added to the
      normal buddy list, so it could be allocated without any restriction.
      The other problem is incorrect freepage accounting.  Freepages on an
      isolated pageblock should not be counted in the number of freepages.

      The following is the code snippet in free_pcppages_bulk().
      
          /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
          __free_one_page(page, page_to_pfn(page), zone, 0, mt);
          trace_mm_page_pcpu_drain(page, 0, mt);
          if (likely(!is_migrate_isolate_page(page))) {
              __mod_zone_page_state(zone, NR_FREE_PAGES, 1);
              if (is_migrate_cma(mt))
                  __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
          }
      
      As the snippet above shows, the current code already handles the second
      problem, incorrect freepage accounting, by re-fetching the pageblock
      migratetype through is_migrate_isolate_page(page).

      But because this re-fetched information isn't used for
      __free_one_page(), the first problem is not solved.  This patch solves
      it by re-fetching the pageblock migratetype before __free_one_page() and
      using it for __free_one_page().

      In addition to moving the re-fetch up, this patch uses an optimization:
      the migratetype is re-fetched only if there is an isolated pageblock.
      Pageblock isolation is a rare event, so we can avoid re-fetching in the
      common case with this optimization.

      This patch also corrects the migratetype in the tracepoint output.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51bb1a40
    • Joonsoo Kim's avatar
      mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype · ad53f92e
      Joonsoo Kim authored
      
      
      Before describing the bugs themselves, I first explain the definition of
      a freepage.
      
       1. pages on buddy list are counted as freepage.
       2. pages on isolate migratetype buddy list are *not* counted as freepage.
       3. pages on cma buddy list are counted as CMA freepage, too.
      
      Now, I describe the problems and the related patches.

      Patch 1: There are race conditions in getting the pageblock migratetype
      that result in misplacement of freepages on the buddy list, an incorrect
      freepage count and unavailability of freepages.

      Patch 2: Freepages on the pcp list could have stale cached information
      for determining the migratetype of the buddy list to go to.  This causes
      misplacement of freepages on the buddy list and an incorrect freepage
      count.

      Patch 4: Merging between freepages on pageblocks of different
      migratetypes causes a freepage accounting problem.  This patch fixes it.

      Without patchset [3], the above problems don't happen in my CMA
      allocation test, because CMA reserved pages aren't used at all.  So
      there is no chance for the above race.

      With patchset [3], I did a simple CMA allocation test and got the result
      below:

       - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
       - run kernel build (make -j16) in the background
       - 30 CMA allocation attempts (8MB * 30 = 240MB) at 5 sec intervals
       - Result: more than 5000 freepage counts are missed

      With patchset [3] and this patchset, I found that no freepage counts are
      missed, so I conclude that the problems are solved.

      These problems also occur in my simple memory offlining test.
      
      This patch (of 4):
      
      There are two paths to reach the core free function of the buddy
      allocator, __free_one_page(): one is free_one_page()->__free_one_page()
      and the other is
      free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().  Each
      path has a race condition causing serious problems.  This patch focuses
      on the first type of freepath.  The following patch will solve the
      problem in the second type of freepath.

      In the first type of freepath, we get the migratetype of the freeing
      page without holding the zone lock, so it could be racy.  There are two
      cases of this race.

       1. pages are added to isolate buddy list after restoring original
          migratetype
      
          CPU1                                   CPU2
      
          get migratetype => return MIGRATE_ISOLATE
          call free_one_page() with MIGRATE_ISOLATE
      
                                      grab the zone lock
                                      unisolate pageblock
                                      release the zone lock
      
          grab the zone lock
          call __free_one_page() with MIGRATE_ISOLATE
          freepage go into isolate buddy list,
          although pageblock is already unisolated
      
      This may cause two problems.  One is that we can't use this page anymore
      until the next isolation attempt on this pageblock, because the freepage
      is on the isolate buddy list.  The other is that freepage accounting
      could be wrong due to merging between different buddy lists.  Freepages
      on the isolate buddy list aren't counted as freepages, but ones on the
      normal buddy list are.  If a merge happens, the buddy freepage on the
      normal buddy list is inevitably moved to the isolate buddy list without
      any consideration of freepage accounting, so the count could be
      incorrect.

       2. pages are added to normal buddy list while pageblock is isolated.
          It is similar to the above case.

      This also may cause two problems.  One is that we can't keep these
      freepages from being allocated.  Although the pageblock is isolated,
      the freepage would be added to the normal buddy list, so it could be
      allocated without any restriction.  The other problem is the same as in
      case 1, that is, incorrect freepage accounting.

      This race condition can be prevented by checking the migratetype again
      while holding the zone lock.  Because this is a somewhat heavy operation
      and isn't needed in the common case, we want to avoid rechecking as much
      as possible.  So this patch introduces a new variable,
      nr_isolate_pageblock, in struct zone to check whether there is an
      isolated pageblock.  With this, we can avoid re-checking the migratetype
      in the common case and do it only if there is an isolated pageblock or
      the migratetype is MIGRATE_ISOLATE.  This solves the above-mentioned
      problems.
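
      The recheck described above can be modeled in a few lines.  This is a
      sketch under simplifying assumptions, not the kernel implementation:
      struct zone_model, its pageblock_mt field and free_one_page_model() are
      hypothetical, only one pageblock is modeled, and the zone lock is
      assumed to be held by the caller.

```c
#include <stdbool.h>

enum demo_migratetype { DEMO_MIGRATE_MOVABLE, DEMO_MIGRATE_ISOLATE };

/* Hypothetical stand-in for struct zone with a single pageblock. */
struct zone_model {
    unsigned long nr_isolate_pageblock;   /* counter named in the text */
    enum demo_migratetype pageblock_mt;   /* current type of the pageblock */
};

bool has_isolate_pageblock(const struct zone_model *zone)
{
    return zone->nr_isolate_pageblock > 0;
}

/*
 * Returns the buddy list type the page is placed on.  mt is the
 * migratetype that was read earlier without the zone lock and may be
 * stale; it is re-read only in the rare cases that need it.
 */
enum demo_migratetype free_one_page_model(const struct zone_model *zone,
                                          enum demo_migratetype mt)
{
    if (has_isolate_pageblock(zone) || mt == DEMO_MIGRATE_ISOLATE)
        mt = zone->pageblock_mt;          /* recheck under the lock */
    return mt;
}
```

      In case 1 the stale value is DEMO_MIGRATE_ISOLATE and the pageblock has
      already been unisolated, so the recheck returns the normal type.  In
      case 2 the counter is nonzero, so the stale normal type is replaced by
      DEMO_MIGRATE_ISOLATE.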
      
      Changes from v3:
      Add one more check in free_one_page() that checks whether migratetype is
      MIGRATE_ISOLATE or not.  Without this, the above-mentioned case 1 could
      happen.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Heesub Shin <heesub.shin@samsung.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Ritesh Harjani <ritesh.list@gmail.com>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad53f92e
    • Joonsoo Kim's avatar
      mm/compaction: skip the range until proper target pageblock is met · 58420016
      Joonsoo Kim authored

      Commit 7d49d886 ("mm, compaction: reduce zone checking frequency in
      the migration scanner") has a side-effect that changes the iteration
      range calculation.  Before the change, block_end_pfn is calculated using
      start_pfn, but now it blindly adds pageblock_nr_pages to the previous
      value.
      
      This causes the problem that isolation_start_pfn is larger than
      block_end_pfn when we isolate the page with more than pageblock order.
      In this case, isolation would fail due to an invalid range parameter.
      
      To prevent this, this patch implements skipping the range until a proper
      target pageblock is met.  Without this patch, CMA with more than
      pageblock order always fails but with this patch it will succeed.
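
      The range adjustment can be sketched as below.  It is an illustration
      under assumptions rather than the patched kernel function:
      next_block_end() and the 512-page pageblock size are hypothetical, and
      block_end_pfn is taken to be the value after pageblock_nr_pages has
      already been added to it.

```c
/* Assumed pageblock size in pages (a power of two). */
#define DEMO_PAGEBLOCK_NR_PAGES 512UL

/*
 * Returns the end of the next range to scan.
 * pfn:           first pfn still to be scanned
 * block_end_pfn: previous block end plus one pageblock
 * end_pfn:       end of the whole range
 */
unsigned long next_block_end(unsigned long pfn, unsigned long block_end_pfn,
                             unsigned long end_pfn)
{
    /*
     * If a freepage larger than a pageblock was isolated, pfn may have
     * passed block_end_pfn.  Skip ahead to the end of the pageblock
     * that contains pfn so that the range is valid again.
     */
    if (pfn >= block_end_pfn)
        block_end_pfn = (pfn + DEMO_PAGEBLOCK_NR_PAGES) &
                        ~(DEMO_PAGEBLOCK_NR_PAGES - 1);

    /* Never scan past the end of the requested range. */
    if (block_end_pfn > end_pfn)
        block_end_pfn = end_pfn;
    return block_end_pfn;
}
```

      For example, after isolating a 1024-page freepage that started at pfn
      1024 with a block end of 1536, pfn becomes 2048 and the blindly advanced
      block end is also 2048, which is an empty range.  The adjustment moves
      the block end to 2560.
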
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58420016
  5. 13 Nov, 2014 1 commit
    • Paul Mackerras's avatar
      Fix thinko in iov_iter_single_seg_count · ad0eab92
      Paul Mackerras authored
      
      
      The branches of the if (i->type & ITER_BVEC) statement in
      iov_iter_single_seg_count() are the wrong way around; if ITER_BVEC is
      clear then we use i->bvec, when we should be using i->iov.  This fixes
      it.
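
      The corrected branch order can be illustrated with a reduced model.
      The struct layout, the demo_ prefixed names and the flag value below
      are assumptions for illustration (separate pointers are used where the
      real iterator may share storage); only the direction of the
      ITER_BVEC test is taken from the description above.

```c
#include <stddef.h>

#define DEMO_ITER_BVEC 0x4u               /* hypothetical flag value */

struct demo_iovec { size_t iov_len; };
struct demo_bvec  { size_t bv_len; };

struct demo_iter {
    unsigned int type;
    size_t iov_offset;                    /* offset into current segment */
    size_t count;                         /* total bytes remaining */
    unsigned long nr_segs;
    const struct demo_iovec *iov;         /* valid when DEMO_ITER_BVEC is clear */
    const struct demo_bvec *bvec;         /* valid when DEMO_ITER_BVEC is set */
};

static size_t demo_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Bytes left in the current segment, capped by the total count. */
size_t demo_single_seg_count(const struct demo_iter *i)
{
    if (i->nr_segs == 1)
        return i->count;
    if (i->type & DEMO_ITER_BVEC)         /* flag set: use the bvec array */
        return demo_min(i->count, i->bvec->bv_len - i->iov_offset);
    /* flag clear: use the iovec array */
    return demo_min(i->count, i->iov->iov_len - i->iov_offset);
}
```

      For a two-segment iovec iterator with a 100-byte first segment, an
      offset of 40 and 250 bytes remaining, the function returns 60 rather
      than the full count, which is what lets a retry loop shrink its request.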
      
      In my case, the symptom that this caused was that a KVM guest doing
      filesystem operations on a virtual disk would result in one of qemu's
      threads on the host going into an infinite loop in
      generic_perform_write().  The loop would hit the copied == 0 case and
      call iov_iter_single_seg_count() to reduce the number of bytes to try
      to process, but because of the error, iov_iter_single_seg_count()
      would just return i->count and the loop made no progress and continued
      forever.
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ad0eab92
  6. 06 Nov, 2014 1 commit
  7. 29 Oct, 2014 11 commits
    • Jan Kara's avatar
      mm: Remove false WARN_ON from pagecache_isize_extended() · f55fefd1
      Jan Kara authored
      
      
      The WARN_ON checking whether i_mutex is held in
      pagecache_isize_extended() was wrong because some filesystems (e.g.
      XFS) use different locks for serialization of truncates / writes. So
      just remove the check.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f55fefd1
    • Konstantin Khlebnikov's avatar
      mm/balloon_compaction: fix deflation when compaction is disabled · 4d88e6f7
      Konstantin Khlebnikov authored

      If CONFIG_BALLOON_COMPACTION=n, balloon_page_insert() does not link
      pages with the balloon and doesn't set the PagePrivate flag.  As a
      result, balloon_page_dequeue() cannot get any pages because it thinks
      that all of them are isolated.  Without balloon compaction nobody can
      isolate ballooned pages, so it's safe to remove this check.
      
      Fixes: d6d86c0a ("mm/balloon_compaction: redesign ballooned pages management")
      Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
      Reported-by: Matt Mullins <mmullins@mmlx.us>
      Cc: <stable@vger.kernel.org>	[3.17]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d88e6f7
    • Mikulas Patocka's avatar
      mm/slab_common: don't check for duplicate cache names · 8aba7e0a
      Mikulas Patocka authored

      The SLUB cache merges caches with the same size and alignment and there
      was a long-standing bug with this behavior:
      
       - create the cache named "foo"
       - create the cache named "bar" (which is merged with "foo")
       - delete the cache named "foo" (but it stays allocated because "bar"
         uses it)
       - create the cache named "foo" again - it fails because the name "foo"
         is already used
      
      That bug was fixed in commit 69461747 ("slab_common: fix the check
      for duplicate slab names") by not warning on duplicate cache names when
      the SLUB subsystem is used.
      
      Recently, cache merging was implemented with the SLAB subsystem too, in
      commit 12220dea ("mm/slab: support slab merge").  Therefore we need to
      stop checking for duplicate names even for the SLAB subsystem.
      
      This patch fixes the bug by removing the check.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aba7e0a
    • Johannes Weiner's avatar
      mm: rmap: split out page_remove_file_rmap() · 8186eb6a
      Johannes Weiner authored
      
      
      page_remove_rmap() has too many branches on PageAnon() and is hard to
      follow.  Move the file part into a separate function.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8186eb6a
    • Johannes Weiner's avatar
      mm: memcontrol: fix missed end-writeback page accounting · d7365e78
      Johannes Weiner authored

      Commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") changed page
      migration to uncharge the old page right away.  The page is locked,
      unmapped, truncated, and off the LRU, but it could race with writeback
      ending, which then doesn't unaccount the page properly:
      
      test_clear_page_writeback()              migration
                                                 wait_on_page_writeback()
        TestClearPageWriteback()
                                                 mem_cgroup_migrate()
                                                   clear PCG_USED
        mem_cgroup_update_page_stat()
          if (PageCgroupUsed(pc))
            decrease memcg pages under writeback
      
        release pc->mem_cgroup->move_lock
      
      The per-page statistics interface is heavily optimized to avoid a
      function call and a lookup_page_cgroup() in the file unmap fast path,
      which means it doesn't verify whether a page is still charged before
      clearing PageWriteback() and it has to do it in the stat update later.
      
      Rework it so that it looks up the page's memcg once at the beginning of
      the transaction and then uses it throughout.  The charge will be
      verified before clearing PageWriteback() and migration can't uncharge
      the page as long as that is still set.  The RCU lock will protect the
      memcg past uncharge.
      
      As far as losing the optimization goes, the following test results are
      from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
      three times in a nested fashion, so that there are two negative passes
      that don't account but still go through the new transaction overhead.
      There is no actual difference:
      
       old:     33.195102545 seconds time elapsed       ( +-  0.01% )
       new:     33.199231369 seconds time elapsed       ( +-  0.03% )
      
      The time spent in page_remove_rmap()'s callees still adds up to the
      same, but the time spent in the function itself seems reduced:
      
           # Children      Self  Command        Shared Object       Symbol
       old:     0.12%     0.11%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
       new:     0.12%     0.08%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7365e78
    • Johannes Weiner's avatar
      mm: page-writeback: inline account_page_dirtied() into single caller · 3a3c02ec
      Johannes Weiner authored
      
      
      A follow-up patch would have changed the call signature.  To save the
      trouble, just fold it instead.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a3c02ec
    • Yasuaki Ishimatsu's avatar
      memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() · 35dca71c
      Yasuaki Ishimatsu authored
      
      
      When hot adding the same memory after hot removal, the following
      messages are shown:
      
        WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
        ...
        Call Trace:
          dump_stack+0x46/0x58
          warn_slowpath_common+0x81/0xa0
          warn_slowpath_null+0x1a/0x20
          free_area_init_node+0x3fe/0x426
          hotadd_new_pgdat+0x90/0x110
          add_memory+0xd4/0x200
          acpi_memory_device_add+0x1aa/0x289
          acpi_bus_attach+0xfd/0x204
          acpi_bus_attach+0x178/0x204
          acpi_bus_scan+0x6a/0x90
          acpi_device_hotplug+0xe8/0x418
          acpi_hotplug_work_fn+0x1f/0x2b
          process_one_work+0x14e/0x3f0
          worker_thread+0x11b/0x510
          kthread+0xe1/0x100
          ret_from_fork+0x7c/0xb0
      
      The detailed explanation is as follows:

      When hot removing memory, pgdat is set to 0 in try_offline_node().  But
      if the pgdat is allocated by the bootmem allocator, the clearing step is
      skipped.

      And when hot adding the same memory, the uninitialized pgdat is reused.
      But free_area_init_node() checks whether pgdat is set to zero.  As a
      result, free_area_init_node() hits WARN_ON().
      
      This patch clears pgdat which is allocated by bootmem allocator in
      try_offline_node().
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35dca71c
    • David Rientjes's avatar
      mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      David Rientjes authored
      
      
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_madvise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Suleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
    • Yu Zhao's avatar
      mm: free compound page with correct order · 5ddacbe9
      Yu Zhao authored

      A compound page should be freed by put_page() or by free_pages() with
      the correct order.  Not doing so will leak the tail pages.

      The compound order can be obtained with compound_order() or, in our
      case, by using HPAGE_PMD_ORDER.  Some people would argue the latter is
      faster but I prefer the former, which is more general.

      This bug was observed not just on our servers (the worst case we saw is
      11G leaked on a 48G machine) but also on our workstations running an
      Ubuntu based distro.
      
        $ cat /proc/vmstat  | grep thp_zero_page_alloc
        thp_zero_page_alloc 55
        thp_zero_page_alloc_failed 0
      
      This means (thp_zero_page_alloc - 1) * (2M - 4K) of memory has leaked.
      
      Fixes: 97ae1749 ("thp: implement refcounting for huge zero page")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ddacbe9
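
The leak estimate in the commit above, (thp_zero_page_alloc - 1) * (2M - 4K), can be worked through for the quoted counter value of 55. The helper below is a sketch that assumes 2 MiB huge pages and 4 KiB base pages, as in the commit message; the function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the "2M" and "4K" in the commit message. */
#define SK_HUGE_PAGE_BYTES (2ULL * 1024 * 1024)
#define SK_BASE_PAGE_BYTES (4ULL * 1024)

/*
 * Estimated bytes leaked for a given thp_zero_page_alloc counter value,
 * following the formula from the commit message: every allocation except
 * one leaks a huge page minus one base page.
 */
static uint64_t leaked_bytes(uint64_t thp_zero_page_alloc)
{
	assert(thp_zero_page_alloc >= 1);
	return (thp_zero_page_alloc - 1) *
	       (SK_HUGE_PAGE_BYTES - SK_BASE_PAGE_BYTES);
}
```

For thp_zero_page_alloc = 55 this gives 54 * 2,093,056 = 113,025,024 bytes, or roughly 107.8 MiB.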
    • mm/compaction.c: avoid premature range skip in isolate_migratepages_range · 6ea41c0c
      Joonsoo Kim authored
      Commit edc2ca61 ("mm, compaction: move pageblock checks up from
      isolate_migratepages_range()") commonized the isolate_migratepages
      variants and made them use isolate_migratepages_block().
      
      isolate_migratepages_block() can stop early once enough pages are
      isolated, but isolate_migratepages_range() has no code to handle this
      case.  As a result, even if isolate_migratepages_block() returns
      prematurely without checking all pages in the range, it is called again
      on the following pageblock, and some pages in the previous range are
      never checked.  CMA then fails frequently.
      
      To fix this problem, this patch lets isolate_migratepages_range()
      detect that enough pages have been isolated and stop the isolation in
      that case.
      
      Note that isolate_migratepages() has no such problem, because it always
      stops the isolation after just one call to isolate_migratepages_block().

      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ea41c0c
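
The premature range skip described in the compaction commit above can be modelled in user space. This is a simplified sketch, not the kernel code: the pageblock size, the isolation budget, `struct scan_state`, and both function names are hypothetical, and every page is treated as isolatable.

```c
#include <assert.h>
#include <stddef.h>

#define SK_BLOCK_PAGES 8	/* hypothetical pageblock size */
#define SK_BATCH_MAX   4	/* hypothetical isolation budget */

struct scan_state {
	size_t isolated;	/* pages isolated so far */
	size_t skipped;		/* pages in the range that were never examined */
};

/*
 * Examine [start, end) but stop once the budget is reached.
 * Returns the first pfn that was NOT examined.
 */
static size_t isolate_block_sketch(struct scan_state *st,
				   size_t start, size_t end)
{
	size_t pfn = start;

	assert(st != NULL && start <= end);
	while (pfn < end && st->isolated < SK_BATCH_MAX) {
		st->isolated++;
		pfn++;
	}
	return pfn;
}

/* Walk the range one pageblock at a time; returns where scanning stopped. */
static size_t isolate_range_sketch(struct scan_state *st,
				   size_t start, size_t end, int fixed)
{
	size_t pfn = start;

	while (pfn < end) {
		size_t block_end = pfn + SK_BLOCK_PAGES;
		size_t stop;

		if (block_end > end)
			block_end = end;
		stop = isolate_block_sketch(st, pfn, block_end);

		/* Fixed shape: notice the early stop and report its position. */
		if (fixed && st->isolated >= SK_BATCH_MAX)
			return stop;

		/* Buggy shape: the unexamined rest of the block is lost. */
		st->skipped += block_end - stop;
		pfn = block_end;
	}
	return pfn;
}
```

Over a 16-page range, the buggy walk in this model reports the whole range as scanned while 12 pages were never examined; the fixed walk stops at pfn 4, so a caller can handle the isolated batch and resume from there.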
    • cgroup/kmemleak: add kmemleak_free() for cgroup deallocations. · 401507d6
      Wang Nan authored
      Commit ff7ee93f ("cgroup/kmemleak: Annotate alloc_page() for cgroup
      allocations") introduced kmemleak_alloc() for alloc_page_cgroup(), but
      the corresponding kmemleak_free() is missing, which causes kmemleak to
      be wrongly disabled after memory offlining.  The log is pasted at the
      end of this commit message.
      
      This patch adds kmemleak_free() to free_page_cgroup().  During page
      offlining, it removes the corresponding entries from the kmemleak
      rbtree.  After that, the freed memory can be allocated again by other
      subsystems without killing kmemleak.
      
        bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak
      
        Offlined Pages 32768
        kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
        CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x46/0x58
          create_object+0x266/0x2c0
          kmemleak_alloc+0x26/0x50
          kmem_cache_alloc+0xd3/0x160
          __sigqueue_alloc+0x49/0xd0
          __send_signal+0xcb/0x410
          send_signal+0x45/0x90
          __group_send_sig_info+0x13/0x20
          do_notify_parent+0x1bb/0x260
          do_exit+0x767/0xa40
          do_group_exit+0x44/0xa0
          SyS_exit_group+0x17/0x20
          system_call_fastpath+0x16/0x1b
      
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xffff880016900000 (size 524288):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294667296
        kmemleak:   min_count = 0
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
              log_early+0x63/0x77
              kmemleak_alloc+0x4b/0x50
              init_section_page_cgroup+0x7f/0xf5
              page_cgroup_init+0xc5/0xd0
              start_kernel+0x333/0x408
              x86_64_start_reservations+0x2a/0x2c
              x86_64_start_kernel+0xf5/0xfc
      
      Fixes: ff7ee93f (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      401507d6
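
The failure in the kmemleak commit above comes down to a tracked object that is never unregistered, so a later allocation in the same address range overlaps it. The toy tracker below sketches that behaviour under stated assumptions: it uses a flat array instead of an rbtree, and `struct tracker_sketch`, `track_alloc()`, and `track_free()` are hypothetical names, not the kmemleak API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_OBJECTS 16

struct tracker_sketch {
	struct {
		uint64_t start;
		uint64_t size;
		bool used;
	} obj[SK_MAX_OBJECTS];
	bool disabled;	/* set once an overlapping insert is seen */
};

/* Register [start, start + size); an overlap disables the tracker. */
static void track_alloc(struct tracker_sketch *t, uint64_t start, uint64_t size)
{
	size_t i, slot = SK_MAX_OBJECTS;

	assert(t != NULL && size > 0);
	if (t->disabled)
		return;
	for (i = 0; i < SK_MAX_OBJECTS; i++) {
		if (!t->obj[i].used) {
			if (slot == SK_MAX_OBJECTS)
				slot = i;
			continue;
		}
		if (start < t->obj[i].start + t->obj[i].size &&
		    t->obj[i].start < start + size) {
			t->disabled = true;	/* "overlaps existing" */
			return;
		}
	}
	assert(slot < SK_MAX_OBJECTS);	/* the toy table must not fill up */
	t->obj[slot].start = start;
	t->obj[slot].size = size;
	t->obj[slot].used = true;
}

/* Unregister the object that begins at start, if it is tracked. */
static void track_free(struct tracker_sketch *t, uint64_t start)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < SK_MAX_OBJECTS; i++)
		if (t->obj[i].used && t->obj[i].start == start)
			t->obj[i].used = false;
}
```

With the addresses from the log (a 524288-byte object at 0xffff880016900000 and a later insert at 0xffff880016969000), the tracker in this model is disabled unless `track_free()` is called for the first object before the second one is registered.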
  8. 28 Oct, 2014 1 commit
    • zap_pte_range: update addr when forcing flush after TLB batching failure · ce9ec37b
      Will Deacon authored
      
      
      When unmapping a range of pages in zap_pte_range, the page being
      unmapped is added to an mmu_gather_batch structure for asynchronous
      freeing. If we run out of space in the batch structure before the range
      has been completely unmapped, then we break out of the loop, force a
      TLB flush and free the pages that we have batched so far. If there are
      further pages to unmap, then we resume the loop where we left off.
      
      Unfortunately, we forget to update addr when we break out of the loop,
      which causes us to truncate the range being invalidated as the end
      address is exclusive. When we re-enter the loop at the same address, the
      page has already been freed and the pte_present test will fail, meaning
      that we do not reconsider the address for invalidation.
      
      This patch fixes the problem by incrementing addr by PAGE_SIZE before
      breaking out of the loop on batch failure.

      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce9ec37b
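
The off-by-one in the zap_pte_range commit above can be reproduced in a small model in which addresses are page indices, the batch holds two pages, and each forced flush covers the range up to, but not including, addr. This is a sketch only: the constants and `zap_range_sketch()` are hypothetical, and the model counts pages that are freed before their address has been flushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NPAGES 6	/* pages in the range being unmapped */
#define SK_BATCH  2	/* hypothetical batch capacity */

/*
 * Unmap SK_NPAGES pages with a bounded batch.  Returns how many pages
 * were freed before their address had been included in a TLB flush.
 */
static size_t zap_range_sketch(bool fixed)
{
	bool present[SK_NPAGES];
	bool flushed[SK_NPAGES] = { false };
	size_t batch[SK_BATCH];
	size_t nbatch, stale = 0, addr = 0, flush_start = 0, i;

	for (i = 0; i < SK_NPAGES; i++)
		present[i] = true;

	while (addr < SK_NPAGES) {
		nbatch = 0;
		for (; addr < SK_NPAGES; addr++) {
			if (!present[addr])
				continue;	/* already unmapped, not reconsidered */
			present[addr] = false;
			assert(nbatch < SK_BATCH);
			batch[nbatch++] = addr;
			if (nbatch == SK_BATCH) {
				if (fixed)
					addr++;	/* the fix: step past this page */
				break;		/* batch full, force a flush */
			}
		}

		/* Flush [flush_start, addr); the end address is exclusive. */
		for (i = flush_start; i < addr; i++)
			flushed[i] = true;
		flush_start = addr;

		/* Free the batched pages. */
		for (i = 0; i < nbatch; i++)
			if (!flushed[batch[i]])
				stale++;
	}
	return stale;
}
```

In this model the buggy loop frees three of the six pages before their addresses are flushed (the last page of each full batch), while the fixed loop frees none early.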