1. 23 Feb, 2017 8 commits
    • Tejun Heo's avatar
      slub: make sysfs directories for memcg sub-caches optional · 1663f26d
      Tejun Heo authored
      SLUB creates a per-cache directory under /sys/kernel/slab which hosts a
      bunch of debug files.  Usually, there aren't that many caches on a
      system and this doesn't really matter; however, if memcg is in use, each
      cache can have per-cgroup sub-caches.  SLUB creates the same directories
      for these sub-caches under /sys/kernel/slab/$CACHE/cgroup.
      
      Unfortunately, because there can be a lot of cgroups, active or
      draining, the product of the numbers of caches, cgroups and files in
      each directory can reach a very high number - hundreds of thousands is
      commonplace.  Millions and beyond aren't difficult to reach either.
      
      What's under /sys/kernel/slab is primarily for debugging and the
      information and control on the a root cache already cover its
      sub-caches.  While having a separate directory for each sub-cache can be
      helpful for development, it doesn't make much sense to pay this amount
      of overhead by default.
      
      This patch introduces a boot parameter slub_memcg_sysfs which determines
      whether to create sysfs directories for per-memcg sub-caches.  It also
      adds CONFIG_SLUB_MEMCG_SYSFS_ON which determines the boot parameter's
      default value and defaults to 0.
      
      [akpm@linux-foundation.org: kset_unregister(NULL) is legal]
      Link: http://lkml.kernel.org/r/20170204145203.GB26958@mtj.duckdns.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1663f26d
    • Tejun Heo's avatar
      slab: remove slub sysfs interface files early for empty memcg caches · 50862ce7
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      Each cache has a number of sysfs interface files under /sys/kernel/slab.
      On a system with a lot of memory and transient memcgs, the number of
      interface files which have to be removed once memory reclaim kicks in
      can reach millions.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-10-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50862ce7
    • Tejun Heo's avatar
      slab: remove synchronous synchronize_sched() from memcg cache deactivation path · 01fb58bc
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slub uses synchronize_sched() to deactivate a memcg cache.
      synchronize_sched() is an expensive and slow operation and doesn't scale
      when a huge number of caches are destroyed back-to-back.  While there
      used to be a simple batching mechanism, the batching was too restricted
      to be helpful.
      
      This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
      can use to schedule sched RCU callback instead of performing
      synchronize_sched() synchronously while holding cgroup_mutex.  While
      this adds online cpus, mems and slab_mutex operations, operating on
      these locks back-to-back from the same kworker, which is what's gonna
      happen when there are many to deactivate, isn't expensive at all and
      this gets rid of the scalability problem completely.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01fb58bc
    • Tejun Heo's avatar
      slab: introduce __kmemcg_cache_deactivate() · c9fc5864
      Tejun Heo authored
      __kmem_cache_shrink() is called with %true @deactivate only for memcg
      caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
      __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
      should implement it and it should deactivate and drain the cache.
      
      This is to allow memcg cache deactivation behavior to further deviate
      from simple shrinking without messing up __kmem_cache_shrink().
      
      This is pure reorganization and doesn't introduce any observable
      behavior changes.
      
      v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9fc5864
    • Tejun Heo's avatar
      slab: implement slab_root_caches list · 510ded33
      Tejun Heo authored
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      slab_caches currently lists all caches including root and memcg ones.
      This is the only data structure which lists the root caches and
      iterating root caches can only be done by walking the list while
      skipping over memcg caches.  As there can be a huge number of memcg
      caches, this can become very expensive.
      
      This also can make /proc/slabinfo behave very badly.  seq_file processes
      reads in 4k chunks and seeks to the previous Nth position on slab_caches
      list to resume after each chunk.  With a lot of memcg cache churns on
      the list, reading /proc/slabinfo can become very slow and its content
      often ends up with duplicate and/or missing entries.
      
      This patch adds a new list slab_root_caches which lists only the root
      caches.  When memcg is not enabled, it becomes just an alias of
      slab_caches.  memcg specific list operations are collected into
      memcg_[un]link_cache().
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Acked-by: default avatarVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      510ded33
    • Tejun Heo's avatar
      slub: separate out sysfs_slab_release() from sysfs_slab_remove() · bf5eb3de
      Tejun Heo authored
      Separate out slub sysfs removal and release, and call the former earlier
      from __kmem_cache_shutdown().  There's no reason to defer sysfs removal
      through RCU and this will later allow us to remove sysfs files way
      earlier during memory cgroup offline instead of release.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf5eb3de
    • Tejun Heo's avatar
      Revert "slub: move synchronize_sched out of slab_mutex on shrink" · 290b6a58
      Tejun Heo authored
      Patch series "slab: make memcg slab destruction scalable", v3.
      
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.
      
      I've seen machines which end up with hundred thousands of caches and
      many millions of kernfs_nodes.  The current code is O(N^2) on the total
      number of caches and has synchronous rcu_barrier() and
      synchronize_sched() in cgroup offline / release path which is executed
      while holding cgroup_mutex.  Combined, this leads to very expensive and
      slow cache destruction operations which can easily keep running for half
      a day.
      
      This also messes up /proc/slabinfo along with other cache iterating
      operations.  seq_file operates on 4k chunks and on each 4k boundary
      tries to seek to the last position in the list.  With a huge number of
      caches on the list, this becomes very slow and very prone to the list
      content changing underneath it leading to a lot of missing and/or
      duplicate entries.
      
      This patchset addresses the scalability problem.
      
      * Add root and per-memcg lists.  Update each user to use the
        appropriate list.
      
      * Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
        and asynchronous.
      
      * For dying empty slub caches, remove the sysfs files after
        deactivation so that we don't end up with millions of sysfs files
        without any useful information on them.
      
      This patchset contains the following nine patches.
      
       0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
       0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
       0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
       0004-slab-reorganize-memcg_cache_params.patch
       0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
       0006-slab-implement-slab_root_caches-list.patch
       0007-slab-introduce-__kmemcg_cache_deactivate.patch
       0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
       0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
       0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
      
      0001 reverts an existing optimization to prepare for the following
      changes.  0002 is a prep patch.  0003 makes rcu_barrier() in release
      path batched and asynchronous.  0004-0006 separate out the lists.
      0007-0008 replace synchronize_sched() in slub destruction path with
      call_rcu_sched().  0009 removes sysfs files early for empty dying
      caches.  0010 makes destruction work items use a workqueue with limited
      concurrency.
      
      This patch (of 10):
      
      Revert 89e364db ("slub: move synchronize_sched out of slab_mutex on
      shrink").
      
      With kmem cgroup support enabled, kmem_caches can be created and destroyed
      frequently and a great number of near empty kmem_caches can accumulate if
      there are a lot of transient cgroups and the system is not under memory
      pressure.  When memory reclaim starts under such conditions, it can lead
      to consecutive deactivation and destruction of many kmem_caches, easily
      hundreds of thousands on moderately large systems, exposing scalability
      issues in the current slab management code.  This is one of the patches to
      address the issue.
      
      Moving synchronize_sched() out of slab_mutex isn't enough as it's still
      inside cgroup_mutex.  The whole deactivation / release path will be
      updated to avoid all synchronous RCU operations.  Revert this insufficient
      optimization in preparation to ease future changes.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJay Vana <jsvana@fb.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      290b6a58
    • Borislav Petkov's avatar
      mm/slub: add a dump_stack() to the unexpected GFP check · 65b9de75
      Borislav Petkov authored
      We wish to know who is doing such a thing. slab.c does this.
      
      Link: http://lkml.kernel.org/r/20170116091643.15260-1-bp@alien8.de
      
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      65b9de75
  2. 08 Feb, 2017 1 commit
  3. 25 Jan, 2017 1 commit
  4. 13 Dec, 2016 2 commits
  5. 06 Sep, 2016 1 commit
  6. 10 Aug, 2016 1 commit
    • Chris Wilson's avatar
      mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock · 60398923
      Chris Wilson authored
      With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
      kmem_cache_node is destroyed the call_rcu() may trigger a slab
      allocation to fill the debug object pool (__debug_object_init:fill_pool).
      
      Everywhere but during kmem_cache_destroy(), discard_slab() is performed
      outside of the kmem_cache_node->list_lock and avoids a lockdep warning
      about potential recursion:
      
        =============================================
        [ INFO: possible recursive locking detected ]
        4.8.0-rc1-gfxbench+ #1 Tainted: G     U
        ---------------------------------------------
        rmmod/8895 is trying to acquire lock:
         (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
      
        but task is already holding lock:
         (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
      
        other info that might help us debug this:
        Possible unsafe locking scenario:
              CPU0
              ----
         lock(&(&n->list_lock)->rlock);
         lock(&(&n->list_lock)->rlock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
         5 locks held by rmmod/8895:
         #0:  (&dev->mutex){......}, at: driver_detach+0x42/0xc0
         #1:  (&dev->mutex){......}, at: driver_detach+0x50/0xc0
         #2:  (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
         #3:  (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
         #4:  (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
      
        stack backtrace:
        CPU: 6 PID: 8895 Comm: rmmod Tainted: G     U          4.8.0-rc1-gfxbench+ #1
        Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
        Call Trace:
          __lock_acquire+0x1646/0x1ad0
          lock_acquire+0xb2/0x200
          _raw_spin_lock+0x36/0x50
          get_partial_node.isra.63+0x47/0x430
          ___slab_alloc.constprop.67+0x1a7/0x3b0
          __slab_alloc.isra.64.constprop.66+0x43/0x80
          kmem_cache_alloc+0x236/0x2d0
          __debug_object_init+0x2de/0x400
          debug_object_activate+0x109/0x1e0
          __call_rcu.constprop.63+0x32/0x2f0
          call_rcu+0x12/0x20
          discard_slab+0x3d/0x40
          __kmem_cache_shutdown+0xdb/0x320
          shutdown_cache+0x19/0x60
          kmem_cache_destroy+0x1ae/0x220
          i915_gem_load_cleanup+0x14/0x40 [i915]
          i915_driver_unload+0x151/0x180 [i915]
          i915_pci_remove+0x14/0x20 [i915]
          pci_device_remove+0x34/0xb0
          __device_release_driver+0x95/0x140
          driver_detach+0xb6/0xc0
          bus_remove_driver+0x53/0xd0
          driver_unregister+0x27/0x50
          pci_unregister_driver+0x25/0x70
          i915_exit+0x1a/0x1e2 [i915]
          SyS_delete_module+0x193/0x1f0
          entry_SYSCALL_64_fastpath+0x1c/0xac
      
      Fixes: 52b4b950 ("mm: slab: free kmem_cache_node after destroy sysfs file")
      Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
      
      Reported-by: default avatarDave Gordon <david.s.gordon@intel.com>
      Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Gordon <david.s.gordon@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60398923
  7. 05 Aug, 2016 1 commit
  8. 02 Aug, 2016 1 commit
  9. 28 Jul, 2016 2 commits
  10. 26 Jul, 2016 5 commits
    • Vladimir Davydov's avatar
      mm: charge/uncharge kmemcg from generic page allocator paths · 4949148a
      Vladimir Davydov authored
      Currently, to charge a non-slab allocation to kmemcg one has to use
      alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
      this helper should finally be freed using free_kmem_pages, otherwise it
      won't be uncharged.
      
      This API suits its current users fine, but it turns out to be impossible
      to use along with page reference counting, i.e.  when an allocation is
      supposed to be freed with put_page, as it is the case with pipe or unix
      socket buffers.
      
      To overcome this limitation, this patch moves charging/uncharging to
      generic page allocator paths, i.e.  to __alloc_pages_nodemask and
      free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
      one can use any of the available page allocation functions to get the
      allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
      just like in case of kmalloc and friends.  A charged page will be
      automatically uncharged on free.
      
      To make it possible, we need to mark pages charged to kmemcg somehow.
      To avoid introducing a new page flag, we make use of page->_mapcount for
      marking such pages.  Since pages charged to kmemcg are not supposed to
      be mapped to userspace, it should work just fine.  There are other
      (ab)users of page->_mapcount - buddy and balloon pages - but we don't
      conflict with them.
      
      In case kmemcg is compiled out or not used at runtime, this patch
      introduces no overhead to generic page allocator paths.  If kmemcg is
      used, it will be plus one gfp flags check on alloc and plus one
      page->_mapcount check on free, which shouldn't hurt performance, because
      the data accessed are hot.
      
      Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4949148a
    • Michal Hocko's avatar
      slab: do not panic on invalid gfp_mask · 72baeef0
      Michal Hocko authored
      Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
      This is a rather harsh way to announce a non-critical issue.  Allocator
      is free to ignore invalid flags.  Let's simply replace BUG() by
      dump_stack to tell the offender and fixup the mask to move on with the
      allocation request.
      
      This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
      module:
      
        Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
        CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
          dump_stack+0x67/0x90
          cache_alloc_refill+0x201/0x617
          kmem_cache_alloc_trace+0xa7/0x24a
          ? 0xffffffffa0005000
          mymodule_init+0x20/0x1000 [test_slab]
          do_one_initcall+0xe7/0x16c
          ? rcu_read_lock_sched_held+0x61/0x69
          ? kmem_cache_alloc_trace+0x197/0x24a
          do_init_module+0x5f/0x1d9
          load_module+0x1a3d/0x1f21
          ? retint_kernel+0x2d/0x2d
          SyS_init_module+0xe8/0x10e
          ? SyS_init_module+0xe8/0x10e
          do_syscall_64+0x68/0x13f
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72baeef0
    • Michal Hocko's avatar
      slab: make GFP_SLAB_BUG_MASK information more human readable · bacdcb34
      Michal Hocko authored
      printk offers %pGg for quite some time so let's use it to get a human
      readable list of invalid flags.
      
      The original output would be
        [  429.191962] gfp: 2
      
      after the change
        [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)
      
      Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bacdcb34
    • Thomas Garnier's avatar
      mm: SLUB freelist randomization · 210e7a43
      Thomas Garnier authored
      Implements freelist randomization for the SLUB allocator.  It was
      previous implemented for the SLAB allocator.  Both use the same
      configuration option (CONFIG_SLAB_FREELIST_RANDOM).
      
      The list is randomized during initialization of a new set of pages.  The
      order on different freelist sizes is pre-computed at boot for
      performance.  Each kmem_cache has its own randomized freelist.
      
      This security feature reduces the predictability of the kernel SLUB
      allocator against heap overflows rendering attacks much less stable.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      Performance results:
      
      slab_test impact is between 3% to 4% on average for 100000 attempts
      without smp.  It is a very focused testing, kernbench show the overall
      impact on the system is way lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
      
      Signed-off-by: default avatarThomas Garnier <thgarnie@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      210e7a43
    • Kees Cook's avatar
      mm: SLUB hardened usercopy support · ed18adc1
      Kees Cook authored
      
      
      Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
      SLUB allocator to catch any copies that may span objects. Includes a
      redzone handling fix discovered by Michael Ellerman.
      
      Based on code from PaX and grsecurity.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reviwed-by: default avatarLaura Abbott <labbott@redhat.com>
      ed18adc1
  11. 21 May, 2016 1 commit
  12. 20 May, 2016 3 commits
  13. 25 Mar, 2016 1 commit
    • Alexander Potapenko's avatar
      mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Alexander Potapenko authored
      
      
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      505f5dcb
  14. 17 Mar, 2016 4 commits
    • Joe Perches's avatar
      mm: coalesce split strings · 756a025f
      Joe Perches authored
      
      
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • Mel Gorman's avatar
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman authored
      
      
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really cares are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default and the ideal for most users that get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or pick a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct/reclaim but no wakeup kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fallback and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win where as a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread increases.  Similar,
      notice the large reduction in most cases in system CPU usage.  The
      overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of caseds but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct relcim activity and allocation stalls is substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so lets take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • Joonsoo Kim's avatar
      mm/slub: query dynamic DEBUG_PAGEALLOC setting · 922d566c
      Joonsoo Kim authored
      
      
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: clean up code, per Christian]
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      922d566c
    • Vladimir Davydov's avatar
      mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov authored
      
      
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
  15. 15 Mar, 2016 8 commits
    • Vlastimil Babka's avatar
      mm, sl[au]b: print gfp_flags as strings in slab_out_of_memory() · 5b3810e5
      Vlastimil Babka authored
      
      
      We can now print gfp_flags more human-readable.  Make use of this in
      slab_out_of_memory() for SLUB and SLAB.  Also convert the SLAB variant
      it to pr_warn() along the way.
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b3810e5
    • Joonsoo Kim's avatar
      mm/slub: support left redzone · d86bd1be
      Joonsoo Kim authored
      
      
      SLUB already has a redzone debugging feature.  But it is only positioned
      at the end of object (aka right redzone) so it cannot catch left oob.
      Although current object's right redzone acts as left redzone of next
      object, first object in a slab cannot take advantage of this effect.
      This patch explicitly adds a left red zone to each object to detect left
      oob more precisely.
      
      Background:
      
      Someone complained to me that left OOB doesn't catch even if KASAN is
      enabled which does page allocation debugging.  That page is out of our
      control so it would be allocated when left OOB happens and, in this
      case, we can't find OOB.  Moreover, SLUB debugging feature can be
      enabled without page allocator debugging and, in this case, we will miss
      that OOB.
      
      Before trying to implement, I expected that changes would be too
      complex, but, it doesn't look that complex to me now.  Almost changes
      are applied to debug specific functions so I feel okay.
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d86bd1be
    • Laura Abbott's avatar
      slub: relax CMPXCHG consistency restrictions · 149daaf3
      Laura Abbott authored
      
      
      When debug options are enabled, cmpxchg on the page is disabled.  This
      is because the page must be locked to ensure there are no false
      positives when performing consistency checks.  Some debug options such
      as poisoning and red zoning only act on the object itself.  There is no
      need to protect other CPUs from modification on only the object.  Allow
      cmpxchg to happen with poisoning and red zoning are set on a slab.
      
      Credit to Mathias Krause for the original work which inspired this
      series
      Signed-off-by: default avatarLaura Abbott <labbott@fedoraproject.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      149daaf3
    • Laura Abbott's avatar
      slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS · becfda68
      Laura Abbott authored
      
      
      SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
      on or off.  Expand its use to be able to turn off all consistency
      checks.  This gives a nice speed up if you only want features such as
      poisoning or tracing.
      
      Credit to Mathias Krause for the original work which inspired this
      series
      Signed-off-by: default avatarLaura Abbott <labbott@fedoraproject.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      becfda68
    • Laura Abbott's avatar
      slub: fix/clean free_debug_processing return paths · 804aa132
      Laura Abbott authored
      Since commit 19c7ff9e
      
       ("slub: Take node lock during object free
      checks") check_object has been incorrectly returning success as it
      follows the out label which just returns the node.
      
      Thanks to refactoring, the out and fail paths are now basically the
      same.  Combine the two into one and just use a single label.
      
      Credit to Mathias Krause for the original work which inspired this
      series
      Signed-off-by: default avatarLaura Abbott <labbott@fedoraproject.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      804aa132
    • Laura Abbott's avatar
      slub: drop lock at the end of free_debug_processing · 282acb43
      Laura Abbott authored
      
      
      This series takes the suggestion of Christoph Lameter and only focuses
      on optimizing the slow path where the debug processing runs.  The two
      main optimizations in this series are letting the consistency checks be
      skipped and relaxing the cmpxchg restrictions when we are not doing
      consistency checks.  With hackbench -g 20 -l 1000 averaged over 100
      runs:
      
      Before slub_debug=P
        mean 15.607
        variance .086
        stdev .294
      
      After slub_debug=P
        mean 10.836
        variance .155
        stdev .394
      
      This still isn't as fast as what is in grsecurity unfortunately so there's
      still work to be done.  Profiling ___slab_alloc shows that 25-50% of time
      is spent in deactivate_slab.  I haven't looked too closely to see if this
      is something that can be optimized.  My plan for now is to focus on
      getting all of this merged (if appropriate) before digging in to another
      task.
      
      This patch (of 4):
      
      Currently, free_debug_processing has a comment "Keep node_lock to preserve
      integrity until the object is actually freed".  In actuallity, the lock is
      dropped immediately in __slab_free.  Rather than wait until __slab_free
      and potentially throw off the unlikely marking, just drop the lock in
      __slab_free.  This also lets free_debug_processing take its own copy of
      the spinlock flags rather than trying to share the ones from __slab_free.
      Since there is no use for the node afterwards, change the return type of
      free_debug_processing to return an int like alloc_debug_processing.
      
      Credit to Mathias Krause for the original work which inspired this series
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarLaura Abbott <labbott@fedoraproject.org>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      282acb43
    • Jesper Dangaard Brouer's avatar
      mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer authored
      
      
      This patch introduce a new API call kfree_bulk() for bulk freeing memory
      objects not bound to a single kmem_cache.
      
      Christoph pointed out that it is possible to implement freeing of
      objects, without knowing the kmem_cache pointer as that information is
      available from the object's page->slab_cache.  Proposing to remove the
      kmem_cache argument from the bulk free API.
      
      Jesper demonstrated that these extra steps per object comes at a
      performance cost.  It is only in the case CONFIG_MEMCG_KMEM is compiled
      in and activated runtime that these steps are done anyhow.  The extra
      cost is most visible for SLAB allocator, because the SLUB allocator does
      the page lookup (virt_to_head_page()) anyhow.
      
      Thus, the conclusion was to keep the kmem_cache free bulk API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily.  Simply by handling if kmem_cache_free_bulk() gets called with a
      kmem_cache NULL pointer.
      
      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase code size even more.
      
      Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
      @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
       1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
       2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
       3 - 140 cycles 35.017 ns -  37 cycles 9.326 ns - 33 cycles 8.474 ns
       4 - 137 cycles 34.507 ns -  31 cycles 7.888 ns - 33 cycles 8.300 ns
       8 - 140 cycles 35.069 ns -  25 cycles 6.461 ns - 25 cycles 6.436 ns
       16 - 138 cycles 34.542 ns -  23 cycles 5.945 ns - 22 cycles 5.670 ns
       30 - 136 cycles 34.227 ns -  22 cycles 5.502 ns - 22 cycles 5.587 ns
       32 - 136 cycles 34.253 ns -  21 cycles 5.475 ns - 21 cycles 5.324 ns
       34 - 136 cycles 34.254 ns -  21 cycles 5.448 ns - 20 cycles 5.194 ns
       48 - 136 cycles 34.075 ns -  21 cycles 5.458 ns - 21 cycles 5.367 ns
       64 - 135 cycles 33.994 ns -  21 cycles 5.350 ns - 21 cycles 5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles 5.816 ns - 22 cycles 5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles 5.727 ns - 22 cycles 5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles 6.093 ns - 23 cycles 5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
       1 - 85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
       2 - 81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
       3 - 78 cycles 19.709 ns -  33 cycles 8.354 ns - 32 cycles 8.044 ns
       4 - 77 cycles 19.430 ns -  28 cycles 7.216 ns - 28 cycles 7.003 ns
       8 - 101 cycles 25.288 ns -  23 cycles 5.849 ns - 23 cycles 5.787 ns
       16 - 76 cycles 19.148 ns -  20 cycles 5.162 ns - 20 cycles 5.081 ns
       30 - 76 cycles 19.067 ns -  19 cycles 4.868 ns - 19 cycles 4.821 ns
       32 - 76 cycles 19.052 ns -  19 cycles 4.857 ns - 19 cycles 4.815 ns
       34 - 121 cycles 30.291 ns -  25 cycles 6.333 ns - 25 cycles 6.268 ns
       48 - 108 cycles 27.111 ns -  21 cycles 5.498 ns - 21 cycles 5.458 ns
       64 - 100 cycles 25.164 ns -  20 cycles 5.242 ns - 20 cycles 5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles 6.886 ns - 27 cycles 6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles 7.711 ns - 30 cycles 7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles 9.560 ns - 38 cycles 9.549 ns
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca257195
    • Jesper Dangaard Brouer's avatar
      mm/slab: move SLUB alloc hooks to common mm/slab.h · 11c7aec2
      Jesper Dangaard Brouer authored
      
      
      First step towards sharing alloc_hook's between SLUB and SLAB
      allocators.  Move the SLUB allocators *_alloc_hook to the common
      mm/slab.h for internal slab definitions.
      Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      11c7aec2