1. 06 Nov, 2015 10 commits
    • mm: page_counter: let page_counter_try_charge() return bool · 6071ca52
      Johannes Weiner authored
      page_counter_try_charge() currently returns 0 on success and -ENOMEM on
      failure, which is surprising behavior given the function name.
      Make it follow the expected pattern of try_stuff() functions that return a
      boolean true to indicate success, or false for failure.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
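The change in return convention described above can be illustrated with a minimal Python sketch. The `PageCounter` class and both method names are hypothetical stand-ins for illustration, not the kernel's actual code; only the two conventions (0 / -ENOMEM versus true / false) come from the commit message.

```python
import errno


class PageCounter:
    """Toy stand-in for a page counter with a hard limit (illustrative only)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def try_charge_errno(self, nr_pages):
        """Old convention: 0 on success, -ENOMEM on failure."""
        if self.count + nr_pages > self.limit:
            return -errno.ENOMEM
        self.count += nr_pages
        return 0

    def try_charge(self, nr_pages):
        """New convention: True on success, False on failure."""
        if self.count + nr_pages > self.limit:
            return False
        self.count += nr_pages
        return True


if __name__ == "__main__":
    counter = PageCounter(limit=2)
    # With the boolean convention the call site reads naturally.
    if counter.try_charge(2):
        print("charged 2 pages")
    if not counter.try_charge(1):
        print("charge of 1 more page refused")
```

With the old convention a successful call evaluates as false in a condition, which is the surprise the commit removes.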
    • mm: memcontrol: eliminate root memory.current · f5fc3c5d
      Johannes Weiner authored
      memory.current on the root level doesn't add anything that wouldn't be
      more accurate and detailed using system statistics.  It already doesn't
      include slabs, and it'll be a pain to keep in sync when further memory
      types are accounted in the memory controller.  Remove it.
      Note that this applies to the new unified hierarchy interface only.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Hugh Dickins authored
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: simplify and inline __mem_cgroup_from_kmem · df406551
      Vladimir Davydov authored
      Before the previous patch ("memcg: unify slab and other kmem pages
      charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
      pages and pages allocated with alloc_kmem_pages - differently, because
      slab pages did not store information about the owner memcg in the page
      struct.  Now we can unify it.  Since this function becomes tiny after
      that, we can fold it into mem_cgroup_from_kmem.
      [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: unify slab and other kmem pages charging · f3ccb2c4
      Vladimir Davydov authored
      We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
      uncharging kmem pages to memcg, but currently they are not used for
      charging slab pages (i.e.  they are only used for charging pages allocated
      with alloc_kmem_pages).  The only reason why the slab subsystem uses
      special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
      needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
      to the memcg that the current task belongs to.
      To remove this diversity, this patch adds an extra argument to
      __memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
      not NULL, the function tries to charge to the memcg it points to,
      otherwise it charges to the current context.  Next, it makes the slab
      subsystem use this function to charge slab pages.
      Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
      in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
      __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
      don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
      Besides, one can now detect which memcg a slab page belongs to by reading
      /proc/kpagecgroup.
      Note, this patch switches slab to charge-after-alloc design.  Since this
      design is already used for all other memcg charges, it should not make any
      difference.
      [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: simplify charging kmem pages · d05e83a6
      Vladimir Davydov authored
      Charging kmem pages proceeds in two steps.  First, we try to charge the
      allocation size to the memcg the current task belongs to, then we allocate
      a page and "commit" the charge storing the pointer to the memcg in the
      page struct.
      Such a design looks overcomplicated, because there is not much sense in
      trying to charge the allocation before actually allocating a page: we won't
      be able to consume much memory over the limit even if we charge after
      doing the actual allocation.  Besides, we already charge user pages post
      factum, so being pedantic with kmem pages just looks pointless.
      So this patch simplifies the design by merging the "charge" and the
      "commit" steps into the same function, which takes the allocated page.
      Also, rename the charge and uncharge methods to memcg_kmem_charge and
      memcg_kmem_uncharge and make the charge method return error code instead
      of bool to conform to mem_cgroup_try_charge.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: fix order calculation in try_charge() · 3608de07
      Jerome Marchand authored
      Since commit 6539cc05 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
      the order to pass to mem_cgroup_oom() is calculated by passing the
      number of pages to get_order() instead of the expected size in bytes.
      AFAICT, it only affects the value displayed in the oom warning message.
      This patch fixes this.
      Michal said:
      : We haven't noticed that just because the OOM is enabled only for page
      : faults of order-0 (single page) and get_order work just fine.  Thanks for
      : noticing this.  If we ever start triggering OOM on different orders this
      : would be broken.
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
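The unit mix-up described in this commit can be reproduced with a small Python model. This is a sketch that assumes 4 KiB pages; `get_order` below is a re-implementation of the idea (smallest order whose block covers a byte size), and `oom_order_buggy` / `oom_order_fixed` are hypothetical names, not kernel functions.

```python
PAGE_SHIFT = 12               # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def get_order(size_bytes):
    """Smallest order such that (PAGE_SIZE << order) >= size_bytes."""
    pages = (size_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT  # round up to whole pages
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def oom_order_buggy(nr_pages):
    # A page count is passed where a size in bytes is expected.
    return get_order(nr_pages)


def oom_order_fixed(nr_pages):
    # Convert the page count to bytes first.
    return get_order(nr_pages * PAGE_SIZE)


if __name__ == "__main__":
    for nr_pages in (1, 3, 512):
        print(nr_pages, oom_order_buggy(nr_pages), oom_order_fixed(nr_pages))
```

For a single page both variants give order 0, which is consistent with Michal's remark that the bug went unnoticed while OOM was only triggered for order-0 charges.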
    • memcg: ratify and consolidate over-charge handling · 10d53c74
      Tejun Heo authored
      try_charge() is the main charging logic of memcg.  When it hits the limit
      but either can't fail the allocation due to __GFP_NOFAIL or the task is
      likely to free memory very soon, being OOM killed, has SIGKILL pending or
      exiting, it "bypasses" the charge to the root memcg and returns -EINTR.
      While this is one approach which can be taken for these situations, it has
      several issues.
      * It unnecessarily lies about the reality.  The number itself doesn't
        go over the limit but the actual usage does.  memcg is either forced
        to or actively chooses to go over the limit because that is the
        right behavior under the circumstances, which is completely fine,
        but, if at all avoidable, it shouldn't be misrepresenting what's
        happening by sneaking the charges into the root memcg.
      * Despite trying, we already do over-charge.  kmemcg can't deal with
        switching over to the root memcg by the point try_charge() returns
        -EINTR, so it open-codes over-charging.
      * It complicates the callers.  Each try_charge() user has to handle
        the weird -EINTR exception.  memcg_charge_kmem() does the manual
        over-charging.  mem_cgroup_do_precharge() performs unnecessary
        uncharging of root memcg, which BTW is inconsistent with what
        memcg_charge_kmem() does but not broken as [un]charging are noops on
        root memcg.  mem_cgroup_try_charge() needs to switch the returned
        cgroup to the root one.
      The reality is that in memcg there are cases where we are forced and/or
      willing to go over the limit.  Each such case needs to be scrutinized and
      justified but there definitely are situations where that is the right
      thing to do.  We already do this but with a superficial and inconsistent
      disguise which leads to unnecessary complications.
      This patch updates try_charge() so that it over-charges and returns 0 when
      deemed necessary.  -EINTR return is removed along with all special case
      handling in the callers.
      While at it, remove the local variable @ret, which was initialized to zero
      and never changed, along with done: label which just returned the always
      zero @ret.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
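A rough Python model of the behavior this commit describes: when the charge cannot be allowed to fail, it is accounted to the memcg itself (going over the limit) and 0 is returned, rather than being redirected to the root memcg with -EINTR. The `Memcg` class and the `nofail` / `task_dying` parameters are illustrative assumptions; reclaim and OOM handling are deliberately left out.

```python
import errno


class Memcg:
    """Toy memory cgroup: a usage counter and a hard limit (illustrative only)."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0


def try_charge(memcg, nr_pages, nofail=False, task_dying=False):
    """Return 0 on success or -ENOMEM on failure; never -EINTR."""
    if memcg.usage + nr_pages <= memcg.limit:
        memcg.usage += nr_pages
        return 0
    if nofail or task_dying:
        # Forced over-charge: the usage honestly reflects going over the limit.
        memcg.usage += nr_pages
        return 0
    return -errno.ENOMEM


if __name__ == "__main__":
    cg = Memcg(limit=4)
    print(try_charge(cg, 4))               # fits under the limit
    print(try_charge(cg, 1))               # refused
    print(try_charge(cg, 1, nofail=True))  # over-charged
    print(cg.usage, cg.limit)
```

Callers in this model only need to distinguish success from -ENOMEM, which mirrors the removal of the special-case -EINTR handling.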
    • memcg: punt high overage reclaim to return-to-userland path · b23afb93
      Tejun Heo authored
      Currently, try_charge() tries to reclaim memory synchronously when the
      high limit is breached; however, if the allocation doesn't have
      __GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
      speculative allocations, it can blow way past the high limit.  This is
      actually easily reproducible by simply doing "find /".  slab/slub
      allocator tries speculative allocations first, so as long as there's
      memory which can be consumed without blocking, it can keep allocating
      memory regardless of the high limit.
      This patch makes try_charge() always punt the over-high reclaim to the
      return-to-userland path.  If try_charge() detects that high limit is
      breached, it adds the overage to current->memcg_nr_pages_over_high and
      schedules execution of mem_cgroup_handle_over_high() which performs
      synchronous reclaim from the return-to-userland path.
      As long as kernel doesn't have a run-away allocation spree, this should
      provide enough protection while making kmemcg behave more consistently.
      It also has the following benefits.
      - All over-high reclaims can use GFP_KERNEL regardless of the specific
        gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
      - It copes with prio inversion.  Previously, a low-prio task with
        small memory.high might perform over-high reclaim with a bunch of
        locks held.  If a higher prio task needed any of these locks, it
        would have to wait until the low prio task finished reclaim and
        released the locks.  By handing over-high reclaim to the task exit
        path this issue can be avoided.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
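The deferral scheme can be sketched in Python as follows. Only the field name memcg_nr_pages_over_high and the idea of a handler run on the return-to-userland path come from the commit message; the `Task` and `Memcg` classes, the `notify_resume` flag, and the trivial `reclaim()` are assumptions made for illustration.

```python
class Task:
    """Toy task: tracks deferred overage and a pending return-to-user hook."""

    def __init__(self):
        self.memcg_nr_pages_over_high = 0
        self.notify_resume = False  # stand-in for scheduling return-path work


class Memcg:
    """Toy memory cgroup with a high limit and a simplistic reclaim."""

    def __init__(self, high):
        self.high = high
        self.usage = 0

    def reclaim(self, nr_pages):
        freed = min(nr_pages, self.usage)
        self.usage -= freed
        return freed


def try_charge(task, memcg, nr_pages):
    """Charge without reclaiming; only record the overage for later."""
    memcg.usage += nr_pages
    if memcg.usage > memcg.high:
        task.memcg_nr_pages_over_high += nr_pages
        task.notify_resume = True
    return 0


def handle_over_high(task, memcg):
    """Run on the return-to-userland path: reclaim the recorded overage."""
    nr_pages = task.memcg_nr_pages_over_high
    if not nr_pages:
        return 0
    task.memcg_nr_pages_over_high = 0
    task.notify_resume = False
    return memcg.reclaim(nr_pages)


if __name__ == "__main__":
    task, cg = Task(), Memcg(high=10)
    try_charge(task, cg, 8)
    try_charge(task, cg, 4)             # breaches the high limit
    print(task.memcg_nr_pages_over_high, task.notify_resume)
    print(handle_over_high(task, cg), cg.usage)
```

Because the charge path never blocks in this model, it behaves the same for every allocation context, and the reclaim runs later with no locks held by the charging code.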
    • memcg: flatten task_struct->memcg_oom · 626ebc41
      Tejun Heo authored
      task_struct->memcg_oom is a sub-struct containing fields which are used
      for async memcg oom handling.  Most task_struct fields aren't packaged
      this way and it can lead to unnecessary alignment paddings.  This patch
      flattens it.
      * task.memcg_oom.memcg          -> task.memcg_in_oom
      * task.memcg_oom.gfp_mask       -> task.memcg_oom_gfp_mask
      * task.memcg_oom.order          -> task.memcg_oom_order
      * task.memcg_oom.may_oom        -> task.memcg_may_oom
      In addition, task.memcg_may_oom is relocated to where other bitfields are
      which reduces the size of task_struct.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
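The padding effect mentioned above can be observed with Python's ctypes, which lays out structures using the platform's C alignment rules. This is only a sketch: the structures below are tiny hypothetical stand-ins (a single pre-existing bitfield named in_execve plus the four OOM fields), not the real task_struct.

```python
import ctypes


class MemcgOom(ctypes.Structure):
    """The nested sub-struct, as it was before flattening (illustrative)."""
    _fields_ = [("memcg", ctypes.c_void_p),
                ("gfp_mask", ctypes.c_uint),
                ("order", ctypes.c_int),
                ("may_oom", ctypes.c_uint, 1)]


class TaskNested(ctypes.Structure):
    _fields_ = [("in_execve", ctypes.c_uint, 1),  # stand-in for existing bitfields
                ("memcg_oom", MemcgOom)]


class TaskFlat(ctypes.Structure):
    _fields_ = [("in_execve", ctypes.c_uint, 1),
                ("memcg_may_oom", ctypes.c_uint, 1),  # shares the bitfield word
                ("memcg_oom_gfp_mask", ctypes.c_uint),
                ("memcg_oom_order", ctypes.c_int),
                ("memcg_in_oom", ctypes.c_void_p)]


if __name__ == "__main__":
    print("nested:", ctypes.sizeof(TaskNested))
    print("flat:  ", ctypes.sizeof(TaskFlat))
```

In the nested layout the one-bit may_oom field occupies its own word and the sub-struct carries its own alignment padding; once flattened, the bit can sit next to the other bitfields, so the flat structure comes out smaller.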
  2. 16 Oct, 2015 1 commit
  3. 12 Oct, 2015 1 commit
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Tejun Heo authored
      For memcg domains, the amount of available memory was calculated as
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
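The two formulas in this commit message can be compared numerically with a short Python sketch. The function names and the sample numbers are hypothetical; `calc_avail` is an assumed reading of how the result could be derived from file pages, headroom, dirty pages and globally clean pages, not the kernel's actual mdtc_calc_avail().

```python
def avail_old(in_use, headroom, total_clean):
    """Old formula: the whole sum is capped by the amount of clean memory."""
    return min(in_use + headroom, total_clean)


def avail_new(in_use, headroom, clean_elsewhere):
    """New formula: only the headroom is capped by clean memory elsewhere."""
    return in_use + min(headroom, clean_elsewhere)


def calc_avail(filepages, headroom, dirty, global_clean):
    """Sketch: derive 'clean memory elsewhere' and apply the new formula."""
    clean = filepages - min(filepages, dirty)                # clean pages owned by the memcg
    other_clean = global_clean - min(global_clean, clean)    # clean pages outside the memcg
    return avail_new(filepages, headroom, other_clean)


if __name__ == "__main__":
    # A memcg domain with mostly dirty memory (units: pages).
    filepages, dirty, headroom, global_clean = 1000, 800, 500, 300
    print(avail_old(filepages, headroom, global_clean))        # lower than the amount in use
    print(calc_avail(filepages, headroom, dirty, global_clean))
```

With a large share of dirty memory the old formula yields a value below the amount already in use, which is the nonsensical case the commit message points out.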
  4. 02 Oct, 2015 2 commits
  5. 10 Sep, 2015 2 commits
    • memcg: zap try_get_mem_cgroup_from_page · e993d905
      Vladimir Davydov authored
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add page_cgroup_ino helper · 2fc04524
      Vladimir Davydov authored
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      ==== USE CASES ====
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change in time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help take a more
      optimal decision when considering migrating the unit to another node
      within the cluster.
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
      pages only by smart user memory manager.
      ==== USER API ====
      The user API consists of two new files:
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
         This file can be used to estimate the amount of pages that are not
         used by a particular workload as follows:
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
      For an example of using these files for estimating the amount of unused
      memory pages per each memory cgroup, please see the script attached
      below.
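As a small illustration of the bitmap indexing described above, the following Python sketch maps a PFN to a position in the file and counts set bits in a buffer. It assumes the bitmap is an array of native-endian 8-byte words with page i at bit i % 64 of word i // 64 (consistent with the 8-byte accesses in the author's script); the helper names are hypothetical.

```python
import struct

BITS_PER_WORD = 64  # assumption: the bitmap is accessed in 8-byte words


def bitmap_position(pfn):
    """Return (byte offset of the 8-byte word, bit index inside it) for a PFN."""
    return (pfn // BITS_PER_WORD) * 8, pfn % BITS_PER_WORD


def count_idle_bits(buf):
    """Count set bits in a buffer read from the bitmap (length multiple of 8)."""
    total = 0
    for (word,) in struct.iter_unpack("Q", buf):
        total += bin(word).count("1")
    return total


if __name__ == "__main__":
    print(bitmap_position(130))
    sample = struct.pack("QQ", 0b1011, 1 << 63)   # in-memory stand-in for file data
    print(count_idle_bits(sample))
```

The example works on an in-memory buffer so it can be run without root privileges; reading the real file requires a kernel with the feature enabled.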
      ==== REASONING ====
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      ==== PATCHSET STRUCTURE ====
      The patch set is organized as follows:
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      ==== SIMILAR WORKS ====
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      ==== PERFORMANCE EVALUATION ====
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      For tracking idle memory, the idlememstat utility was used.
      testcase            base            patched        patched-active
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      time idlememstat:
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      import os
      import stat
      import errno
      import struct
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
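      def set_idle():
          # Mark every page idle by writing all-ones words until the end of
          # the bitmap (max PFN) is reached, which is reported as ENXIO.
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()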
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
          return idlememsz
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
          print "Counting idle pages..."
          idlememsz = count_idle()
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      This patch (of 8):
      Add page_cgroup_ino() helper to memcg.
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 08 Sep, 2015 6 commits
  7. 04 Sep, 2015 1 commit
    • mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() · ce9ce665
      Sebastian Andrzej Siewior authored
      Clark stumbled over a VM_BUG_ON() in -RT which was then removed by
      Johannes in commit f371763a ("mm: memcontrol: fix false-positive
      VM_BUG_ON() on -rt").  The comment before that patch was a tiny bit better
      than it is now.  While the patch claimed to fix a false-positive on -RT
      this was not the case.  None of the -RT folks ACKed it and it was not a
      false positive report.  That was a *real* problem.
      This patch updates the comment that is improper because it refers to
      "disabled preemption" as a consequence of that lock being taken.  A
      spin_lock() disables preemption, true, but in this case the code relies on
      the fact that the lock _also_ disables interrupts once it is acquired.
      And this is the important detail (which was checked by the VM_BUG_ON()) which
      needs to be pointed out.  This is the hint one needs while looking at the
      code.  It was explained by Johannes on the list that the per-CPU variables
      are protected by local_irq_save().  The BUG_ON() was helpful.  This code
      has been worked around in -RT in the meantime.  I wouldn't mind running
      into more of those if the code in question uses *special* kind of locking
      since now there is no verification (in terms of lockdep or BUG_ON()) and
      therefore I bring the VM_BUG_ON() check back in.
      The two functions after the comment could also have a "local_irq_save()"
      dance around them in order to serialize access to the per-CPU variables.
      This has been avoided because the interrupts should be off.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 25 Jun, 2015 4 commits
    • memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Tejun Heo authored
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here, it's trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and
      false sense of security.
      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic although not being actually broken.
      * mem_cgroup_oom_register_event()
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      * memcg_oom_recover()
        The lockless test doesn't have any explanation why this would be
        safe.
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Tejun Heo authored
      Since commit 49426420 ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: oom_kill: simplify OOM killer locking · dc56401f
      Johannes Weiner authored
      The zonelist locking and the oom_sem are two overlapping locks that are
      used to serialize global OOM killing against different things.
      The historical zonelist locking serializes OOM kills from allocations with
      overlapping zonelists against each other to prevent killing more tasks
      than necessary in the same memory domain.  Only when neither tasklists nor
      zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
      bound to separate nodes) are OOM kills allowed to execute in parallel.
      The younger oom_sem is a read-write lock to serialize OOM killing against
      the PM code trying to disable the OOM killer altogether.
      However, the OOM killer is a fairly cold error path; there is really no
      reason to optimize for highly performant and concurrent OOM kills.  And
      the oom_sem is just flat-out redundant.
      Replace both locking schemes with a single global mutex serializing OOM
      kills regardless of context.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Johannes Weiner's avatar
      mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Johannes Weiner authored
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 Jun, 2015 2 commits
  10. 02 Jun, 2015 8 commits
    • Tejun Heo's avatar
      writeback: implement memcg writeback domain based throttling · c2aa723a
      Tejun Heo authored
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      Processes belonging to a memcg may have access to only a subset of the
      total memory available in the system.  Not factoring this into dirty
      throttling rendered it completely ineffective for processes under
      memcg limits, and memcg ended up building a separate ad-hoc degenerate
      mechanism directly into vmscan code to limit page dirtying.
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
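The two-domain rule described in the throttling commit above (apply the lower of the two sets of constraints, and keep background writeback running while either domain is over its background threshold) can be illustrated with a small userspace sketch. This is not the kernel's balance_dirty_pages() logic, which also computes rate limits and pause times; the struct domain_state type and the helper names below are hypothetical and reduce each domain to three page counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one writeback domain's state (in pages). */
struct domain_state {
	unsigned long dirty;      /* pages currently dirty in this domain */
	unsigned long thresh;     /* hard dirty threshold */
	unsigned long bg_thresh;  /* background writeback threshold */
};

/* Headroom before throttling kicks in for one domain; 0 if already over. */
static unsigned long domain_headroom(const struct domain_state *d)
{
	return d->dirty < d->thresh ? d->thresh - d->dirty : 0;
}

/*
 * Pages a task may still dirty: the tighter of the global and memcg
 * constraints.  memcg may be NULL for tasks in the root memcg.
 */
static unsigned long dirty_headroom(const struct domain_state *global,
				    const struct domain_state *memcg)
{
	unsigned long room;

	assert(global != NULL);
	room = domain_headroom(global);
	if (memcg) {
		unsigned long memcg_room = domain_headroom(memcg);

		if (memcg_room < room)
			room = memcg_room;
	}
	return room;
}

/* Background writeback keeps going while either domain is over bg_thresh. */
static bool over_bg_thresh(const struct domain_state *global,
			   const struct domain_state *memcg)
{
	assert(global != NULL);
	if (global->dirty > global->bg_thresh)
		return true;
	return memcg != NULL && memcg->dirty > memcg->bg_thresh;
}
```

With a global domain far below its limits and a memcg domain close to its own, the memcg constraint wins, which is the behavior the commit message describes for processes in a non-root memcg.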
    • Tejun Heo's avatar
      writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Tejun Heo authored
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling
      from adjusting to the new configuration leading to unexpected
      behaviors including unnecessary OOM kills.
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
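Why a smoothed limit gets in the way after a real size drop, and what resetting it achieves, may be easier to see in a toy model. The sketch below is a userspace approximation under stated assumptions: struct domain, update_dirty_limit() and domain_size_changed() are hypothetical stand-ins, and the 1/32 decay step is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of a writeback domain used here. */
struct domain {
	unsigned long dirty_limit;        /* smoothed dirty threshold, in pages */
	unsigned long dirty_limit_tstamp; /* time of the last limit update */
};

/*
 * Track the raw threshold: follow rises immediately, follow drops slowly.
 * The 1/32 step per update is an assumption made for this sketch.
 */
static void update_dirty_limit(struct domain *dom, unsigned long thresh,
			       unsigned long now)
{
	assert(dom != NULL);
	if (dom->dirty_limit < thresh)
		dom->dirty_limit = thresh;
	else
		dom->dirty_limit -= (dom->dirty_limit - thresh) >> 5;
	dom->dirty_limit_tstamp = now;
}

/*
 * On a configuration change, forget the smoothed history so the next
 * update starts from the new threshold instead of decaying toward it.
 */
static void domain_size_changed(struct domain *dom, unsigned long now)
{
	assert(dom != NULL);
	dom->dirty_limit = 0;
	dom->dirty_limit_tstamp = now;
}
```

Without the reset, a limit of 3200 pages would only fall to 3100 after one update even if the threshold dropped to zero; after the reset, the next update adopts the new threshold directly.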
    • Tejun Heo's avatar
      writeback: implement memcg wb_domain · 841710aa
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
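The idea of measuring one wb against two domains can be sketched with plain counters. This is only an illustration: the kernel uses aging proportion counters rather than raw totals, and the types and helpers below (struct domain, struct wb, wb_writeout_inc_sketch(), wb_share()) are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified writeback domain: counts all completions in it. */
struct domain {
	unsigned long completions;
};

/* Simplified wb: one completion count per domain it belongs to. */
struct wb {
	unsigned long completions;        /* measured against the global domain */
	unsigned long memcg_completions;  /* measured against the memcg domain */
	struct domain *memcg_domain;      /* NULL for the root cgroup's wb */
};

static struct domain global_domain;

/* Account one finished writeout against every domain the wb belongs to. */
static void wb_writeout_inc_sketch(struct wb *wb)
{
	assert(wb != NULL);
	wb->completions++;
	global_domain.completions++;
	if (wb->memcg_domain) {
		wb->memcg_completions++;
		wb->memcg_domain->completions++;
	}
}

/*
 * Share of a domain's dirtyable pages for a wb, proportional to the
 * completions it contributed.  Overflow is ignored in this sketch.
 */
static unsigned long wb_share(unsigned long wb_completions,
			      const struct domain *dom,
			      unsigned long dirtyable_pages)
{
	if (dom->completions == 0)
		return 0;
	return dirtyable_pages * wb_completions / dom->completions;
}
```

A wb that did three of four writeouts system-wide gets three quarters of the global dirtyable memory, but if it is the only wb in its memcg it gets all of that memcg's (usually much smaller) dirtyable memory.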
    • Tejun Heo's avatar
      memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online · 733a572e
      Tejun Heo authored
      cpu_possible_mask represents the CPUs which are actually possible
      during that boot instance.  For systems which don't support CPU
      hotplug, this will match cpu_online_mask exactly in most cases.  Even
      for systems which support CPU hotplug, the number of possible CPU
      slots is highly unlikely to diverge greatly from the number of online
      CPUs.  The only cases where the difference between possible and online
      caused problems were when the boot code failed to initialize the
      possible mask and left it fully set at NR_CPUS - 1.
      As such, most per-cpu constructs allocate for all possible CPUs and
      often iterate over the possibles, which also has the benefit of
      avoiding the blocking CPU hotplug synchronization.
      memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
      mem_cgroup_read_events(), which iterates over online CPUs and handles
      CPU hotplug operations explicitly.  This complexity doesn't actually
      buy anything.  Switch to iterating over the possibles and drop the
      explicit CPU hotplug handling.
      Eventually, we want to convert memcg to use percpu_counter instead of
      its own custom implementation which also benefits from quick access
      w/o summing for cases where larger error margin is acceptable.
      This will allow mem_cgroup_read_stat() to be called from non-sleepable
      contexts which will be used by cgroup writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
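The trade-off between summing over online and over possible CPUs can be shown with a toy per-cpu counter. The sketch below is a userspace model, not memcg's implementation: NR_POSSIBLE_CPUS, struct percpu_stat, and the cpu_online array are hypothetical stand-ins for the kernel's per-cpu data and CPU masks.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS 4  /* hypothetical number of possible CPU slots */

/* Hypothetical per-cpu statistic: one slot per possible CPU. */
struct percpu_stat {
	long count[NR_POSSIBLE_CPUS];
};

static bool cpu_online[NR_POSSIBLE_CPUS] = { true, true, true, true };

static void stat_add(struct percpu_stat *s, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_POSSIBLE_CPUS);
	s->count[cpu] += delta;
}

/* Old style: only online CPUs are summed, so offlining needs extra handling. */
static long stat_read_online(const struct percpu_stat *s)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		if (cpu_online[cpu])
			sum += s->count[cpu];
	return sum;
}

/* New style: every possible slot is summed; CPU hotplug needs no callback. */
static long stat_read_possible(const struct percpu_stat *s)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		sum += s->count[cpu];
	return sum;
}
```

When a CPU goes offline in this model, the online-only reader silently loses that CPU's contribution unless something folds it into another slot, while the possible-CPU reader stays correct with no hotplug handling at all.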
    • Tejun Heo's avatar
      writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo authored
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
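The "root wb always exists, other wbs are created on demand and indexed by memcg id" arrangement can be sketched as a lookup-or-create helper. The sketch assumes a linked list in place of bdi->cgwb_tree, and every name in it (struct cgwb, struct bdi, ROOT_MEMCG_ID, lookup_or_create_wb()) is hypothetical; locking, reference counting, and the shared congested state are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified cgroup writeback context. */
struct cgwb {
	int memcg_id;
	struct cgwb *next;
};

/* Simplified bdi: a root wb plus a list standing in for bdi->cgwb_tree. */
struct bdi {
	struct cgwb root_wb;   /* always present, serves the root cgroup */
	struct cgwb *cgwbs;    /* created on demand for non-root memcgs */
};

#define ROOT_MEMCG_ID 0  /* assumption: the id used for the root memcg here */

/* Look up the wb for a memcg on this bdi, creating it on first use. */
static struct cgwb *lookup_or_create_wb(struct bdi *bdi, int memcg_id)
{
	struct cgwb *wb;

	assert(bdi != NULL);
	if (memcg_id == ROOT_MEMCG_ID)
		return &bdi->root_wb;

	for (wb = bdi->cgwbs; wb; wb = wb->next)
		if (wb->memcg_id == memcg_id)
			return wb;

	wb = malloc(sizeof(*wb));
	if (!wb)
		return NULL;
	wb->memcg_id = memcg_id;
	wb->next = bdi->cgwbs;
	bdi->cgwbs = wb;
	return wb;
}
```

Repeated lookups for the same memcg id return the same wb, so an inode can cache the association once it has been made.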
    • Tejun Heo's avatar
      memcg: implement mem_cgroup_css_from_page() · ad7fa852
      Tejun Heo authored
      Implement mem_cgroup_css_from_page() which returns the
      cgroup_subsys_state of the memcg associated with a given page on the
      default hierarchy.  This will be used by cgroup writeback support.
      This function assumes that page->mem_cgroup association doesn't change
      until the page is released, which is true on the default hierarchy as
      long as replace_page_cache_page() is not used.  As the only user of
      replace_page_cache_page() is FUSE which won't support cgroup writeback
      for the time being, this works for now, and replace_page_cache_page()
      will soon be updated so that the invariant actually holds.
      Note that the RCU protected page->mem_cgroup access is consistent with
      other usages across memcg but ultimately incorrect.  These unlocked
      accesses are missing required barriers.  page->mem_cgroup should be
      made an RCU pointer and updated and accessed using RCU operations.
      v4: Instead of triggering WARN, return the root css on the traditional
          hierarchies.  This makes the function a lot easier to deal with
          especially as there's no light way to synchronize against
          hierarchy rebinding.
      v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/
      v2: Trigger WARN if the function is used on the traditional
          hierarchies and add comment about the assumed invariant.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
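The v4 behavior (return the root css on the traditional hierarchies instead of warning) amounts to a simple fallback, sketched below in userspace. The types, the on_default_hierarchy flag, and css_from_page() are hypothetical stand-ins; the fallback for an uncharged page is an assumption of this sketch, and the RCU protection discussed in the commit message is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct css { const char *name; };          /* stand-in for cgroup_subsys_state */
struct memcg { struct css css; };
struct page { struct memcg *mem_cgroup; }; /* NULL if the page is uncharged */

static struct memcg root_memcg = { { "root" } };
static bool on_default_hierarchy = true;   /* hypothetical global switch */

/*
 * Return the css of the memcg owning the page.  On traditional hierarchies
 * (and, as an assumption here, for uncharged pages) fall back to the root.
 */
static struct css *css_from_page(const struct page *page)
{
	struct memcg *memcg;

	assert(page != NULL);
	memcg = page->mem_cgroup;
	if (!on_default_hierarchy || !memcg)
		memcg = &root_memcg;
	return &memcg->css;
}
```

Because the function always returns a valid css, callers need no special case for hierarchy rebinding, which is the convenience the v4 note points at.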
    • Tejun Heo's avatar
      memcg: add mem_cgroup_root_css · 56161634
      Tejun Heo authored
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
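For readers unfamiliar with the ERR_PTR(-EINVAL) convention mentioned above, here is a userspace re-creation of the error-pointer idiom. The helpers mirror the kernel's names but are reimplemented for illustration; SKETCH_MEMCG_ENABLED and root_css are hypothetical stand-ins for the real configuration option and symbol.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer in the top 4095 addresses. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for error codes. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct css { int id; };  /* stand-in for cgroup_subsys_state */

#ifdef SKETCH_MEMCG_ENABLED  /* hypothetical switch standing in for the config option */
static struct css root_css_storage;
#define root_css (&root_css_storage)
#else
/* With the controller compiled out, the symbol is an error pointer. */
#define root_css ((struct css *)ERR_PTR(-EINVAL))
#endif
```

Callers can then test the value with IS_ERR() instead of needing a separate "is memcg enabled" check.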
    • Greg Thelen's avatar
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen authored
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty())
      		mem_cgroup_update_page_stat(memcg)
      	mem_cgroup_end_page_stat(memcg)
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      As expected anon page faults are not affected by this patch.
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
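The accounting rule in the dirty page commit above (update the per-memcg dirty counter in the same places as the global one, and only when the dirty flag actually changes) can be sketched as follows. This is a single-threaded userspace model under stated assumptions: the types and helpers are hypothetical, and the mem_cgroup_{begin,end}_page_stat() locking that serializes against charge moving is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg { long nr_dirty; };   /* stand-in for the per-memcg dirty stat */
struct page {
	bool dirty;
	struct memcg *memcg;           /* owner; NULL if uncharged */
};

static long global_nr_dirty;       /* stand-in for the global dirty counter */

/* Set the dirty flag and report whether it was already set. */
static bool test_set_page_dirty(struct page *page)
{
	bool was_dirty = page->dirty;

	page->dirty = true;
	return was_dirty;
}

/* Dirty a page; count it once, globally and in its memcg. */
static void page_mark_dirty(struct page *page)
{
	assert(page != NULL);
	if (!test_set_page_dirty(page)) {
		global_nr_dirty++;
		if (page->memcg)
			page->memcg->nr_dirty++;
	}
}

/* Clean a page; undo the accounting only if it was dirty. */
static void page_mark_clean(struct page *page)
{
	assert(page != NULL);
	if (page->dirty) {
		page->dirty = false;
		global_nr_dirty--;
		if (page->memcg)
			page->memcg->nr_dirty--;
	}
}
```

Keying the counter updates on the flag transition keeps both counters consistent even when the same page is dirtied repeatedly before writeback.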
  11. 15 Apr, 2015 3 commits