1. 25 Mar, 2016 40 commits
    • Nicolai Stange's avatar
      mm/filemap: generic_file_read_iter(): check for zero reads unconditionally · e7080a43
      Nicolai Stange authored
      
      
      If
       - generic_file_read_iter() gets called with a zero read length,
       - the read offset is at a page boundary,
       - IOCB_DIRECT is not set
      -  and the page in question hasn't made it into the page cache yet,
      then do_generic_file_read() will trigger a readahead with a req_size hint
      of zero.
      
      Since roundup_pow_of_two(0) is undefined, UBSAN reports
      
        UBSAN: Undefined behaviour in include/linux/log2.h:63:13
        shift exponent 64 is too large for 64-bit type 'long unsigned int'
        CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
        [...]
        Call Trace:
         [...]
         [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
         [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
         [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
         [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
         [...]
         [<ffffffff81510b06>] __vfs_read+0x256/0x3d0
         [...]
      
      when get_init_ra_size() gets called from ondemand_readahead().
      
      The net effect is that the initial readahead size is arch dependent for
      requested read lengths of zero: for example, since
      
        1UL << (sizeof(unsigned long) * 8)
      
      evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
      size becomes 4 on the former and 0 on the latter.
      
      What's more, whether or not the file access timestamp is updated for zero
      length reads is decided differently for the two cases of IOCB_DIRECT
      being set or cleared: in the first case, generic_file_read_iter()
      explicitly skips updating that timestamp while in the latter case, it is
      always updated through the call to do_generic_file_read().
      
      According to POSIX, zero length reads "do not modify the last data access
      timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
      
      Let generic_file_read_iter() unconditionally check the requested read
      length at its entry and return immediately with success if it is zero.
      Signed-off-by: default avatarNicolai Stange <nicstange@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7080a43
    • Alexander Potapenko's avatar
      kasan: test fix: warn if the UAF could not be detected in kmalloc_uaf2 · 9dcadd38
      Alexander Potapenko authored
      
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9dcadd38
    • Alexander Potapenko's avatar
      mm, kasan: stackdepot implementation. Enable stackdepot for SLAB · cd11016e
      Alexander Potapenko authored
      
      
      Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
      will allow KASAN store allocation/deallocation stack traces for memory
      chunks.  The stack traces are stored in a hash table and referenced by
      handles which reside in the kasan_alloc_meta and kasan_free_meta
      structures in the allocated memory chunks.
      
      IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
      duplication.
      
      Right now stackdepot support is only enabled in SLAB allocator.  Once
      KASAN features in SLAB are on par with those in SLUB we can switch SLUB
      to stackdepot as well, thus removing the dependency on SLUB stack
      bookkeeping, which wastes a lot of memory.
      
      This patch is based on the "mm: kasan: stack depots" patch originally
      prepared by Dmitry Chernenkov.
      
      Joonsoo has said that he plans to reuse the stackdepot code for the
      mm/page_owner.c debugging facility.
      
      [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
      [aryabinin@virtuozzo.com: comment style fixes]
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd11016e
    • Alexander Potapenko's avatar
      arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections · be7635e7
      Alexander Potapenko authored
      
      
      KASAN needs to know whether the allocation happens in an IRQ handler.
      This lets us strip everything below the IRQ entry point to reduce the
      number of unique stack traces needed to be stored.
      
      Move the definition of __irq_entry to <linux/interrupt.h> so that the
      users don't need to pull in <linux/ftrace.h>.  Also introduce the
      __softirq_entry macro which is similar to __irq_entry, but puts the
      corresponding functions to the .softirqentry.text section.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Acked-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be7635e7
    • Alexander Potapenko's avatar
      mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Alexander Potapenko authored
      
      
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      505f5dcb
    • Alexander Potapenko's avatar
      mm, kasan: SLAB support · 7ed2f9e6
      Alexander Potapenko authored
      
      
      Add KASAN hooks to SLAB allocator.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ed2f9e6
    • Alexander Potapenko's avatar
      kasan: modify kmalloc_large_oob_right(), add kmalloc_pagealloc_oob_right() · e6e8379c
      Alexander Potapenko authored
      
      
      This patchset implements SLAB support for KASAN
      
      Unlike SLUB, SLAB doesn't store allocation/deallocation stacks for heap
      objects, therefore we reimplement this feature in mm/kasan/stackdepot.c.
      The intention is to ultimately switch SLUB to use this implementation as
      well, which will save a lot of memory (right now SLUB bloats each object
      by 256 bytes to store the allocation/deallocation stacks).
      
      Also neither SLUB nor SLAB delay the reuse of freed memory chunks, which
      is necessary for better detection of use-after-free errors.  We
      introduce memory quarantine (mm/kasan/quarantine.c), which allows
      delayed reuse of deallocated memory.
      
      This patch (of 7):
      
      Rename kmalloc_large_oob_right() to kmalloc_pagealloc_oob_right(), as
      the test only checks the page allocator functionality.  Also reimplement
      kmalloc_large_oob_right() so that the test allocates a large enough
      chunk of memory that still does not trigger the page allocator fallback.
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6e8379c
    • Tetsuo Handa's avatar
      include/linux/oom.h: remove undefined oom_kills_count()/note_oom_kill() · aaf4fb71
      Tetsuo Handa authored
      A leftover from commit c32b3cbe
      
       ("oom, PM: make OOM detection in the
      freezer path raceless").
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aaf4fb71
    • Vlastimil Babka's avatar
      mm/page_alloc: prevent merging between isolated and other pageblocks · d9dddbf5
      Vlastimil Babka authored
      Hanjun Guo has reported that a CMA stress test causes broken accounting of
      CMA and free pages:
      
      > Before the test, I got:
      > -bash-4.3# cat /proc/meminfo | grep Cma
      > CmaTotal:         204800 kB
      > CmaFree:          195044 kB
      >
      >
      > After running the test:
      > -bash-4.3# cat /proc/meminfo | grep Cma
      > CmaTotal:         204800 kB
      > CmaFree:         6602584 kB
      >
      > So the freed CMA memory is more than total..
      >
      > Also the the MemFree is more than mem total:
      >
      > -bash-4.3# cat /proc/meminfo
      > MemTotal:       16342016 kB
      > MemFree:        22367268 kB
      > MemAvailable:   22370528 kB
      
      Laura Abbott has confirmed the issue and suspected the freepage accounting
      rewrite around 3.18/4.0 by Joonsoo Kim.  Joonsoo had a theory that this is
      caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
      pageblocks:
      
      > CMA isolates MAX_ORDER aligned blocks, but, during the process,
      > partialy isolated block exists. If MAX_ORDER is 11 and
      > pageblock_order is 9, two pageblocks make up MAX_ORDER
      > aligned block and I can think following scenario because pageblock
      > (un)isolation would be done one by one.
      >
      > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
      > MIGRATE_ISOLATE, respectively.
      >
      > CC -> IC -> II (Isolation)
      > II -> CI -> CC (Un-isolation)
      >
      > If some pages are freed at this intermediate state such as IC or CI,
      > that page could be merged to the other page that is resident on
      > different type of pageblock and it will cause wrong freepage count.
      
      This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
      but since it doesn't hold the zone->lock between pageblocks, a race
      window does exist.
      
      It's also likely that unexpected merging can occur between
      MIGRATE_ISOLATE and non-CMA pageblocks.  This should be prevented in
      __free_one_page() since commit 3c605096 ("mm/page_alloc: restrict
      max order of merging on isolated pageblock").  However, we only check
      the migratetype of the pageblock where buddy merging has been initiated,
      not the migratetype of the buddy pageblock (or group of pageblocks)
      which can be MIGRATE_ISOLATE.
      
      Joonsoo has suggested checking for buddy migratetype as part of
      page_is_buddy(), but that would add extra checks in allocator hotpath
      and bloat-o-meter has shown significant code bloat (the function is
      inline).
      
      This patch reduces the bloat at some expense of more complicated code.
      The buddy-merging while-loop in __free_one_page() is initially bounded
      to pageblock_border and without any migratetype checks.  The checks are
      placed outside, bumping the max_order if merging is allowed, and
      returning to the while-loop with a statement which can't be possibly
      considered harmful.
      
      This fixes the accounting bug and also removes the arguably weird state
      in the original commit 3c605096 where buddies could be left
      unmerged.
      
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Link: https://lkml.org/lkml/2016/3/2/280
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarHanjun Guo <guohanjun@huawei.com>
      Tested-by: default avatarHanjun Guo <guohanjun@huawei.com>
      Acked-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Debugged-by: default avatarLaura Abbott <labbott@redhat.com>
      Debugged-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d9dddbf5
    • Arnd Bergmann's avatar
      drivers/memstick/host/r592.c: avoid gcc-6 warning · f419a08f
      Arnd Bergmann authored
      
      
      The r592 driver relies on behavior of the DMA mapping API that is
      normally observed but not guaranteed by the API.  Instead it uses a
      runtime check to fail transfers if the API ever behaves
      
      When CONFIG_NEED_SG_DMA_LENGTH is not set, one of the checks turns into a
      comparison of a variable with itself, which gcc-6.0 now warns about:
      
      drivers/memstick/host/r592.c: In function 'r592_transfer_fifo_dma':
      drivers/memstick/host/r592.c:302:31: error: self-comparison always evaluates to false [-Werror=tautological-compare]
          (sg_dma_len(&dev->req->sg) < dev->req->sg.length)) {
                                     ^
      
      The check itself is not a problem, so this patch just rephrases the
      condition in a way that gcc does not consider an indication of a mistake.
      We already know that dev->req->sg.length was initially R592_LFIFO_SIZE, so
      we can compare it to that constant again.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Quentin Lambert <lambert.quentin@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f419a08f
    • Xue jiufei's avatar
      ocfs2: extend enough credits for freeing one truncate record while replaying truncate records · 102c2595
      Xue jiufei authored
      
      
      Now function ocfs2_replay_truncate_records() first modifies tl_used,
      then calls ocfs2_extend_trans() to extend transactions for gd and alloc
      inode used for freeing clusters.  jbd2_journal_restart() may be called
      and it may happen that tl_used in truncate log is decreased but the
      clusters are not freed, which means these clusters are lost.  So we
      should avoid extending transactions in these two operations.
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Acked-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      102c2595
    • Xue jiufei's avatar
      ocfs2: extend transaction for ocfs2_remove_rightmost_path() and... · 17215989
      Xue jiufei authored
      
      ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
      
      I found that jbd2_journal_restart() is called in some places without
      keeping things consistently before.  However, jbd2_journal_restart() may
      commit the handle's transaction and restart another one.  If the first
      transaction is committed successfully while another not, it may cause
      filesystem inconsistency or read only.  This is an effort to fix this
      kind of problems.
      
      This patch (of 3):
      
      The following functions will be called while truncating an extent:
      ocfs2_remove_btree_range
        -> ocfs2_start_trans
        -> ocfs2_remove_extent
           -> ocfs2_truncate_rec
             -> ocfs2_extend_rotate_transaction
               -> jbd2_journal_restart if jbd2_journal_extend fail
             -> ocfs2_rotate_tree_left
               -> ocfs2_remove_rightmost_path
                   -> ocfs2_extend_rotate_transaction
                     -> ocfs2_unlink_subtree
                      -> ocfs2_update_edge_lengths
                        -> ocfs2_extend_trans
                          -> jbd2_journal_restart if jbd2_journal_extend fail
        -> ocfs2_et_update_clusters
        -> ocfs2_commit_trans
      
      jbd2_journal_restart() may be called and it may happened that the buffers
      dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in
      ocfs2_et_update_clusters() are not, the total clusters on extent tree and
      i_clusters in ocfs2_dinode is inconsistency.  So the clusters got from
      ocfs2_dinode is incorrect, and it also cause read-only problem when call
      ocfs2_commit_truncate() with the error message: "Inode %llu has empty
      extent block at %llu".
      
      We should extend enough credits for function ocfs2_remove_rightmost_path
      and ocfs2_update_edge_lengths to avoid this inconsistency.
      Signed-off-by: default avatarjoyce.xue <xuejiufei@huawei.com>
      Acked-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      17215989
    • xuejiufei's avatar
      ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert · e5054c9a
      xuejiufei authored
      
      
      We have found a bug when two nodes doing umount one after another.
      
      1) Node 1 migrate a lockres that has 3 locks in grant queue such as
         N2(PR)<->N3(NL)<->N4(PR) to N2.  After migration, lvb of the lock
         N3(NL) and N4(PR) are empty on node 2 because migration target do not
         copy lvb to these two lock.
      
      2) Node 3 want to convert to PR, it can be granted in
         __dlmconvert_master(), and the order of these locks is unchanged.  The
         lvb of the lock N3(PR) on node 2 is copyed from lockres in function
         dlm_update_lvb() while the lvb of lock N4(PR) is still empty.
      
      3) Node 2 want to leave domain, it will migrate this lockres to node 3.
         Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration()
         when adding the lock N4(PR) to mres with the following message because
         the lvb of mres is already copied from lock N3(PR), but the lvb of lock
         N4(PR) is empty.
      
      "Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarxuejiufei <xuejiufei@huawei.com>
      Acked-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5054c9a
    • jiangyiwen's avatar
      ocfs2: solve a problem of crossing the boundary in updating backups · 584dca34
      jiangyiwen authored
      
      
      In update_backups() there exists a problem of crossing the boundary as
      follows:
      
      we assume that lun will be resized to 1TB(cluster_size is 32kb), it will
      include 0~33554431 cluster, in update_backups func, it will backup super
      block in location of 1TB which is the 33554432th cluster, so the
      phenomenon of crossing the boundary happens.
      Signed-off-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      584dca34
    • jiangyiwen's avatar
      ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local · 35ddf78e
      jiangyiwen authored
      
      
      This patch fixes a deadlock, as follows:
      
        Node 1                Node 2                  Node 3
      1)volume a and b are    only mount vol a        only mount vol b
        mounted
      
      2)                      start to mount b        start to mount a
      
      3)                      check hb of Node 3      check hb of Node 2
                              in vol a, qs_holds++    in vol b, qs_holds++
      
      4) -------------------- all nodes' network down --------------------
      
      5)                      progress of mount b     the same situation as
                              failed, and then call   Node 2
                              ocfs2_dismount_volume.
                              but the process is hung,
                              since there is a work
                              in ocfs2_wq cannot beo
                              completed. This work is
                              about vol a, because
                              ocfs2_wq is global wq.
                              BTW, this work which is
                              scheduled in ocfs2_wq is
                              ocfs2_orphan_scan_work,
                              and the context in this work
                              needs to take inode lock
                              of orphan_dir, because
                              lockres owner are Node 1 and
                              all nodes' nework has been down
                              at the same time, so it can't
                              get the inode lock.
      
      6)                      Why can't this node be fenced
                              when network disconnected?
                              Because the process of
                              mount is hung what caused qs_holds
                              is not equal 0.
      
      Because all works in the ocfs2_wq are relative to the super block.
      
      The solution is to change the ocfs2_wq from global to local.  In other
      words, move it into struct ocfs2_super.
      Signed-off-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35ddf78e
    • Joseph Qi's avatar
      ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list · be12b299
      Joseph Qi authored
      
      
      When master handles convert request, it queues ast first and then
      returns status.  This may happen that the ast is sent before the request
      status because the above two messages are sent by two threads.  And
      right after the ast is sent, if master down, it may trigger BUG in
      dlm_move_lockres_to_recovery_list in the requested node because ast
      handler moves it to grant list without clear lock->convert_pending.  So
      remove BUG_ON statement and check if the ast is processed in
      dlmconvert_remote.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reported-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be12b299
    • Joseph Qi's avatar
      ocfs2/dlm: fix race between convert and recovery · ac7cf246
      Joseph Qi authored
      
      
      There is a race window between dlmconvert_remote and
      dlm_move_lockres_to_recovery_list, which will cause a lock with
      OCFS2_LOCK_BUSY in grant list, thus system hangs.
      
      dlmconvert_remote
      {
              spin_lock(&res->spinlock);
              list_move_tail(&lock->list, &res->converting);
              lock->convert_pending = 1;
              spin_unlock(&res->spinlock);
      
              status = dlm_send_remote_convert_request();
              >>>>>> race window, master has queued ast and return DLM_NORMAL,
                     and then down before sending ast.
                     this node detects master down and calls
                     dlm_move_lockres_to_recovery_list, which will revert the
                     lock to grant list.
                     Then OCFS2_LOCK_BUSY won't be cleared as new master won't
                     send ast any more because it thinks already be authorized.
      
              spin_lock(&res->spinlock);
              lock->convert_pending = 0;
              if (status != DLM_NORMAL)
                      dlm_revert_pending_convert(res, lock);
              spin_unlock(&res->spinlock);
      }
      
      In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set
      (res is still in recovering) or res master changed (new master has
      finished recovery), reset the status to DLM_RECOVERING, then it will
      retry convert.
      Signed-off-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Reported-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac7cf246
    • Ryan Ding's avatar
      ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write() · 28888681
      Ryan Ding authored
      
      
      The code should call ocfs2_free_alloc_context() to free meta_ac &
      data_ac before calling ocfs2_run_deallocs().  Because
      ocfs2_run_deallocs() will acquire the system inode's i_mutex hold by
      meta_ac.  So try to release the lock before ocfs2_run_deallocs().
      
      Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Acked-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28888681
    • Ryan Ding's avatar
      ocfs2: fix disk file size and memory file size mismatch · ce170828
      Ryan Ding authored
      
      
      When doing append direct write in an already allocated cluster, and fast
      path in ocfs2_dio_get_block() is triggered, function
      ocfs2_dio_end_io_write() will be skipped as there is no context
      allocated.
      
      As a result, the disk file size will not be changed as it should be.
      The solution is to skip fast path when we are about to change file size.
      
      Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Acked-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce170828
    • Ryan Ding's avatar
      ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write · a86a72a4
      Ryan Ding authored
      
      
      Take ip_alloc_sem to prevent concurrent access to extent tree, which may
      cause the extent tree in an unstable state.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a86a72a4
    • Ryan Ding's avatar
      ocfs2: fix ip_unaligned_aio deadlock with dio work queue · e63890f3
      Ryan Ding authored
      
      
      In the current implementation of unaligned aio+dio, lock order behave as
      follow:
      
      in user process context:
        -> call io_submit()
          -> get i_mutex
      		<== window1
            -> get ip_unaligned_aio
              -> submit direct io to block device
          -> release i_mutex
        -> io_submit() return
      
      in dio work queue context(the work queue is created in __blockdev_direct_IO):
        -> release ip_unaligned_aio
      		<== window2
          -> get i_mutex
            -> clear unwritten flag & change i_size
          -> release i_mutex
      
      There is a limitation to the thread number of dio work queue.  256 at
      default.  If all 256 thread are in the above 'window2' stage, and there
      is a user process in the 'window1' stage, the system will became
      deadlock.  Since the user process hold i_mutex to wait ip_unaligned_aio
      lock, while there is a direct bio hold ip_unaligned_aio mutex who is
      waiting for a dio work queue thread to be schedule.  But all the dio
      work queue thread is waiting for i_mutex lock in 'window2'.
      
      This case only happened in a test which send a large number(more than
      256) of aio at one io_submit() call.
      
      My design is to remove ip_unaligned_aio lock.  Change it to a sync io
      instead.  Just like ip_unaligned_aio lock, serialize the unaligned aio
      dio.
      
      [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e63890f3
    • Ryan Ding's avatar
      ocfs2: code clean up for direct io · f1f973ff
      Ryan Ding authored
      
      
      Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
       * remove append dio check: it will be checked in ocfs2_direct_IO()
       * remove file hole check: file hole is supported for now
       * remove inline data check: it will be checked in ocfs2_direct_IO()
       * remove the full_coherence check when append dio: we will get the
         inode_lock in ocfs2_dio_get_block, there is no need to fall back to
         buffer io to ensure the coherence semantics.
      
      Now the drop dio procedure is gone.  :)
      
      [akpm@linux-foundation.org: remove unused label]
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1f973ff
    • Ryan Ding's avatar
      ocfs2: fix sparse file & data ordering issue in direct io · c15471f7
      Ryan Ding authored
      There are mainly three issues in the direct io code path after commit
      24c40b32
      
       ("ocfs2: implement ocfs2_direct_IO_write"):
      
        * Does not support sparse file.
        * Does not support data ordering.  eg: when write to a file hole, it
          will alloc extent first.  If system crashed before io finished, data
          will corrupt.
        * Potential risk when doing aio+dio.  The -EIOCBQUEUED return value is
          likely to be ignored by ocfs2_direct_IO_write().
      
      To resolve above problems, re-design direct io code with following ideas:
        * Use buffer io to fill in holes.  And this will make better
          performance also.
        * Clear unwritten after direct write finished.  So we can make sure
          meta data changes after data write to disk.  (Unwritten extent is
          invisible to user, from user's view, meta data is not changed when
          allocate an unwritten extent.)
        * Clear ocfs2_direct_IO_write().  Do all ending work in end_io.
      
      This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
      test cases of ltp.
      
      For performance improvement, see following test result:
      ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
      The original way:
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
        16384+0 records in
        16384+0 records out
        4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
      
      After this patch:
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
        1048576+0 records in
        1048576+0 records out
        4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
        + rm /mnt/test.img -f
        + dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
        16384+0 records in
        16384+0 records out
        4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c15471f7
    • Ryan Ding's avatar
      ocfs2: record UNWRITTEN extents when populate write desc · 4506cfb6
      Ryan Ding authored
      
      
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      There is still one issue in the direct write procedure.
      
      phase 1: alloc extent with UNWRITTEN flag
      phase 2: submit direct data to disk, add zero page to page cache
      phase 3: clear UNWRITTEN flag when data has been written to disk
      
      When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
      cluster 0~7KB (cluster size 8KB).  Write request A arrive phase 2 first,
      it will zero the region (4~7KB).  Before request A enter to phase 3,
      request B arrive phase 2, it will zero region (0~3KB).  This is just like
      request B steps request A.
      
      To resolve this issue, we should let request B knows this cluster is already
      under zero, to prevent it from steps the previous write request.
      
      This patch will add function ocfs2_unwritten_check() to do this job.  It
      will record all clusters that are under direct write(it will be recorded
      in the 'ip_unwritten_list' member of inode info), and prevent the later
      direct write writing to the same cluster to do the zero work again.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4506cfb6
    • Ryan Ding's avatar
      ocfs2: return the physical address in ocfs2_write_cluster · 2de6a3c7
      Ryan Ding authored
      
      
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Direct io needs to get the physical address from write_begin, to map the
      user page.  This patch is to change the arg 'phys' of
      ocfs2_write_cluster to a pointer, so it can be retrieved to write_begin.
      And we can retrieve it to the direct io procedure.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2de6a3c7
    • Ryan Ding's avatar
      ocfs2: do not change i_size in write_end for direct io · 46e62556
      Ryan Ding authored
      
      
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Append direct io do not change i_size in get block phase.  It only move
      to orphan when starting write.  After data is written to disk, it will
      delete itself from orphan and update i_size.  So skip i_size change
      section in write_begin for direct io.
      
      And when there is no extents alloc, no meta data changes needed for
      direct io (since write_begin start trans for 2 reason: alloc extents &
      change i_size.  Now none of them needed).  So we can skip start trans
      procedure.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46e62556
    • Ryan Ding's avatar
      ocfs2: test target page before change it · 65c4db8c
      Ryan Ding authored
      
      
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      Direct io data will not appear in buffer.  The w_target_page member will
      not be filled by direct io.  So avoid to use it when it's NULL.  Unlinke
      buffer io and mmap, direct io will call write_begin with more than 1
      page a time.  So the target_index is not sufficient to describe the
      actual data.  change it to a range start at target_index, end in
      end_index.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      65c4db8c
    • Ryan Ding's avatar
      ocfs2: use c_new to indicate newly allocated extents · b46637d5
      Ryan Ding authored
      
      
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      There is a problem in ocfs2's direct io implement: if system crashed
      after extents allocated, and before data return, we will get a extent
      with dirty data on disk.  This problem violate the journal=order
      semantics, which means meta changes take effect after data written to
      disk.  To resolve this issue, direct write can use the UNWRITTEN flag to
      describe a extent during direct data writeback.  The direct write
      procedure should act in the following order:
      
      phase 1: alloc extent with UNWRITTEN flag
      phase 2: submit direct data to disk, add zero page to page cache
      phase 3: clear UNWRITTEN flag when data has been written to disk
      
      This patch is to change the 'c_unwritten' member of
      ocfs2_write_cluster_desc to 'c_clear_unwritten'.  Means whether to clear
      the unwritten flag.  It do not care if a extent is allocated or not.
      And use 'c_new' to specify a newly allocated extent.  So the direct io
      procedure can use c_clear_unwritten to control the UNWRITTEN bit on
      extent.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b46637d5
    • Ryan Ding's avatar
      ocfs2: add ocfs2_write_type_t type to identify the caller of write · c1ad1e3c
      Ryan Ding authored
      
      
      Patchset: fix ocfs2 direct io code patch to support sparse file and data
      ordering semantics
      
      The idea is to use buffer io(more precisely use the interface
      ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
      beyond block size.  And clear UNWRITTEN flag until direct io data has
      been written to disk, which can prevent data corruption when system
      crashed during direct write.
      
      And we will also archive a better performance: eg.  dd direct write new
      file with block size 4KB: before this patchset:
        2.5 MB/s
      after this patchset:
        66.4 MB/s
      
      This patch (of 8):
      
      To support direct io in ocfs2_write_begin_nolock &
      ocfs2_write_end_nolock.
      
      Remove unused args filp & flags.  Add new arg type.  The type is one of
      buffer/direct/mmap.  Indicate 3 way to perform write.  buffer/mmap type
      has implemented.  direct type will be implemented later.
      Signed-off-by: default avatarRyan Ding <ryan.ding@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1ad1e3c
    • Junxiao Bi's avatar
      ocfs2: o2hb: fix double free bug · 9e13f1f9
      Junxiao Bi authored
      This is a regression issue and caused the following kernel panic when do
      ocfs2 multiple test.
      
        BUG: unable to handle kernel paging request at 00000002000800c0
        IP: [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160
        PGD 7bbe5067 PUD 0
        Oops: 0000 [#1] SMP
        Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
        CPU: 2 PID: 4044 Comm: mpirun Not tainted 4.5.0-rc5-next-20160225 #1
        Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
        task: ffff88007a521a80 ti: ffff88007aed0000 task.ti: ffff88007aed0000
        RIP: 0010:[<ffffffff81192978>]  [<ffffffff81192978>] kmem_cache_alloc+0x78/0x160
        RSP: 0018:ffff88007aed3a48  EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000001991
        RDX: 0000000000001990 RSI: 00000000024000c0 RDI: 000000000001b330
        RBP: ffff88007aed3a98 R08: ffff88007d29b330 R09: 00000002000800c0
        R10: 0000000c51376d87 R11: ffff8800792cac38 R12: ffff88007cc30f00
        R13: 00000000024000c0 R14: ffffffff811b053f R15: ffff88007aed3ce7
        FS:  0000000000000000(0000) GS:ffff88007d280000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000002000800c0 CR3: 000000007aeb2000 CR4: 00000000000406e0
        Call Trace:
          __d_alloc+0x2f/0x1a0
          d_alloc+0x17/0x80
          lookup_dcache+0x8a/0xc0
          path_openat+0x3c3/0x1210
          do_filp_open+0x80/0xe0
          do_sys_open+0x110/0x200
          SyS_open+0x19/0x20
          do_syscall_64+0x72/0x230
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84 dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
        RIP   kmem_cache_alloc+0x78/0x160
        CR2: 00000002000800c0
        ---[ end trace 823969e602e4aaac ]---
      
      Fixes: a4a1dfa4
      
      ("ocfs2/cluster: fix memory leak in o2hb_region_release")
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e13f1f9
    • Andrew Morton's avatar
      drivers/input: eliminate INPUT_COMPAT_TEST macro · b8b4ead1
      Andrew Morton authored
      INPUT_COMPAT_TEST became much simpler after commit f4056b52
      
      
      ("input: redefine INPUT_COMPAT_TEST as in_compat_syscall()") so we can
      cleanly eliminate it altogether.
      Acked-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8b4ead1
    • Tetsuo Handa's avatar
      oom, oom_reaper: protect oom_reaper_list using simpler way · bb29902a
      Tetsuo Handa authored
      
      
      "oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
      to protect oom_reaper_list using MMF_OOM_KILLED flag.  But we can do it
      by simply checking tsk->oom_reaper_list != NULL.
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb29902a
    • Michal Hocko's avatar
      oom: make oom_reaper freezable · e2679606
      Michal Hocko authored
      
      
      After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
      address space" oom_reaper will call exit_oom_victim on the target task
      after it is done.  This might however race with the PM freezer:
      
      CPU0				CPU1				CPU2
      freeze_processes
        try_to_freeze_tasks
        				# Allocation request
      				out_of_memory
        oom_killer_disable
      				  wake_oom_reaper(P1)
      				  				__oom_reap_task
      								  exit_oom_victim(P1)
          wait_event(oom_victims==0)
      [...]
          				do_exit(P1)
      				  perform IO/interfere with the freezer
      
      which breaks the oom_killer_disable semantic.  We no longer have a
      guarantee that the oom victim won't interfere with the freezer because
      it might be anywhere on the way to do_exit while the freezer thinks the
      task has already terminated.  It might trigger IO or touch devices which
      are frozen already.
      
      In order to close this race, make the oom_reaper thread freezable.  This
      will work because
      	a) already running oom_reaper will block freezer to enter the
      	   quiescent state
      	b) wake_oom_reaper will not wake up the reaper after it has been
      	   frozen
      	c) the only way to call exit_oom_victim after try_to_freeze_tasks
      	   is from the oom victim's context when we know the further
      	   interference shouldn't be possible
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e2679606
    • Vladimir Davydov's avatar
      oom: make oom_reaper_list single linked · 29c696e1
      Vladimir Davydov authored
      
      
      Entries are only added/removed from oom_reaper_list at head so we can
      use a single linked list and hence save a word in task_struct.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29c696e1
    • Michal Hocko's avatar
      oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task · 855b0183
      Michal Hocko authored
      
      
      Tetsuo has reported that oom_kill_allocating_task=1 will cause
      oom_reaper_list corruption because oom_kill_process doesn't follow
      standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
      the same task multiple times - e.g.  by sacrificing the same child
      multiple times.
      
      This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
      which is set in oom_kill_process atomically and oom reaper is disabled
      if the flag was already set.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      855b0183
    • Michal Hocko's avatar
      mm, oom_reaper: implement OOM victims queuing · 03049269
      Michal Hocko authored
      
      
      wake_oom_reaper has allowed only 1 oom victim to be queued.  The main
      reason for that was the simplicity as other solutions would require some
      way of queuing.  The current approach is racy and that was deemed
      sufficient as the oom_reaper is considered a best effort approach to
      help with oom handling when the OOM victim cannot terminate in a
      reasonable time.  The race could lead to missing an oom victim which can
      get stuck
      
      out_of_memory
        wake_oom_reaper
          cmpxchg // OK
          			oom_reaper
      			  oom_reap_task
      			    __oom_reap_task
      oom_victim terminates
      			      atomic_inc_not_zero // fail
      out_of_memory
        wake_oom_reaper
          cmpxchg // fails
      			  task_to_reap = NULL
      
      This race requires 2 OOM invocations in a short time period which is not
      very likely but certainly not impossible.  E.g.  the original victim
      might have not released a lot of memory for some reason.
      
      The situation would improve considerably if wake_oom_reaper used a more
      robust queuing.  This is what this patch implements.  This means adding
      oom_reaper_list list_head into task_struct (eat a hole before embeded
      thread_struct for that purpose) and a oom_reaper_lock spinlock for
      queuing synchronization.  wake_oom_reaper will then add the task on the
      queue and oom_reaper will dequeue it.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03049269
    • Michal Hocko's avatar
      mm, oom_reaper: report success/failure · bc448e89
      Michal Hocko authored
      
      
      Inform about the successful/failed oom_reaper attempts and dump all the
      held locks to tell us more who is blocking the progress.
      
      [akpm@linux-foundation.org: fix CONFIG_MMU=n build]
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc448e89
    • Michal Hocko's avatar
      oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space · 36324a99
      Michal Hocko authored
      
      
      When oom_reaper manages to unmap all the eligible vmas there shouldn't
      be much of the freable memory held by the oom victim left anymore so it
      makes sense to clear the TIF_MEMDIE flag for the victim and allow the
      OOM killer to select another task.
      
      The lack of TIF_MEMDIE also means that the victim cannot access memory
      reserves anymore but that shouldn't be a problem because it would get
      the access again if it needs to allocate and hits the OOM killer again
      due to the fatal_signal_pending resp.  PF_EXITING check.  We can safely
      hide the task from the OOM killer because it is clearly not a good
      candidate anymore as everyhing reclaimable has been torn down already.
      
      This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
      and thus hold off further global OOM killer actions granted the oom
      reaper is able to take mmap_sem for the associated mm struct.  This is
      not guaranteed now but further steps should make sure that mmap_sem for
      write should be blocked killable which will help to reduce such a lock
      contention.  This is not done by this patch.
      
      Note that exit_oom_victim might be called on a remote task from
      __oom_reap_task now so we have to check and clear the flag atomically
      otherwise we might race and underflow oom_victims or wake up waiters too
      early.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36324a99
    • Michal Hocko's avatar
      mm, oom: introduce oom reaper · aac45363
      Michal Hocko authored
      
      
      This patch (of 5):
      
      This is based on the idea from Mel Gorman discussed during LSFMM 2015
      and independently brought up by Oleg Nesterov.
      
      The OOM killer currently allows to kill only a single task in a good
      hope that the task will terminate in a reasonable time and frees up its
      memory.  Such a task (oom victim) will get an access to memory reserves
      via mark_oom_victim to allow a forward progress should there be a need
      for additional memory during exit path.
      
      It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
      construct workloads which break the core assumption mentioned above and
      the OOM victim might take unbounded amount of time to exit because it
      might be blocked in the uninterruptible state waiting for an event (e.g.
      lock) which is blocked by another task looping in the page allocator.
      
      This patch reduces the probability of such a lockup by introducing a
      specialized kernel thread (oom_reaper) which tries to reclaim additional
      memory by preemptively reaping the anonymous or swapped out memory owned
      by the oom victim under an assumption that such a memory won't be needed
      when its owner is killed and kicked from the userspace anyway.  There is
      one notable exception to this, though, if the OOM victim was in the
      process of coredumping the result would be incomplete.  This is
      considered a reasonable constrain because the overall system health is
      more important than debugability of a particular application.
      
      A kernel thread has been chosen because we need a reliable way of
      invocation so workqueue context is not appropriate because all the
      workers might be busy (e.g.  allocating memory).  Kswapd which sounds
      like another good fit is not appropriate as well because it might get
      blocked on locks during reclaim as well.
      
      oom_reaper has to take mmap_sem on the target task for reading so the
      solution is not 100% because the semaphore might be held or blocked for
      write but the probability is reduced considerably wrt.  basically any
      lock blocking forward progress as described above.  In order to prevent
      from blocking on the lock without any forward progress we are using only
      a trylock and retry 10 times with a short sleep in between.  Users of
      mmap_sem which need it for write should be carefully reviewed to use
      _killable waiting as much as possible and reduce allocations requests
      done with the lock held to absolute minimum to reduce the risk even
      further.
      
      The API between oom killer and oom reaper is quite trivial.
      wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
      NULL->mm transition and oom_reaper clear this atomically once it is done
      with the work.  This means that only a single mm_struct can be reaped at
      the time.  As the operation is potentially disruptive we are trying to
      limit it to the ncessary minimum and the reaper blocks any updates while
      it operates on an mm.  mm_struct is pinned by mm_count to allow parallel
      exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Suggested-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aac45363
    • Andrew Morton's avatar
      sched: add schedule_timeout_idle() · 69b27baf
      Andrew Morton authored
      
      
      This will be needed in the patch "mm, oom: introduce oom reaper".
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69b27baf