1. 20 Feb, 2015 1 commit
  2. 04 Feb, 2015 1 commit
    • Dave Chinner's avatar
      aio: annotate aio_read_event_ring for sleep patterns · 9c9ce763
      Dave Chinner authored
      
      
      Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
      warnings like the following due to being called from wait_event
      context:
      
       WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
       do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110
       Modules linked in:
       CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
        ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
        ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
       Call Trace:
        [<ffffffff81daf2bd>] dump_stack+0x4c/0x65
        [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0
        [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110
        [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90
        [<ffffffff81db8344>] mutex_lock+0x24/0x45
        [<ffffffff81216b7c>] aio_read_events+0x4c/0x290
        [<ffffffff81216fac>] read_events+0x1ec/0x220
        [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110
        [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50
        [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0
        [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17
       ---[ end trace bde69eaf655a4fea ]---
      
      There is not actually a bug here, so annotate the code to tell the
      debug logic that everything is just fine and not to fire a false
      positive.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      9c9ce763
  3. 20 Jan, 2015 2 commits
  4. 13 Dec, 2014 2 commits
    • Fam Zheng's avatar
      aio: Skip timer for io_getevents if timeout=0 · 5f785de5
      Fam Zheng authored
      
      
      In this case, it is basically a polling. Let's not involve timer at all
      because that would hurt performance for application event loops.
      
      In an arbitrary test I've done, io_getevents syscall elapsed time
      reduces from 50000+ nanoseconds to a few hundereds.
      Signed-off-by: default avatarFam Zheng <famz@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      5f785de5
    • Pavel Emelyanov's avatar
      aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov authored
      
      
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is rings set up with io_setup() call. There's (almost) no problems
      in dumping them, there's a problem restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on wrong (old) address.
      This will result in a) aio ring remaining in memory and b) some other
      vma get unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Acked-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
  5. 06 Nov, 2014 1 commit
    • Gu Zheng's avatar
      aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer · 835f252c
      Gu Zheng authored
      https://bugzilla.kernel.org/show_bug.cgi?id=86831
      
      
      
      Markus reported that when shutting down mysqld (with AIO support,
      on a ext3 formatted Harddrive) leads to a negative number of dirty pages
      (underrun to the counter). The negative number results in a drastic reduction
      of the write performance because the page cache is not used, because the kernel
      thinks it is still 2 ^ 32 dirty pages open.
      
      Add a warn trace in __dec_zone_state will catch this easily:
      
      static inline void __dec_zone_state(struct zone *zone, enum
      	zone_stat_item item)
      {
           atomic_long_dec(&zone->vm_stat[item]);
      +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
      	atomic_long_read(&zone->vm_stat[item]) < 0);
           atomic_long_dec(&vm_stat[item]);
      }
      
      [   21.341632] ------------[ cut here ]------------
      [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
      cancel_dirty_page+0x164/0x224()
      [   21.355296] Modules linked in: wutbox_cp sata_mv
      [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
      [   21.366793] Workqueue: events free_ioctx
      [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
      (show_stack+0x20/0x24)
      [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
      (dump_stack+0x24/0x28)
      [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
      (warn_slowpath_common+0x84/0x9c)
      [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
      (warn_slowpath_null+0x2c/0x34)
      [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
      (cancel_dirty_page+0x164/0x224)
      [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
      (truncate_inode_page+0x8c/0x158)
      [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
      (truncate_inode_pages_range+0x11c/0x53c)
      [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
      [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
      [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
      (truncate_setsize+0x5c/0x74)
      [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
      (put_aio_ring_file.isra.14+0x34/0x90)
      [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
      [<c013b424>] (aio_free_ring+0x20/0xcc)
      [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
      (free_ioctx+0x24/0x44)
      [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
      (process_one_work+0x134/0x47c)
      [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
      (worker_thread+0x130/0x414)
      [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
      (kthread+0xd4/0xec)
      [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
      (ret_from_fork+0x14/0x20)
      [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
      
      The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
      (bypasses the VFS dirty pages increment) when init, and aio fs uses
      *default_backing_dev_info* as the backing dev, which does not disable
      the dirty pages accounting capability.
      So truncating aio ring file will contribute to accounting dirty pages (VFS
      dirty pages decrement), then error occurs.
      
      The original goal is keeping these pages in memory (can not be reclaimed
      or swapped) in life-time via marking it dirty. But thinking more, we have
      already pinned pages via elevating the page's refcount, which can already
      achieve the goal, so the SetPageDirty seems unnecessary.
      
      In order to fix the issue, using the __set_page_dirty_no_writeback instead
      of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
      set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).
      
      With the above change, the dirty pages accounting can work well. But as we
      known, aio fs is an anonymous one, which should never cause any real write-back,
      we can ignore the dirty pages (write back) accounting by disabling the dirty
      pages (write back) accounting capability. So we introduce an aio private
      backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
      replace the default one.
      Reported-by: default avatarMarkus Königshaus <m.koenigshaus@wut.de>
      Signed-off-by: default avatarGu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      835f252c
  6. 24 Sep, 2014 1 commit
    • Tejun Heo's avatar
      percpu_ref: add PERCPU_REF_INIT_* flags · 2aad2a86
      Tejun Heo authored
      
      
      With the recent addition of percpu_ref_reinit(), percpu_ref now can be
      used as a persistent switch which can be turned on and off repeatedly
      where turning off maps to killing the ref and waiting for it to drain;
      however, there currently isn't a way to initialize a percpu_ref in its
      off (killed and drained) state, which can be inconvenient for certain
      persistent switch use cases.
      
      Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
      selection of operation mode; however, currently a newly initialized
      percpu_ref is always in percpu mode making it impossible to avoid the
      latency overhead of switching to atomic mode.
      
      This patch adds @flags to percpu_ref_init() and implements the
      following flags.
      
      * PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
      * PERCPU_REF_INIT_DEAD		: start ref killed and drained
      
      These flags should be able to serve the above two use cases.
      
      v2: target_core_tpg.c conversion was missing.  Fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      2aad2a86
  7. 08 Sep, 2014 1 commit
    • Tejun Heo's avatar
      percpu-refcount: add @gfp to percpu_ref_init() · a34375ef
      Tejun Heo authored
      
      
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_refs too.
      
      This patch doesn't make any functional difference.
      
      v2: blk-mq conversion was missing.  Updated.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      a34375ef
  8. 04 Sep, 2014 1 commit
  9. 02 Sep, 2014 1 commit
    • Jeff Moyer's avatar
      aio: add missing smp_rmb() in read_events_ring · 2ff396be
      Jeff Moyer authored
      
      
      We ran into a case on ppc64 running mariadb where io_getevents would
      return zeroed out I/O events.  After adding instrumentation, it became
      clear that there was some missing synchronization between reading the
      tail pointer and the events themselves.  This small patch fixes the
      problem in testing.
      
      Thanks to Zach for helping to look into this, and suggesting the fix.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      2ff396be
  10. 24 Aug, 2014 1 commit
    • Benjamin LaHaise's avatar
      aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise authored
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generate, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c
      
       .
      Reported-by: default avatarDan Aloni <dan@kernelim.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Acked-by: default avatarDan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38
      
       was backported to
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  11. 24 Jul, 2014 4 commits
  12. 22 Jul, 2014 1 commit
  13. 14 Jul, 2014 1 commit
  14. 28 Jun, 2014 2 commits
    • Tejun Heo's avatar
      percpu-refcount: require percpu_ref to be exited explicitly · 9a1049da
      Tejun Heo authored
      
      
      Currently, a percpu_ref undoes percpu_ref_init() automatically by
      freeing the allocated percpu area when the percpu_ref is killed.
      While seemingly convenient, this has the following niggles.
      
      * It's impossible to re-init a released reference counter without
        going through re-allocation.
      
      * In the similar vein, it's impossible to initialize a percpu_ref
        count with static percpu variables.
      
      * We need and have an explicit destructor anyway for failure paths -
        percpu_ref_cancel_init().
      
      This patch removes the automatic percpu counter freeing in
      percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
      generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
      is considered but it gets confusing with percpu_ref_kill() while
      "exit" clearly indicates that it's the counterpart of
      percpu_ref_init().
      
      All percpu_ref_cancel_init() users are updated to invoke
      percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
      added to the destruction path of all percpu_ref users.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Cc: Li Zefan <lizefan@huawei.com>
      9a1049da
    • Tejun Heo's avatar
      percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() · 55c6c814
      Tejun Heo authored
      
      
      ioctx_alloc() reaches inside percpu_ref and directly frees
      ->pcpu_count in its failure path, which is quite gross.  percpu_ref
      has been providing a proper interface to do this,
      percpu_ref_cancel_init(), for quite some time now.  Let's use that
      instead.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Kent Overstreet <kmo@daterainc.com>
      55c6c814
  15. 24 Jun, 2014 4 commits
    • Oleg Nesterov's avatar
      aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() · 855ef0de
      Oleg Nesterov authored
      
      
      ioctx_add_table() is the writer, it does not need rcu_read_lock() to
      protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just
      add the confusion.
      
      And it doesn't need rcu_dereference() by the same reason, it must see
      any updates previously done under the same ->ioctx_lock. We could use
      rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
      the function is simple enough.
      
      The same for kill_ioctx(), although it does not update the pointer.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      855ef0de
    • Oleg Nesterov's avatar
      aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() · 4b70ac5f
      Oleg Nesterov authored
      
      
      On 04/30, Benjamin LaHaise wrote:
      >
      > > -		ctx->mmap_size = 0;
      > > -
      > > -		kill_ioctx(mm, ctx, NULL);
      > > +		if (ctx) {
      > > +			ctx->mmap_size = 0;
      > > +			kill_ioctx(mm, ctx, NULL);
      > > +		}
      >
      > Rather than indenting and moving the two lines changing mmap_size and the
      > kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
      > the number of lines changed and avoid excessive indentation.
      
      OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
      of course, I won't argue.
      
      The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
      we reset mmap_size only for kill_ioctx(). But feel free to remove this change.
      
      -------------------------------------------------------------------------------
      Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
      
      1. We can read ->ioctx_table only once and we do not read rcu_read_lock()
         or even rcu_dereference().
      
         This mm has no users, nobody else can play with ->ioctx_table. Otherwise
         the code is buggy anyway, if we need rcu_read_lock() in a loop because
         ->ioctx_table can be updated then kfree(table) is obviously wrong.
      
      2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid
         munmap(), but another reason is that we simply can't do vm_munmap() unless
         current->mm == mm and this is not true in general, the caller is mmput().
      
      3. We do not really need to nullify mm->ioctx_table before return, probably
         the current code does this to catch the potential problems. But in this
         case RCU_INIT_POINTER(NULL) looks better.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      4b70ac5f
    • Benjamin LaHaise's avatar
      aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 · edfbbf38
      Benjamin LaHaise authored
      A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
      by commit a31ad380
      
      .  The changes made to
      aio_read_events_ring() failed to correctly limit the index into
      ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
      an arbitrary page with a copy_to_user() to copy the contents into userspace.
      This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
      Petr for disclosing this issue.
      
      This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@vger.kernel.org
      edfbbf38
    • Benjamin LaHaise's avatar
      aio: fix aio request leak when events are reaped by userspace · f8567a38
      Benjamin LaHaise authored
      
      
      The aio cleanups and optimizations by kmo that were merged into the 3.10
      tree added a regression for userspace event reaping.  Specifically, the
      reference counts are not decremented if the event is reaped in userspace,
      leading to the application being unable to submit further aio requests.
      This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
      This issue was uncovered as part of CVE-2014-0206.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      f8567a38
  16. 06 May, 2014 1 commit
    • Al Viro's avatar
      new methods: ->read_iter() and ->write_iter() · 293bc982
      Al Viro authored
      
      
      Beginning to introduce those.  Just the callers for now, and it's
      clumsier than it'll eventually become; once we finish converting
      aio_read and aio_write instances, the things will get nicer.
      
      For now, these guys are in parallel to ->aio_read() and ->aio_write();
      they take iocb and iov_iter, with everything in iov_iter already
      validated.  File offset is passed in iocb->ki_pos, iov/nr_segs -
      in iov_iter.
      
      Main concerns in that series are stack footprint and ability to
      split the damn thing cleanly.
      
      [fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      293bc982
  17. 01 May, 2014 1 commit
  18. 29 Apr, 2014 2 commits
  19. 22 Apr, 2014 1 commit
  20. 16 Apr, 2014 1 commit
    • Anatol Pomozov's avatar
      aio: block io_destroy() until all context requests are completed · e02ba72a
      Anatol Pomozov authored
      deletes aio context and all resources related to. It makes sense that
      no IO operations connected to the context should be running after the context
      is destroyed. As we removed io_context we have no chance to
      get requests status or call io_getevents().
      
      man page for io_destroy says that this function may block until
      all context's requests are completed. Before kernel 3.11 io_destroy()
      blocked indeed, but since aio refactoring in 3.11 it is not true anymore.
      
      Here is a pseudo-code that shows a testcase for a race condition discovered
      in 3.11:
      
        initialize io_context
        io_submit(read to buffer)
        io_destroy()
      
        // context is destroyed so we can free the resources
        free(buffers);
      
        // if the buffer is allocated by some other user he'll be surprised
        // to learn that the buffer still filled by an outstanding operation
        // from the destroyed io_context
      
      The fix is straight-forward - add a completion struct and wait on it
      in io_destroy, complete() should be called when number of in-fligh requests
      reaches zero.
      
      If two or more io_destroy() called for the same context simultaneously then
      only the first one waits for IO completion, other calls behaviour is undefined.
      
      Tested: ran http://pastebin.com/LrPsQ4RL
      
       testcase for several hours and
        do not see the race condition anymore.
      Signed-off-by: default avatarAnatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      e02ba72a
  21. 28 Mar, 2014 1 commit
    • Benjamin LaHaise's avatar
      aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration · fa8a53c3
      Benjamin LaHaise authored
      
      
      As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues
      exist in the aio ring page migration support.
      
      As a result, for example, we have the following problem:
      
                  thread 1                      |              thread 2
                                                |
      aio_migratepage()                         |
       |-> take ctx->completion_lock            |
       |-> migrate_page_copy(new, old)          |
       |   *NOW*, ctx->ring_pages[idx] == old   |
                                                |
                                                |    *NOW*, ctx->ring_pages[idx] == old
                                                |    aio_read_events_ring()
                                                |     |-> ring = kmap_atomic(ctx->ring_pages[0])
                                                |     |-> ring->head = head;          *HERE, write to the old ring page*
                                                |     |-> kunmap_atomic(ring);
                                                |
       |-> ctx->ring_pages[idx] = new           |
       |   *BUT NOW*, the content of            |
       |    ring_pages[idx] is old.             |
       |-> release ctx->completion_lock         |
      
      As above, the new ring page will not be updated.
      
      Fix this issue, as well as prevent races in aio_ring_setup() by holding
      the ring_lock mutex during kioctx setup and page migration.  This avoids
      the overhead of taking another spinlock in aio_read_events_ring() as Tang's
      and Gu's original fix did, pushing the overhead into the migration code.
      
      Note that to handle the nesting of ring_lock inside of mmap_sem, the
      migratepage operation uses mutex_trylock().  Page migration is not a 100%
      critical operation in this case, so the ocassional failure can be
      tolerated.  This issue was reported by Sasha Levin.
      
      Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
      Instead, make page migration fully serialised by mapping->private_lock, and
      have aio_free_ring() simply disconnect the kioctx from the mapping by calling
      put_aio_ring_file() before touching ctx->ring_pages[].  This simplifies the
      error handling logic in aio_migratepage(), and should improve robustness.
      
      v4: always do mutex_unlock() in cases when kioctx setup fails.
      Reported-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
      fa8a53c3
  22. 22 Dec, 2013 1 commit
    • Linus Torvalds's avatar
      aio: clean up and fix aio_setup_ring page mapping · 3dc9acb6
      Linus Torvalds authored
      Since commit 36bc08cc ("fs/aio: Add support to aio ring pages
      migration") the aio ring setup code has used a special per-ring backing
      inode for the page allocations, rather than just using random anonymous
      pages.
      
      However, rather than remembering the pages as it allocated them, it
      would allocate the pages, insert them into the file mapping (dirty, so
      that they couldn't be free'd), and then forget about them.  And then to
      look them up again, it would mmap the mapping, and then use
      "get_user_pages()" to get back an array of the pages we just created.
      
      Now, not only is that incredibly inefficient, it also leaked all the
      pages if the mmap failed (which could happen due to excessive number of
      mappings, for example).
      
      So clean it all up, making it much more straightforward.  Also remove
      some left-overs of the previous (broken) mm_populate() usage that was
      removed in commit d6c355c7
      
       ("aio: fix race in ring buffer page
      lookup introduced by page migration support") but left the pointless and
      now misleading MAP_POPULATE flag around.
      Tested-and-acked-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3dc9acb6
  23. 21 Dec, 2013 2 commits
    • Benjamin LaHaise's avatar
      aio/migratepages: make aio migrate pages sane · 8e321fef
      Benjamin LaHaise authored
      
      
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      8e321fef
    • Benjamin LaHaise's avatar
      aio: fix kioctx leak introduced by "aio: Fix a trinity splat" · 1881686f
      Benjamin LaHaise authored
      e34ecee2
      
       reworked the percpu reference
      counting to correct a bug trinity found.  Unfortunately, the change lead
      to kioctxes being leaked because there was no final reference count to
      put.  Add that reference count back in to fix things.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      1881686f
  24. 06 Dec, 2013 1 commit
  25. 19 Nov, 2013 2 commits
  26. 13 Nov, 2013 1 commit
  27. 09 Nov, 2013 1 commit
  28. 11 Oct, 2013 1 commit
    • Kent Overstreet's avatar
      aio: Fix a trinity splat · e34ecee2
      Kent Overstreet authored
      
      
      aio kiocb refcounting was broken - it was relying on keeping track of
      the number of available ring buffer entries, which it needs to do
      anyways; then at shutdown time it'd wait for completions to be delivered
      until the # of available ring buffer entries equalled what it was
      initialized to.
      
      Problem with  that is that the ring buffer is mapped writable into
      userspace, so userspace could futz with the head and tail pointers to
      cause the kernel to see extra completions, and cause free_ioctx() to
      return while there were still outstanding kiocbs. Which would be bad.
      
      Fix is just to directly refcount the kiocbs - which is more
      straightforward, and with the new percpu refcounting code doesn't cost
      us any cacheline bouncing which was the whole point of the original
      scheme.
      
      Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't
      subtracting from aio_nr if ioctx_add_table() failed.
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      e34ecee2