1. 24 Oct, 2011 1 commit
  2. 28 Sep, 2011 1 commit
    • block: Free queue resources at blk_release_queue() · 777eb1bf
      Hannes Reinecke authored
      
      
      A kernel crash is observed when a mounted ext3/ext4 filesystem is
      physically removed. The problem is that blk_cleanup_queue() frees up
      some resources, e.g. by calling elevator_exit(), whose absence is not
      checked for in normal operation. So we should rather move these calls
      to the destructor function blk_release_queue(), as at that point all
      remaining references are gone. However, in doing so we have to ensure
      that any externally supplied queue_lock is disconnected, as the driver
      might free up the lock after the call to blk_cleanup_queue().
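
      As an illustration, here is a minimal sketch of the queue_lock
      disconnect (assuming the queue's internal __queue_lock serves as the
      fallback; the actual hunk may differ):

              spinlock_t *lock = q->queue_lock;

              spin_lock_irq(lock);
              if (q->queue_lock != &q->__queue_lock)
                      /* stop referencing the driver-supplied lock */
                      q->queue_lock = &q->__queue_lock;
              spin_unlock_irq(lock);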
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 24 Aug, 2011 2 commits
  4. 15 Aug, 2011 1 commit
    • block: fix flush machinery for stacking drivers with differing flush flags · 4853abaa
      Jeff Moyer authored
      Commit ae1b1539, "block: reimplement FLUSH/FUA to support merge",
      introduced a performance regression when
      running any sort of fsyncing workload using dm-multipath and certain
      storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
      dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
      that dm-multipath always advertised flush+fua support, and passed
      commands on down the stack, where those flags used to get stripped off.
      The above commit changed that behavior:
      
      static inline struct request *__elv_next_request(struct request_queue *q)
      {
              struct request *rq;
      
              while (1) {
      -               while (!list_empty(&q->queue_head)) {
      +               if (!list_empty(&q->queue_head)) {
                              rq = list_entry_rq(q->queue_head.next);
      -                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
      -                           (rq->cmd_flags & REQ_FLUSH_SEQ))
      -                               return rq;
      -                       rq = blk_do_flush(q, rq);
      -                       if (rq)
      -                               return rq;
      +                       return rq;
                      }
      
      Note that previously, a command would come in here, have
      REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:
      
      struct request *blk_do_flush(struct request_queue *q, struct request *rq)
      {
              unsigned int fflags = q->flush_flags; /* may change, cache it */
              bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
              bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
              bool do_postflush = has_flush && !has_fua &&
                                  (rq->cmd_flags & REQ_FUA);
              unsigned skip = 0;
      ...
              if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                      rq->cmd_flags &= ~REQ_FLUSH;
                      if (!has_fua)
                              rq->cmd_flags &= ~REQ_FUA;
                      return rq;
              }
      
      So, the flush machinery was bypassed in such cases (q->flush_flags == 0
      && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
      
      Now, however, we don't get into the flush machinery at all.  Instead,
      __elv_next_request just hands a request with flush and fua bits set to
      the scsi_request_fn, even if the underlying request_queue does not
      support flush or fua.
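
      To illustrate the intended behaviour, here is a rough sketch (not the
      actual patch) of the masking a stacking-aware flush machinery has to
      apply before passing a request to a queue that lacks flush/FUA support:

              unsigned int fflags = q->flush_flags;

              if (!(fflags & REQ_FLUSH))      /* no volatile cache below us */
                      rq->cmd_flags &= ~REQ_FLUSH;
              if (!(fflags & REQ_FUA))        /* FUA unsupported: drop or emulate */
                      rq->cmd_flags &= ~REQ_FUA;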
      
      The agreed upon approach is to fix the flush machinery to allow
      stacking.  While this isn't used in practice (since there is only one
      request-based dm target, and that target will now reflect the flush
      flags of the underlying device), it does future-proof the solution and
      makes it function as designed.
      
      In order to make this work, I had to add a field to the struct request,
      inside the flush structure (to store the original req->end_io).  Shaohua
      had suggested overloading the union with rb_node and completion_data,
      but the completion data is used by device mapper and can also be used by
      other drivers.  So, I didn't see a way around the additional field.
      
      I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
      the lost performance.  Comments and other testers, as always, are
      appreciated.
      
      Cheers,
      Jeff
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  5. 04 Aug, 2011 1 commit
    • fault-injection: add ability to export fault_attr in arbitrary directory · dd48c085
      Akinobu Mita authored
      
      
      init_fault_attr_dentries() is used to export fault_attr via debugfs,
      but it can only export it in the debugfs root directory.
      
      Per Forlin is working on mmc_fail_request, which adds support for
      injecting data errors after a completed host transfer in the MMC
      subsystem.
      
      The fault_attr for mmc_fail_request should be defined per mmc host and
      exported in a per-host debugfs directory, e.g.
      /sys/kernel/debug/mmc0/mmc_fail_request.
      
      init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
      introduces fault_create_debugfs_attr(), which can create the attribute
      directory under an arbitrary parent directory, and replaces
      init_fault_attr_dentries().
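
      For example, a sketch of how a per-host attribute could then be created
      (the "mmc0" directory and fail_attr variable are illustrative, not from
      the patch):

              static struct fault_attr fail_attr = FAULT_ATTR_INITIALIZER;

              struct dentry *host_dir = debugfs_create_dir("mmc0", NULL);
              struct dentry *fail_dir = fault_create_debugfs_attr(
                              "mmc_fail_request", host_dir, &fail_attr);
              if (IS_ERR(fail_dir))
                      pr_warn("failed to create fault attr\n");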
      
      [akpm@linux-foundation.org: extraneous semicolon, per Randy]
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Tested-by: Per Forlin <per.forlin@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 26 Jul, 2011 2 commits
    • fail_make_request: cleanup should_fail_request · b2c9cd37
      Akinobu Mita authored
      
      
      This changes should_fail_request() into a more usable wrapper around
      should_fail().  It avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
      the middle of a function.
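
      A sketch of the resulting shape (the stub compiles away when the option
      is off, so callers need no #ifdef; details may differ from the patch):

      #ifdef CONFIG_FAIL_MAKE_REQUEST
      static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
      {
              return part->make_it_fail && should_fail(&fail_make_request, bytes);
      }
      #else /* CONFIG_FAIL_MAKE_REQUEST */
      static inline bool should_fail_request(struct hd_struct *part,
                                             unsigned int bytes)
      {
              return false;
      }
      #endif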
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: fix warning with calling smp_processor_id() in preemptible section · 11ccf116
      Jens Axboe authored
      Commit 5757a6d7 introduced an unsafe call of smp_processor_id(); with
      preempt debugging turned on we spew a lot of:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
      caller is __make_request+0x1b8/0x308
      [<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0)
      [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308)
      [<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558)
      [<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138)
      [<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c)
      [<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8)
      [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688)
      [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224)
      [<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94)
      [<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8)
      
      Fix this by just using raw_smp_processor_id(); it's just a hint
      after all. There's no pinning of the CPU or access to per-cpu
      structures involved.
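
      For reference, a sketch of the kind of one-line change described (the
      exact hunk may differ):

      -               req->cpu = smp_processor_id();
      +               req->cpu = raw_smp_processor_id();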
      Reported-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  7. 23 Jul, 2011 1 commit
    • block: strict rq_affinity · 5757a6d7
      Dan Williams authored
      
      
      Some systems benefit from completions always being steered to the strict
      requester cpu rather than the looser "per-socket" steering that
      blk_cpu_to_group() attempts by default. This is because the first
      CPU in the group mask ends up being completely overloaded with work,
      while the others (including the original submitter) have power left
      to spare.
      
      Allow the strict mode to be set by writing '2' to the sysfs control
      file. This is identical to the scheme used for the nomerges file,
      where '2' is a more aggressive setting than just being turned on.
      
      echo 2 > /sys/block/<bdev>/queue/rq_affinity
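
      A rough sketch of the resulting completion-steering decision (assuming
      the '2' setting maps to a QUEUE_FLAG_SAME_FORCE queue flag; details may
      differ from the patch):

              ccpu = req->cpu;
              if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
                      ccpu = blk_cpu_to_group(ccpu);  /* loose: same group */
              /* else strict: complete on the exact submitting CPU */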
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Roland Dreier <roland@purestorage.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  8. 21 Jul, 2011 1 commit
    • [SCSI] fix crash in scsi_dispatch_cmd() · bfe159a5
      James Bottomley authored
      
      
      USB surprise removal of sr is triggering an oops in
      scsi_dispatch_cmd().  What seems to be happening is that USB is
      hanging on to a queue reference until the last close of the upper
      device, so the crash is caused by surprise remove of a mounted CD
      followed by attempted unmount.
      
      The problem is that USB doesn't issue its final commands as part of
      the SCSI teardown path, but on last close when the block queue is long
      gone.  The long term fix is probably to make sr do the teardown in the
      same way as sd (so remove all the lower bits on ejection, but keep the
      upper disk alive until last close of user space).  However, the
      current oops can be simply fixed by not allowing any commands to be
      sent to a dead queue.
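
      A minimal sketch of the guard described above (using the existing
      QUEUE_FLAG_DEAD test; placement in the real patch may differ):

              if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
                      return BLKPREP_KILL;    /* refuse commands for a dead queue */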
      
      Cc: stable@kernel.org
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
  9. 08 Jul, 2011 1 commit
    • block: avoid building too big plug list · 55c022bb
      Shaohua Li authored
      
      
      When I test a fio script with a big I/O depth, I found the total
      throughput drops compared to some relatively small I/O depths. The
      reason is that the thread accumulates big requests in its plug list,
      which causes some delays (surely this depends on CPU speed).
      I thought we'd better have a threshold for requests. When the threshold
      is reached, it means there is no request merging and queue lock
      contention isn't severe when pushing per-task requests to the queue, so
      the main advantages of the blk plug don't exist. We can force a plug
      list flush in this case.
      With this, my test throughput actually increases and is almost equal to
      that at small I/O depths. Another side effect is that irq-off time
      decreases in blk_flush_plug_list() for big I/O depths.
      BLK_MAX_REQUEST_COUNT is chosen arbitrarily, but 16 was effective at
      reducing lock contention for me. I'm open here, though; 32 was OK in my
      test too.
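
      A sketch of the threshold check (close to the idea described; not
      necessarily the exact patch):

      #define BLK_MAX_REQUEST_COUNT 16

              list_add_tail(&req->queuelist, &plug->list);
              plug->count++;
              if (plug->count >= BLK_MAX_REQUEST_COUNT)
                      blk_flush_plug_list(plug, false);  /* force a flush */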
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  10. 27 May, 2011 1 commit
  11. 26 May, 2011 1 commit
  12. 23 May, 2011 1 commit
  13. 20 May, 2011 2 commits
  14. 18 May, 2011 1 commit
    • block: don't delay blk_run_queue_async · 3ec717b7
      Shaohua Li authored
      
      
      Let's check a scenario:
      1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
      2. blk_run_queue_async();
      The second call becomes a no-op, because q->delay_work already has
      WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after
      SCSI_QUEUE_DELAY. But blk_run_queue_async actually expects the delayed
      work to run immediately.
      
      Fix this by doing a cancel on potentially pending delayed work
      before queuing an immediate run of the workqueue.
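
      A sketch of the fixed helper (assuming the 2011-era __cancel_delayed_work();
      the actual patch may differ in detail):

      void blk_run_queue_async(struct request_queue *q)
      {
              if (likely(!blk_queue_stopped(q))) {
                      /* drop any pending delayed run first */
                      __cancel_delayed_work(&q->delay_work);
                      queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
              }
      }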
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  15. 19 Apr, 2011 2 commits
  16. 18 Apr, 2011 7 commits
  17. 16 Apr, 2011 1 commit
  18. 15 Apr, 2011 2 commits
  19. 14 Apr, 2011 1 commit
  20. 12 Apr, 2011 6 commits
  21. 11 Apr, 2011 1 commit
    • block: splice plug list to local context · 109b8129
      NeilBrown authored
      
      
      If the request_fn ends up blocking, we could be re-entering
      the plug flush. Since the list is protected by explicitly
      not allowing schedule events, this isn't a terribly good idea.
      
      Additionally, it can cause us to recurse. As request_fn called by
      __blk_run_queue is allowed to 'schedule()' (after dropping the queue
      lock of course), it is possible to get a recursive call:
      
       schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
            -> __blk_run_queue -> request_fn -> schedule
      
      We must make sure that the second schedule does not call into
      blk_flush_plug again.  So instead of leaving the list of requests on
      blk_plug->list, move them to a separate list leaving blk_plug->list
      empty.
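
      A sketch of the splice (a local list on the stack, as in
      blk_flush_plug_list):

              LIST_HEAD(list);

              list_splice_init(&plug->list, &list);  /* plug->list is now empty */

              while (!list_empty(&list)) {
                      rq = list_entry_rq(list.next);
                      list_del_init(&rq->queuelist);
                      /* ... queue rq; a nested schedule() now sees an empty plug ... */
              }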
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  22. 05 Apr, 2011 2 commits
  23. 31 Mar, 2011 1 commit