1. 15 Jun, 2012 1 commit
    • Asias He's avatar
      block: Avoid missed wakeup in request waitqueue · 458f27a9
      Asias He authored
      After hot-unplug a stressed disk, I found that rl->wait[] is not empty
      while rl->count[] is empty and there are theads still sleeping on
      get_request after the queue cleanup. With simple debug code, I found
      there are exactly nr_sleep - nr_wakeup of theads in D state. So there
      are missed wakeup.
        $ dmesg | grep nr_sleep
        [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
        $ vmstat 1
        1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0
      To quote Tejun:
        Ah, okay, freed_request() wakes up single waiter with the assumption
        that after the wakeup there will at least be one successful allocation
        which in turn will continue the wakeup chain until the wait list is
        empty - ie. waiter wakeup is dependent on successful request
        allocation happening after each wakeup.  With queue marked dead, any
        woken up waiter fails the allocation path, so the wakeup chaining is
        lost and we're left with hung waiters. What we need is wake_up_all()
        after drain completion.
      This patch fixes the missed wakeup by waking up all the theads which
      are sleeping on wait queue after queue drain.
      Changes in v2: Drop waitqueue_active() optimization
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAsias He <asias@redhat.com>
      Fixed a bug by me, where stacked devices would oops on calling
      blk_drain_queue() since ->rq.wait[] do not get initialized unless
      it's a full queue setup.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  2. 20 Apr, 2012 4 commits
    • Tejun Heo's avatar
      block: fix elvpriv allocation failure handling · aaf7c680
      Tejun Heo authored
      Request allocation is mempool backed to guarantee forward progress
      under memory pressure; unfortunately, this property got broken while
      adding elvpriv data.  Failures during elvpriv allocation, including
      ioc and icq creation failures, currently make get_request() fail as
      whole.  There's no forward progress guarantee for these allocations -
      they may fail indefinitely under memory pressure stalling IO and
      deadlocking the system.
      This patch updates get_request() such that elvpriv allocation failure
      doesn't make the whole function fail.  If elvpriv allocation fails,
      the allocation is degraded into !ELVPRIV.  This will force the request
      to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
      failures should be rare (nothing is per-request) and anything is
      better than deadlocking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: collapse blk_alloc_request() into get_request() · 29e2b09a
      Tejun Heo authored
      Allocation failure handling in get_request() is about to be updated.
      To ease the update, collapse blk_alloc_request() into get_request().
      This patch doesn't introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: make request_queue bypassing on allocation · b82d4b19
      Tejun Heo authored
      With the previous change to guarantee bypass visiblity for RCU read
      lock regions, entering bypass mode involves non-trivial overhead and
      future changes are scheduled to make use of bypass mode during init
      path.  Combined it may end up adding noticeable delay during boot.
      This patch makes request_queue start its life in bypass mode, which is
      ended on queue init completion at the end of
      blk_init_allocated_queue(), and updates blk_queue_bypass_start() such
      that draining and RCU synchronization are performed only when the
      queue actually enters bypass mode.
      This avoids unnecessarily switching in and out of bypass mode during
      init avoiding the overhead and any nasty surprises which may step from
      leaving bypass mode on half-initialized queues.
      The boot time overhead was pointed out by Vivek.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing · 80fd9979
      Tejun Heo authored
      Currently, blkg_lookup() doesn't check @q bypass state.  This patch
      updates blk_queue_bypass_start() to do synchronize_rcu() before
      returning and updates blkg_lookup() to check blk_queue_bypass() and
      return %NULL if bypassing.  This ensures blkg_lookup() returns %NULL
      if @q is bypassing.
      This is to guarantee that nobody is accessing policy data while @q is
      bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in
      place on policy [de]activation.
      v2: Added more comments explaining bypass guarantees as suggested by
      v3: Added more comments explaining why there's no synchronize_rcu() in
          blk_cleanup_queue() as suggested by Vivek.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  3. 06 Apr, 2012 1 commit
    • Shaohua Li's avatar
      block: make auto block plug flush threshold per-disk based · 1b2e19f1
      Shaohua Li authored
      We do auto block plug flush to reduce latency, the threshold is 16
      requests. This works well if the task is accessing one or two drives.
      The problem is if the task is accessing a raid 0 device and the raid
      disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have
      heavy lock contention.
      This patch makes the threshold per-disk based. The latency should be
      still ok accessing one or two drives. The setup with application
      accessing a lot of drives in the meantime uaually is big machine,
      avoiding lock contention is more important, because any contention
      will actually increase latency.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  4. 23 Mar, 2012 1 commit
  5. 06 Mar, 2012 10 commits
    • Tejun Heo's avatar
      block: implement bio_associate_current() · 852c788f
      Tejun Heo authored
      IO scheduling and cgroup are tied to the issuing task via io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task which makes scheduling and cgroup
      limit enforcement applied completely incorrectly.
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item and get assigned the io_context of the worker task
      which happens to serve the work item and dumped to the default block
      cgroup.  This is double confusing as bios which aren't delayed end up
      in the correct cgroup and makes using blk-throttle and cfq propio
      together impossible.
      Any code which punts IO issuing to another task is affected which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      It grabs and remembers ioc and blkcg instead of the task itself
      because task may already be dead by the time the bio is issued making
      ioc and blkcg inaccessible and those are all block layer cares about.
      elevator_set_req_fn() is updated such that the bio elvdata is being
      allocated for is available to the elevator.
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: interface update for ioc/icq creation functions · 24acfc34
      Tejun Heo authored
      Make the following interface updates to prepare for future ioc related
      * create_io_context() returning ioc only works for %current because it
        doesn't increment ref on the ioc.  Drop @task parameter from it and
        always assume %current.
      * Make create_io_context_slowpath() return 0 or -errno and rename it
        to create_task_io_context().
      * Make ioc_create_icq() take @ioc as parameter instead of assuming
        that of %current.  The caller, get_request(), is updated to create
        ioc explicitly and then pass it into ioc_create_icq().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: restructure get_request() · b679281a
      Tejun Heo authored
      get_request() is structured a bit unusually in that failure path is
      inlined in the usual flow with goto labels atop and inside it.
      Relocate the error path to the end of the function.
      This is to prepare for icq handling changes in get_request() and
      doesn't introduce any behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo authored
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      This patch make blkg's per cgroup-queue and let them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: move per-queue blkg list heads and counters to queue and blkg · 4eef3049
      Tejun Heo authored
      Currently, specific policy implementations are responsible for
      maintaining list and number of blkgs.  This duplicates code
      unnecessarily, and hinders factoring common code and providing blkcg
      API with better defined semantics.
      After this patch, request_queue hosts list heads and counters and blkg
      has list nodes for both policies.  This patch only relocates the
      necessary fields and the next patch will actually move management code
      into blkcg core.
      Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
      have 2 elements.  This is to avoid include dependency and will be
      removed by the next patch.
      This patch doesn't introduce any behavior change.
      -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
           as pointed out by Vivek.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: add blkcg_{init|drain|exit}_queue() · 5efd6113
      Tejun Heo authored
      Currently block core calls directly into blk-throttle for init, drain
      and exit.  This patch adds blkcg_{init|drain|exit}_queue() which wraps
      the blk-throttle functions.  This is to give more control and
      visiblity to blkcg core layer for proper layering.  Further patches
      will add logic common to blkcg policies to the functions.
      While at it, collapse blk_throtl_release() into blk_throtl_exit().
      There's no reason to keep them separate.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      blkcg: use the usual get blkg path for root blkio_group · f51b802c
      Tejun Heo authored
      For root blkg, blk_throtl_init() was using throtl_alloc_tg()
      explicitly and cfq_init_queue() was manually initializing embedded
      cfqd->root_group, adding unnecessarily different code paths to blkg
      Make both use the usual blkio_group get functions - throtl_get_tg()
      and cfq_get_cfqg() - for the root blkio_group too.  Note that
      blk_throtl_init() callsite is pushed downwards in
      blk_alloc_queue_node() so that @q is sufficiently initialized for
      This simplifies root blkg handling noticeably for cfq and will allow
      further modularization of blkcg API.
      -v2: Vivek pointed out that using cfq_get_cfqg() won't work if
           CONFIG_CFQ_GROUP_IOSCHED is disabled.  Fix it by factoring out
           initialization of base part of cfqg into cfq_init_cfqg_base() and
           alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: extend queue bypassing to cover blkcg policies · 6ecf23af
      Tejun Heo authored
      Extend queue bypassing such that dying queue is always bypassing and
      blk-throttle is drained on bypass.  With blkcg policies updated to
      test blk_queue_bypass() instead of blk_queue_dead(), this ensures that
      no bio or request is held by or going through blkcg policies on a
      bypassing queue.
      This will be used to implement blkg cleanup on elevator switches and
      policy changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: implement blk_queue_bypass_start/end() · d732580b
      Tejun Heo authored
      Rename and extend elv_queisce_start/end() to
      blk_queue_bypass_start/end() which are exported and supports nesting
      via @q->bypass_depth.  Also add blk_queue_bypass() to test bypass
      This will be further extended and used for blkio_group management.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: blk-throttle should be drained regardless of q->elevator · b855b04a
      Tejun Heo authored
      Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if
      q->elevator doesn't exist; however, bio based drivers don't have
      elevator initialized but can still use blk-throttle.  This patch moves
      q->elevator test inside blk_drain_queue() such that only
      elv_drain_elevator() is skipped if !q->elevator.
      -v2: loop can have registered queue which has NULL request_fn.  Make
           sure we don't call into __blk_run_queue() in such cases.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Fold in bug fix from Vivek.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  6. 08 Feb, 2012 2 commits
    • Tejun Heo's avatar
      block: don't call elevator callbacks for plug merges · 07c2bd37
      Tejun Heo authored
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      For regular merges, bypass ensures elvpriv count to reach zero, which
      in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
      from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
      the requests haven't gone through elevator insertion yet, it doesn't
      have SOFTBARRIER set allowing merges on a bypassed queue.
      This, for example, leads to the following crash during elevator
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      * Calling into elevator outside queue lock is confusing and
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions · 050c8ea8
      Tejun Heo authored
      blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
      test.  blk_try_merge() determines merge direction and expects the
      caller to have tested elv_rq_merge_ok() previously.
      elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
      elv_iosched_allow_merge().  elv_try_merge() is removed and the two
      callers are updated to call elv_rq_merge_ok() explicitly followed by
      blk_try_merge().  While at it, make rq_merge_ok() functions return
      This is to prepare for plug merge update and doesn't introduce any
      behavior change.
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  7. 07 Feb, 2012 1 commit
    • Tejun Heo's avatar
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo authored
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  8. 19 Jan, 2012 1 commit
    • Shaohua Li's avatar
      block: fix NULL icq_cache reference · 05c30b95
      Shaohua Li authored
      Vivek reported a kernel crash:
      [   94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
      [   94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] PGD 13abda067 PUD 137d52067 PMD 0
      [   94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [   94.218004] CPU 0
      [   94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
      [   94.218004]
      [   94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
      [   94.218004] RIP: 0010:[<ffffffff81142fae>]  [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
      [   94.218004] RSP: 0018:ffff88013fc03de0  EFLAGS: 00010006
      [   94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
      [   94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
      [   94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
      [   94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [   94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
      [   94.218004] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [   94.218004] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
      [   94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [   94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
      [   94.218004] Stack:
      [   94.218004]  0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
      [   94.218004]  0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
      [   94.218004]  ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
      [   94.218004] Call Trace:
      [   94.218004]  <IRQ>
      [   94.218004]  [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
      [   94.218004]  [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
      [   94.218004]  [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
      [   94.218004]  [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
      [   94.218004]  [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
      [   94.218004]  [<ffffffff81090104>] ? tick_program_event+0x24/0x30
      [   94.218004]  [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
      [   94.218004]  [<ffffffff8100422d>] do_softirq+0x8d/0xc0
      [   94.218004]  [<ffffffff81040c3e>] irq_exit+0xae/0xe0
      [   94.218004]  [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
      [   94.218004]  [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
      Once a queue is quiesced, it's not supposed to have any elvpriv data or
      icq's, and elevator switching depends on that.  Request alloc path
      followed the rule for elvpriv data but forgot apply it to icq's
      leading to the following crash during elevator switch. Fix it by not
      allocating icq's if ELVPRIV is not set for the request.
      Reported-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Tested-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  9. 15 Dec, 2011 1 commit
    • Tejun Heo's avatar
      block: don't kick empty queue in blk_drain_queue() · 4eabc941
      Tejun Heo authored
      While probing, fd sets up queue, probes hardware and tears down the
      queue if probing fails.  In the process, blk_drain_queue() kicks the
      queue which failed to finish initialization and fd is unhappy about
        floppy0: no floppy controllers found
        ------------[ cut here ]------------
        WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
        Hardware name: To Be Filled By O.E.M.
        VFS: do_fd_request called on non-open device
        Modules linked in:
        Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2b
        Call Trace:
         [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0
         [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50
         [<ffffffff813d657f>] do_fd_request+0xbf/0xd0
         [<ffffffff81322b95>] blk_drain_queue+0x65/0x80
         [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0
         [<ffffffff818a809d>] floppy_init+0xdeb/0xe28
         [<ffffffff818a72b2>] ? daring+0x6b/0x6b
         [<ffffffff810002af>] do_one_initcall+0x3f/0x170
         [<ffffffff81884b34>] kernel_init+0x9d/0x11e
         [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0
         [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10
         [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be
         [<ffffffff815dbb10>] ? gs_change+0xb/0xb
      Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
      something on it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarRalf Hildebrandt <Ralf.Hildebrandt@charite.de>
      Reported-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Tested-by: default avatarSergei Trofimovich <slyich@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  10. 13 Dec, 2011 9 commits
    • Tejun Heo's avatar
      block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Tejun Heo authored
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
      * Only ioprio/cgroup changed handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect of queue kicking
        achieved was possibily accelerating queue processing, this change
        shouldn't be noticeable.
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq is unbound and thus mempool can't be the
        solution either.  This effectively adds allocation dependency on
        memory free path and thus possibility of deadlock.
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has icq.
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      * Comment added to explain how icq's are managed and synchronized.
      This completes cleanup of io_context interface.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq · a612fddf
      Tejun Heo authored
      Most of icq management is about to be moved out of cfq into blk-ioc.
      This patch prepares for it.
      * Move cfqd->icq_list to request_queue->icq_list
      * Make request explicitly point to icq instead of through elevator
        private data.  ->elevator_private[3] is replaced with sub struct elv
        which contains icq pointer and priv[2].  cfq is updated accordingly.
      * Meaningless clearing of ->elevator_private[0] removed from
        elv_set_request().  At that point in code, the field was guaranteed
        to be %NULL anyway.
      This patch doesn't introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Tejun Heo authored
      When called under queue_lock, current_io_context() triggers lockdep
      warning if it hits allocation path.  This is because io_context
      installation is protected by task_lock which is not IRQ safe, so it
      triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
      deadlock warning.
      Given the restriction, accessor + creator rolled into one doesn't work
      too well.  Drop current_io_context() and let the users access
      task->io_context directly inside queue_lock combined with explicit
      creation using create_io_context().
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      This patch does not introduce functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: misc updates to blk_get_queue() · 09ac46c4
      Tejun Heo authored
      * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
        failure instead of 0 / -errno or boolean.  Update it such that it
        returns %true on success and %false on failure.
      * Make sure the caller checks for the return value.
      * Separate out __blk_get_queue() which doesn't check whether @q is
        dead and put it in blk.h.  This will be used later.
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move cfqd->cic_index to q->id · a73f730d
      Tejun Heo authored
      cfq allocates per-queue id using ida and uses it to index cic radix
      tree from io_context.  Move it to q->id and allocate on queue init and
      free on queue release.  This simplifies cfq a bit and will allow for
      further improvements of io context life-cycle management.
      This patch doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: add missing blk_queue_dead() checks · 8ba61435
      Tejun Heo authored
      blk_insert_cloned_request(), blk_execute_rq_nowait() and
      blk_flush_plug_list() either didn't check whether the queue was dead
      or did it without holding queue_lock.  Update them so that dead state
      is checked while holding queue_lock.
      AFAICS, this plugs all holes (requeue doesn't matter as the request is
      transitioning atomically from in_flight to queued).
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: fix drain_all condition in blk_drain_queue() · 481a7d64
      Tejun Heo authored
      When trying to drain all requests, blk_drain_queue() checked only
      q->rq.count[]; however, this only tracks REQ_ALLOCED requests.  This
      patch updates blk_drain_queue() such that it looks at all the counters
      and queues so that request_queue is actually empty on completion.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: add blk_queue_dead() · 34f6055c
      Tejun Heo authored
      There are a number of QUEUE_FLAG_DEAD tests.  Add blk_queue_dead()
      macro and use it.
      This patch doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, sx8: kill blk_insert_request() · 1ba64ede
      Tejun Heo authored
      The only user left for blk_insert_request() is sx8 and it can be
      trivially switched to use blk_execute_rq_nowait() - special requests
      aren't included in io stat and sx8 doesn't use block layer tagging.
      Switch sx8 and kill blk_insert_requeset().
      This patch doesn't introduce any functional difference.
      Only compile tested.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJeff Garzik <jgarzik@pobox.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  11. 23 Nov, 2011 1 commit
    • Mike Snitzer's avatar
      block: initialize request_queue's numa node during · 5151412d
      Mike Snitzer authored
      struct request_queue is allocated with __GFP_ZERO so its "node" field is
      zero before initialization.  This causes an oops if node 0 is offline in
      the page allocator because its zonelists are not initialized.  From Dave
      Young's dmesg:
      	SRAT: Node 1 PXM 2 0-d0000000
      	SRAT: Node 1 PXM 2 100000000-330000000
      	SRAT: Node 0 PXM 1 330000000-630000000
      	Initmem setup node 1 0000000000000000-000000000affb000
      	Built 1 zonelists in Node order, mobility grouping on.
      	BUG: unable to handle kernel paging request at 0000000000001c08
      	IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870
      and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
      The fix is to initialize q->node at the time of allocation so the correct
      node is passed to the slab allocator later.
      Since blk_init_allocated_queue_node() is no longer needed, merge it with
      [rientjes@google.com: changelog, initializing q->node]
      Cc: stable@vger.kernel.org [2.6.37+]
      Reported-by: default avatarDave Young <dyoung@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Tested-by: default avatarDave Young <dyoung@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  12. 16 Nov, 2011 2 commits
  13. 03 Nov, 2011 1 commit
    • Tejun Heo's avatar
      block: don't call blk_drain_queue() if elevator is not up · 6dd9ad7d
      Tejun Heo authored
      blk_cleanup_queue() may be called before elevator is set up on a
      queue which triggers the following oops.
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
       Bochs Bochs
       RIP: 0010:[<ffffffff8125a69c>]  [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70
       Call Trace:
        [<ffffffff8125da92>] blk_drain_queue+0x42/0x70
        [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0
        [<ffffffff81469640>] md_free+0x50/0x70
        [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0
        [<ffffffff81270d56>] kref_put+0x36/0xa0
        [<ffffffff8126f2b7>] kobject_put+0x27/0x60
        [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40
        [<ffffffff81083450>] process_one_work+0x100/0x3b0
        [<ffffffff8108527f>] worker_thread+0x15f/0x3a0
        [<ffffffff81089937>] kthread+0x87/0x90
        [<ffffffff81621834>] kernel_thread_helper+0x4/0x10
      Fix it by making blk_cleanup_queue() check whether q->elevator is set
      up before invoking blk_drain_queue.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarJiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  14. 24 Oct, 2011 2 commits
    • Jeff Moyer's avatar
      blk-flush: move the queue kick into · e67b77c7
      Jeff Moyer authored
      A dm-multipath user reported[1] a problem when trying to boot
      a kernel with commit 4853abaa
      (block: fix flush machinery for stacking drivers with differring
      flush flags) applied.  It turns out that an empty flush request
      can be sent into blk_insert_flush.  When the BUG_ON was fixed
      to allow for this, I/O on the underlying device would stall.  The
      reason is that blk_insert_cloned_request does not kick the queue.
      In the aforementioned commit, I had added a special case to
      kick the queue if data was sent down but the queue flags did
      not require a flush.  A better solution is to push the queue
      kick up into blk_insert_cloned_request.
      This patch, along with a follow-on which fixes the BUG_ON, fixes
      the issue reported.
      [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html
      Reported-by: default avatarChristophe Saout <christophe@saout.de>
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Stable note: 3.1
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tao Ma's avatar
      block: Remove the control of complete cpu from bio. · 9562ad9a
      Tao Ma authored
      bio originally has the functionality to set the complete cpu, but
      it is broken.
      Chirstoph said that "This code is unused, and from the all the
      discussions lately pretty obviously broken.  The only thing keeping
      it serves is creating more confusion and possibly more bugs."
      And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
      with leaving cpu control to the request based drivers, they are the
      only ones that can toggle the setting anyway".
      So this patch tries to remove all the work of controling complete cpu
      from a bio.
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarTao Ma <boyu.mt@taobao.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  15. 19 Oct, 2011 3 commits
    • Tejun Heo's avatar
      block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown · c9a929dd
      Tejun Heo authored
      request_queue is refcounted but actually depdends on lifetime
      management from the queue owner - on blk_cleanup_queue(), block layer
      expects that there's no request passing through request_queue and no
      new one will.
      This is fundamentally broken.  The queue owner (e.g. SCSI layer)
      doesn't have a way to know whether there are other active users before
      calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
      guarantee that the queue is and would stay valid while it's holding a
      With delay added in blk_queue_bio() before queue_lock is grabbed, the
      following oops can be easily triggered when a device is removed with
      in-flight IOs.
       sd 0:0:1:0: [sdb] Stopping disk
       ata1.01: disabled
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
       RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
       Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
       Call Trace:
        [<ffffffff8137d774>] elv_merge+0x84/0xe0
        [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
        [<ffffffff813838ea>] generic_make_request+0xca/0x100
        [<ffffffff81383994>] submit_bio+0x74/0x100
        [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
        [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
        [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
        [<ffffffff8118ce55>] vfs_read+0xc5/0x180
        [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
        [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
      This happens because blk_queue_cleanup() destroys the queue and
      elevator whether IOs are in progress or not and DEAD tests are
      sprinkled in the request processing path without proper
      Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
      is shutdown whether it has requests in it or not.  Depending on
      timing, it either oopses or throttled bios are lost putting tasks
      which are waiting for bio completion into eternal D state.
      The way it should work is having the usual clear distinction between
      shutdown and release.  Shutdown drains all currently pending requests,
      marks the queue dead, and performs partial teardown of the now
      unnecessary part of the queue.  Even after shutdown is complete,
      reference holders are still allowed to issue requests to the queue
      although they will be immmediately failed.  The rest of teardown
      happens on release.
      This patch makes the following changes to make blk_queue_cleanup()
      behave as proper shutdown.
      * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
      * Unsynchronized DEAD check in generic_make_request_checks() removed.
        This couldn't make any meaningful difference as the queue could die
        after the check.
      * blk_drain_queue() updated such that it can drain all requests and is
        now called during cleanup.
      * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
        drains all throttled bios during cleanup and free td when queue is
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: drop @tsk from attempt_plug_merge() and explain sync rules · bd87b589
      Tejun Heo authored
      attempt_plug_merge() accesses elevator without holding queue_lock and
      may call into ->elevator_bio_merge_fn().  The elvator is guaranteed to
      be valid because it's accessed iff the plugged list has requests and
      elevator is never exited with live requests, so as long as the
      elevator method can deal with unlocked access, this is safe.
      Explain the sync rules around attempt_plug_merge() and drop the
      unnecessary @tsk parameter.
      This patch doesn't introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: make get_request[_wait]() fail if queue is dead · da8303c6
      Tejun Heo authored
      Currently get_request[_wait]() allocates request whether queue is dead
      or not.  This patch makes get_request[_wait]() return NULL if @q is
      dead.  blk_queue_bio() is updated to fail the submitted bio if request
      allocation fails.  While at it, add docbook comments for
      Note that the current code has rather unclear (there are spurious DEAD
      tests scattered around) assumption that the owner of a queue
      guarantees that no request travels block layer if the queue is dead
      and this patch in itself doesn't change much; however, this will allow
      fixing the broken assumption in the next patch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>