1. 20 Nov, 2013 2 commits
  2. 19 Nov, 2013 1 commit
  3. 15 Nov, 2013 1 commit
  4. 14 Nov, 2013 1 commit
  5. 13 Nov, 2013 1 commit
    • Peter Zijlstra's avatar
      block: Use u64_stats_init() to initialize seqcounts · 90d3839b
      Peter Zijlstra authored
      Now that seqcounts are lockdep enabled objects, we need to explicitly
      initialize runtime allocated seqcounts so that lockdep can track them.
      Without this patch, Fengguang was seeing:
        [    4.127282] INFO: trying to register non-static key.
        [    4.128027] the code is fine but needs lockdep annotation.
        [    4.128027] turning off the locking correctness validator.
        [    4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
        [    4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        [    ...     ]
        [    4.128027] Call Trace:
        [    4.128027]  [<7908e744>] ? console_unlock+0x353/0x380
        [    4.128027]  [<79dc7cf2>] dump_stack+0x48/0x60
        [    4.128027]  [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb
        [    4.128027]  [<7908a1c5>] lock_acquire+0x71/0x9a
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<794079aa>] blk_throtl_bio+0x1c3/0x485
      Use u64_stats_init() for all affected data structures, which initializes
      the seqcount.
      Reported-and-Tested-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
      [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
  6. 09 Nov, 2013 1 commit
  7. 08 Nov, 2013 11 commits
    • Duan Jiong's avatar
      block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · c7d1ba41
      Duan Jiong authored
      This patch fixes coccinelle error regarding usage of IS_ERR and
      PTR_ERR instead of PTR_ERR_OR_ZERO.
      Signed-off-by: default avatarDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Duan Jiong's avatar
      block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 8616ebb1
      Duan Jiong authored
      This patch fixes coccinelle error regarding usage of IS_ERR and
      PTR_ERR instead of PTR_ERR_OR_ZERO.
      Signed-off-by: default avatarDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Geert Uytterhoeven's avatar
      block: Do not call sector_div() with a 64-bit divisor · 97597dc0
      Geert Uytterhoeven authored
      do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
      of 64-bit number by 32-bit numbers.  Passing 64-bit divisor types caused
      issues in the past on 32-bit platforms, cfr. commit
       ("m68k: Truncate base in
      As queue_limits.max_discard_sectors and .discard_granularity are unsigned
      int, max_discard_sectors and granularity should be unsigned int.
      As bdev_discard_alignment() returns int, alignment should be int.
      Now 2 calls to sector_div() can be replaced by 32-bit arithmetic:
        - The 64-bit modulo operation can become a 32-bit modulo operation,
        - The 64-bit division and multiplication can be replaced by a 32-bit
          modulo operation and a subtraction.
      Signed-off-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Kent Overstreet's avatar
      block: Use rw_copy_check_uvector() · e0ce0eac
      Kent Overstreet authored
      No need for silly open coding - and struct sg_iovec has exactly the same
      layout as struct iovec...
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Alireza Haghdoost's avatar
      block: Enable sysfs nomerge control for I/O requests in the plug list · 23779fbc
      Alireza Haghdoost authored
      This patch enables the sysfs to control I/O request merge
      functionality in the plug list. While this control has been
      implemented for the request queue, it was dismissed in the plug list.
      Therefore, block layer merges requests together (or attempt to merge)
      even if the merge capability was disable using sysfs nomerge parameter
      value 2.
      This limitation is directly affects functionality of io_submit()
      system call. The system call enables user to submit a bunch of IO
      requests from user space using struct iocb **ios input argument.
      However, the unconditioned merging functionality in the plug list
      potentially merges these requests together down the road. Therefore,
      there is no way to distinguish between an application sending bunch of
      sequential IOs and an application sending one big IO. Ultimately, all
      requests generated by the former app merge within the plug list
      together and looks similar to the second app.
      While the merging functionality is a desirable feature to improve the
      performance of IO subsystem for some applications, it is not useful
      for other application like ours at all.
      Signed-off-by: default avatarAlireza Haghdoost <alireza@cs.umn.edu>
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Coding style modified.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Mike Snitzer's avatar
      block: properly stack underlying max_segment_size to DM device · d82ae52e
      Mike Snitzer authored
      Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
      (65536) even if the underlying device(s) have a larger value -- this is
      due to blk_stack_limits() using min_not_zero() when stacking the
      max_segment_size limit.
      before patch:
      after patch:
      Reported-by: default avatarLukasz Flis <l.flis@cyfronet.pl>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # v3.3+
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tomoki Sekiyama's avatar
      elevator: acquire q->sysfs_lock in elevator_change() · 7c8a3679
      Tomoki Sekiyama authored
      Add locking of q->sysfs_lock into elevator_change() (an exported function)
      to ensure it is held to protect q->elevator from elevator_init(), even if
      elevator_change() is called from non-sysfs paths.
      sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
      version, as the lock is already taken by elv_iosched_store().
      Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tomoki Sekiyama's avatar
      elevator: Fix a race in elevator switching and md device initialization · eb1c160b
      Tomoki Sekiyama authored
      The soft lockup below happens at the boot time of the system using dm
      multipath and the udev rules to switch scheduler.
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      This is caused by a race between md device initialization by multipathd and
      shell script to switch the scheduler using sysfs.
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
      while timer->base == NULL. In this case, as timer will never initialized,
      it results in lockup.
      This patch introduces acquisition of q->sysfs_lock around elevator_init()
      into blk_init_allocated_queue(), to provide mutual exclusion between
      initialization of the q->scheduler and switching of the scheduler.
      This should fix this bugzilla:
      Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Christoph Lameter's avatar
      block: Replace __get_cpu_var uses · 170d800a
      Christoph Lameter authored
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      __get_cpu_var() is defined as :
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and less registers
      are used when code is generated.
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      The patch set includes passes over all arches as well. Once these operations
      are used throughout then specialized macros can be defined in non -x86
      arches as well in order to optimize per cpu access by f.e.  using a global
      register that may be set to the per cpu base.
      Transformations done to __get_cpu_var()
      1. Determine the address of the percpu instance of the current processor.
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
          Converts to
      	int *x = this_cpu_ptr(&y);
      2. Same as #1 but this time an array structure is involved.
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
          Converts to
      	int *x = this_cpu_ptr(y);
      3. Retrieve the content of the current processors instance of a per cpu
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
         Converts to
      	int x = __this_cpu_read(y);
      4. Retrieve the content of a percpu struct
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
         Converts to
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      5. Assignment to a per cpu variable
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
         Converts to
      	this_cpu_write(y, x);
      6. Increment/Decrement etc of a per cpu variable
      	DEFINE_PER_CPU(int, y);
         Converts to
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Mikulas Patocka's avatar
      blk-core: Fix memory corruption if blkcg_init_queue fails · fff4996b
      Mikulas Patocka authored
      If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
      to clean up structures allocated by the backing dev.
      ------------[ cut here ]------------
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
      ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
      Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
      CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W
      3.10.15-devel #14
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
       0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
       ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
       ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
      Call Trace:
       [<ffffffff813c8fd4>] dump_stack+0x19/0x1b
       [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
       [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
       [<ffffffff81229a15>] debug_print_object+0x85/0xa0
       [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
       [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
       [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
       [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10
       [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
       [<ffffffff81097662>] ? get_lock_stats+0x22/0x70
       [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
       [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
       [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
       [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      ---[ end trace 4b5ff0d55673d986 ]---
      ------------[ cut here ]------------
      This fix should be backported to stable kernels starting with 2.6.37. Note
      that in the kernels prior to 3.5 the affected code is different, but the
      bug is still there - bdi_init is called and bdi_destroy isn't.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.37+
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jeff Moyer's avatar
      block: fix race between request completion and timeout handling · 4912aa6c
      Jeff Moyer authored
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      The RIP is this line:
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      static inline int blk_mark_rq_complete(struct request *rq)
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      req_atom_complete will be set, and all will be well.
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarHannes Reinecke <hare@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  8. 31 Oct, 2013 1 commit
  9. 29 Oct, 2013 2 commits
  10. 28 Oct, 2013 1 commit
    • Christoph Hellwig's avatar
      blk-mq: fix for flush deadlock · 3228f48b
      Christoph Hellwig authored
      The flush state machine takes in a struct request, which then is
      submitted multiple times to the underling driver.  The old block code
      requeses the same request for each of those, so it does not have an
      issue with tapping into the request pool.  The new one on the other hand
      allocates a new request for each of the actualy steps of the flush
      sequence. If have already allocated all of the tags for IO, we will
      fail allocating the flush request.
      Set aside a reserved request just for flushes.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  11. 25 Oct, 2013 4 commits
    • Christoph Hellwig's avatar
      blk-mq: add blk_mq_stop_hw_queues · 280d45f6
      Christoph Hellwig authored
      Add a helper to iterate over all hw queues and stop them.  This is useful
      for driver that implement PM suspend functionality.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Modified to just call blk_mq_stop_hw_queue() by Jens.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jens Axboe's avatar
      blk-mq: new multi-queue block IO queueing mechanism · 320ae51f
      Jens Axboe authored
      Linux currently has two models for block devices:
      - The classic request_fn based approach, where drivers use struct
        request units for IO. The block layer provides various helper
        functionalities to let drivers share code, things like tag
        management, timeout handling, queueing, etc.
      - The "stacked" approach, where a driver squeezes in between the
        block layer and IO submitter. Since this bypasses the IO stack,
        driver generally have to manage everything themselves.
      With drivers being written for new high IOPS devices, the classic
      request_fn based driver doesn't work well enough. The design dates
      back to when both SMP and high IOPS was rare. It has problems with
      scaling to bigger machines, and runs into scaling issues even on
      smaller machines when you have IOPS in the hundreds of thousands
      per device.
      The stacked approach is then most often selected as the model
      for the driver. But this means that everybody has to re-invent
      everything, and along with that we get all the problems again
      that the shared approach solved.
      This commit introduces blk-mq, block multi queue support. The
      design is centered around per-cpu queues for queueing IO, which
      then funnel down into x number of hardware submission queues.
      We might have a 1:1 mapping between the two, or it might be
      an N:M mapping. That all depends on what the hardware supports.
      blk-mq provides various helper functions, which include:
      - Scalable support for request tagging. Most devices need to
        be able to uniquely identify a request both in the driver and
        to the hardware. The tagging uses per-cpu caches for freed
        tags, to enable cache hot reuse.
      - Timeout handling without tracking request on a per-device
        basis. Basically the driver should be able to get a notification,
        if a request happens to fail.
      - Optional support for non 1:1 mappings between issue and
        submission queues. blk-mq can redirect IO completions to the
        desired location.
      - Support for per-request payloads. Drivers almost always need
        to associate a request structure with some driver private
        command structure. Drivers can tell blk-mq this at init time,
        and then any request handed to the driver will have the
        required size of memory associated with it.
      - Support for merging of IO, and plugging. The stacked model
        gets neither of these. Even for high IOPS devices, merging
        sequential IO reduces per-command overhead and thus
        increases bandwidth.
      For now, this is provided as a potential 3rd queueing model, with
      the hope being that, as it matures, it can replace both the classic
      and stacked model. That would get us back to having just 1 real
      model for block devices, leaving the stacked approach to dm/md
      devices (as it was originally intended).
      Contributions in this patch from the following people:
      Shaohua Li <shli@fusionio.com>
      Alexander Gordeev <agordeev@redhat.com>
      Christoph Hellwig <hch@infradead.org>
      Mike Christie <michaelc@cs.wisc.edu>
      Matias Bjorling <m@bjorling.me>
      Jeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Christoph Hellwig's avatar
      block: remove request ref_count · 71fe07d0
      Christoph Hellwig authored
      This reference count has been around since before git history, but the only
      place where it's used is in blk_execute_rq, and ther it is entirely useless
      as it is incremented before submitting the request and decremented in the
      end_io handler before waking up the submitter thread.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jens Axboe's avatar
      block: make rq->cmd_flags be 64-bit · 5953316d
      Jens Axboe authored
      We have officially run out of flags in a 32-bit space. Extend it
      to 64-bit even on 32-bit archs.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  12. 17 Oct, 2013 1 commit
  13. 30 Sep, 2013 1 commit
    • Paul Gortmaker's avatar
      block: change config option name for cmdline partition parsing · 080506ad
      Paul Gortmaker authored
      Recently commit bab55417
       ("block: support embedded device command
      line partition") introduced CONFIG_CMDLINE_PARSER.  However, that name
      is too generic and sounds like it enables/disables generic kernel boot
      arg processing, when it really is block specific.
      Before this option becomes a part of a full/final release, add the BLK_
      prefix to it so that it is clear in absence of any other context that it
      is block specific.
      In addition, fix up the following less critical items:
       - help text was not really at all helpful.
       - index file for Documentation was not updated
       - add the new arg to Documentation/kernel-parameters.txt
       - clarify wording in source comments
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Cai Zhiyong <caizhiyong@huawei.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 22 Sep, 2013 1 commit
  15. 18 Sep, 2013 1 commit
  16. 15 Sep, 2013 1 commit
  17. 11 Sep, 2013 9 commits