1. 28 Nov, 2013 2 commits
    • NeilBrown's avatar
      md/raid5: fix newly-broken locking in get_active_stripe. · 6d183de4
      NeilBrown authored
      commit 566c09c5
       raid5: relieve lock contention in get_active_stripe()
      modified the locking in get_active_stripe() reducing the range
      protected by the (highly contended) device_lock.
      Unfortunately it reduced the range too much opening up some races.
      One race can occur if get_priority_stripe runs between the
      test on sh->count and device_lock being taken.
      This will mean that sh->lru is not empty while get_active_stripe
      thinks ->count is zero resulting in a 'BUG' firing.
      Another race happens if __release_stripe is called immediately
      after sh->count is tested and found to be non-zero.  If STRIPE_HANDLE
      is not set, get_active_stripe should increment ->active_stripes
      when it increments ->count from 0, but as it didn't think it was 0,
      it doesn't.
      Extending device_lock to cover the test on sh->count close these
      While we are here, fix the two BUG tests:
       -If count is zero, then lru really must not be empty, or we've
        lock the stripe_head somehow - no other tests are relevant.
       -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
        testing it is pointless.
      Reported-and-tested-by: default avatarBrassow Jonathan <jbrassow@redhat.com>
      Reviewed-by: default avatarShaohua Li <shli@kernel.org>
      Fixes: 566c09c5
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: fix new memory-reference bug in alloc_thread_groups. · 0c775d52
      NeilBrown authored
      In alloc_thread_groups, worker_groups is a pointer to an array,
      not an array of pointers.
      is wrong.  It should be
      Found-by: coverity
      Fixes: 60aaf933
      Reported-by: default avatarBen Hutchings <bhutchings@solarflare.com>
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  2. 19 Nov, 2013 6 commits
    • majianpeng's avatar
      md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      majianpeng authored
      When we change group_thread_cnt from sysfs entry, it can OOPS.
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      This is because raid5d() can be running when the multi-thread
      resources are changed via system. We see need to provide locking.
      mddev->device_lock is suitable, but we cannot simple call
      alloc_thread_groups under this lock as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures, then raid5_store_group_thread_cnt() can take the lock
      while updating the pointers to the data structures.
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      Fixes: b721420e
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarShaohua Li <shli@kernel.org>
    • majianpeng's avatar
      md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      majianpeng authored
      When changing group_thread_cnt from sysfs entry, the kernel can oops.
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
      A worker is queued to handle them.
      But before calling raid5_do_work, raid5d handles those
      stripes making conf->active_stripe = 0.
      So mddev_suspend() can return.
      We might then free old worker resources before the queued
      raid5_do_work() handled them.  When it runs, it crashes.
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				free(old worker resources)
      To avoid this, we should only flush the worker resources before freeing them.
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Reviewed-by: default avatarShaohua Li <shli@kernel.org>
    • majianpeng's avatar
      md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. · e59aa23f
      majianpeng authored
      For R5_ReadNoMerge,it mean this bio can't merge with other bios or
      request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the
      same work.
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      NeilBrown authored
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread and that only happen when it finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge that latest step in a reshape as it may
      not be fully committed yet.
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't wait have to wait for the speed to drop before we can abort.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Bian Yu's avatar
      raid5: Retry R5_ReadNoMerge flag when hit a read error. · edfa1f65
      Bian Yu authored
      Because of block layer merge, one bio fails will cause other bios
      which belongs to the same request fails, so raid5_end_read_request
      will record all these bios as badblocks.
      If retry request with R5_ReadNoMerge flag to avoid bios merge,
      badblocks can only record sector which is bad exactly.
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
      Signed-off-by: default avatarBian Yu <bianyu@kedacom.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Shaohua Li authored
      track empty inactive list count, so md_raid5_congested() can use it to make
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  3. 15 Nov, 2013 1 commit
  4. 14 Nov, 2013 3 commits
    • Shaohua Li's avatar
      raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Shaohua Li authored
      get_active_stripe() is the last place we have lock contention. It has two
      paths. One is stripe isn't found and new stripe is allocated, the other is
      stripe is found.
      The first path basically calls __find_stripe and init_stripe. It accesses
      conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
      stripe_hashtbl and inactive_list, other fields are changed very rarely.
      With this patch, we split inactive_list and add new hash locks. Each free
      stripe belongs to a specific inactive list. Which inactive list is determined
      by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
      lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
      is determined by it's lock_hash too. The lock_hash is derivied from current
      stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
      to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
      list too. The goal of the new hash locks introduced is we can only use the new
      locks in the first path of get_active_stripe(). Since we have several hash
      locks, lock contention is relieved significantly.
      The first path of get_active_stripe() accesses other fields, since they are
      changed rarely, changing them now need take conf->device_lock and all hash
      locks. For a slow path, this isn't a problem.
      If we need lock device_lock and hash lock, we always lock hash lock first. The
      tricky part is release_stripe and friends. We need take device_lock first.
      Neil's suggestion is we put inactive stripes to a temporary list and readd it
      to inactive_list after device_lock is released. In this way, we add stripes to
      temporary list with device_lock hold and remove stripes from the list with hash
      lock hold. So we don't allow concurrent access to the temporary list, which
      means we need allocate temporary list for all participants of release_stripe.
      One downside is free stripes are maintained in their inactive list, they can't
      across between the lists. By default, we have total 256 stripes and 8 lists, so
      each list will have 32 stripes. It's possible one list has free stripe but
      other list hasn't. The chance should be rare because stripes allocation are
      even distributed. And we can always allocate more stripes for cache, several
      mega bytes memory isn't a big deal.
      This completely removes the lock contention of the first path of
      get_active_stripe(). It slows down the second code path a little bit though
      because we now need takes two locks, but since the hash lock isn't contended,
      the overhead should be quite small (several atomic instructions). The second
      path of get_active_stripe() (basically sequential write or big request size
      randwrite) still has lock contentions.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5.c: add proper locking to error path of raid5_start_reshape. · ba8805b9
      NeilBrown authored
      If raid5_start_reshape errors out, we need to reset all the fields
      that were updated (not just some), and need to use the seq_counter
      to ensure make_request() doesn't use an inconsitent state.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • majianpeng's avatar
      raid5: Use slow_path to release stripe when mddev->thread is null · ad4068de
      majianpeng authored
      When release_stripe() is called in grow_one_stripe(), the
      mddev->thread is null. So it will omit one wakeup this thread to
      release stripe.
      For this condition, use slow_path to release stripe.
      Bug was introduced in 3.12
      Cc: stable@vger.kernel.org (3.12+)
      Fixes: 773ca82f
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  5. 24 Oct, 2013 2 commits
  6. 02 Sep, 2013 1 commit
    • Shaohua Li's avatar
      raid5: only wakeup necessary threads · bfc90cb0
      Shaohua Li authored
      If there are not enough stripes to handle, we'd better not always
      queue all available work_structs. If one worker can only handle small
      or even none stripes, it will impact request merge and create lock
      With this patch, the number of work_struct running will depend on
      pending stripes number. Note: some statistics info used in the patch
      are accessed without locking protection. This should doesn't matter,
      we just try best to avoid queue unnecessary work_struct.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  7. 28 Aug, 2013 6 commits
    • NeilBrown's avatar
      md/raid5: flush out all pending requests before proceeding with reshape. · 4d77e3ba
      NeilBrown authored
      Some requests - particularly 'discard' and 'read' are handled
      differently depending on whether a reshape is active or not.
      It is harmless to assume reshape is active if it isn't but wrong
      to act as though reshape is not active when it is.
      So when we start reshape - after making clear to all requests that
      reshape has started - use mddev_suspend/mddev_resume to flush out all
      requests.  This will ensure that no requests will be assuming the
      absence of reshape once it really starts.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: use seqcount to protect access to shape in make_request. · c46501b2
      NeilBrown authored
      make_request() access various shape parameters (raid_disks, chunk_size
      etc) which might be changed by raid5_start_reshape().
      If the later is called at and awkward time during the form, the wrong
      stripe_head might be used.
      So introduce a 'seqcount' and after finding a stripe_head make sure
      there is no reason to expect that we got the wrong one.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      raid5: sysfs entry to control worker thread number · b721420e
      Shaohua Li authored
      Add a sysfs entry to control running workqueue thread number. If
      group_thread_cnt is set to 0, we will disable workqueue offload handling of
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      raid5: offload stripe handle to workqueue · 851c30c9
      Shaohua Li authored
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use workqueue.
      raid5 handles request (especially write) in stripe unit. A stripe is page size
      aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
      state machine for the corresponding stripe, which includes reading some disks
      of the stripe, calculating parity, and writing some disks of the stripe. The
      state machine is running in raid5d thread currently. Since there is only one
      thread, it doesn't scale well for high speed storage. An obvious solution is
      To get better performance, we have some requirements:
      a. locality. stripe corresponding to request submitted from one cpu is better
      handled in thread in local cpu or local node. local cpu is preferred but some
      times could be a bottleneck, for example, parity calculation is too heavy.
      local node running has wide adaptability.
      b. configurablity. Different setup of raid5 array might need diffent
      configuration. Especially the thread number. More threads don't always mean
      better performance because of lock contentions.
      My original implementation is creating some kernel threads. There are
      interfaces to control which cpu's stripe each thread should handle. And
      userspace can set affinity of the threads. This provides biggest flexibility
      and configurability. But it's hard to use and apparently a new thread pool
      implementation is disfavor.
      Recent workqueue improvement is quite promising. unbound workqueue will be
      bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
      do affinity setting. For example, we can only include one HT sibling in
      affinity. Since work is non-reentrant by default, and we can control running
      thread number by limiting dispatched work_struct number.
      In this patch, I created several stripe worker group. A group is a numa node.
      stripes from cpus of one node will be added to a group list. Workqueue thread
      of one node will only handle stripes of worker group of the node. In this way,
      stripe handling has numa node locality. And as I said, we can control thread
      number by limiting dispatched work_struct number.
      The work_struct callback function handles several stripes in one run. A typical
      work queue usage is to run one unit in each work_struct. In raid5 case, the
      unit is a stripe. But we can't do that:
      a. Though handling a stripe doesn't need lock because of reference accounting
      and stripe isn't in any list, queuing a work_struct for each stripe will make
      workqueue lock contended very heavily.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
      might dispatch request. If each work_struct only handles one stripe, such block
      plug is meaningless.
      This implementation can't do very fine grained configuration. But the numa
      binding is most popular usage model, should be enough for most workloads.
      Note: since we have only one stripe queue, switching to multi-thread might
      decrease request size dispatching down to low level layer. The impact depends
      on thread number, raid configuration and workload. So multi-thread raid5 might
      not be proper for all setups.
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      raid5: fix stripe release order · d265d9dc
      Shaohua Li authored
      patch "make release_stripe lockless" changes the order stripes are released.
      Originally I thought block layer can take care of request merge, but it appears
      there are still some requests not merged. It's easy to fix the order.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      raid5: make release_stripe lockless · 773ca82f
      Shaohua Li authored
      release_stripe still has big lock contention. We just add the stripe to a llist
      without taking device_lock. We let the raid5d thread to do the real stripe
      release, which must hold device_lock anyway. In this way, release_stripe
      doesn't hold any locks.
      The side effect is the released stripes order is changed. But sounds not a big
      deal, stripes are never handled in order. And I thought block layer can already
      do nice request merge, which means order isn't that important.
      I kept the unplug release batch, which is unnecessary with this patch from lock
      contention avoid point of view, and actually if we delete it, the stripe_head
      release_list and lru can share storage. But the unplug release batch is also
      helpful for request merge. We probably can delay wakeup raid5d till unplug, but
      I'm still afraid of the case which raid5d is running.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  8. 25 Jul, 2013 1 commit
    • NeilBrown's avatar
      md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66
      NeilBrown authored
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      So schedule those writes when s.replacing is set as well.
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensure that now IO has been flagged but not started.  The former
      checks if any parity calculation has been flagged by not started.
      We must wait for both of these to complete before triggering the
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      Cc: stable@vger.kernel.org (3.3+)
      Reported-by: default avatarqindehua <13691222965@163.com>
      Tested-by: default avatarqindehua <qindehua@163.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  9. 04 Jul, 2013 1 commit
    • NeilBrown's avatar
      md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6
      NeilBrown authored
      There is a bug in 'check_reshape' for raid5.c  To checks
      that the new minimum number of devices is large enough (which is
      good), but it does so also after the reshape has started (bad).
      This is bad because
       - the calculation is now wrong as mddev->raid_disks has changed
         already, and
       - it is pointless because it is now too late to stop.
      So only perform that test when reshape has not been committed to.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  10. 13 Jun, 2013 2 commits
    • Jingoo Han's avatar
      md: replace strict_strto*() with kstrto*() · b29bebd6
      Jingoo Han authored
      The usage of strict_strtoul() is not preferred, because
      strict_strtoul() is obsolete. Thus, kstrtoul() should be
      Signed-off-by: default avatarJingoo Han <jg1.han@samsung.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • H. Peter Anvin's avatar
      md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      H. Peter Anvin authored
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drivers behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
      [neilb: added raid5]
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  11. 30 May, 2013 1 commit
  12. 24 Apr, 2013 2 commits
  13. 18 Apr, 2013 1 commit
  14. 23 Mar, 2013 3 commits
    • Kent Overstreet's avatar
      raid5: use bio_reset() · 2f6db2a7
      Kent Overstreet authored
      Had to shuffle the code around a bit (where bi_rw and bi_end_io were
      set), but shouldn't really be anything tricky here
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • Kent Overstreet's avatar
      block: Use bio_sectors() more consistently · aa8b57aa
      Kent Overstreet authored
      Bunch of places in the code weren't using it where they could be -
      this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
      into a struct bvec_iter.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarEd Cashin <ecashin@coraid.com>
    • Kent Overstreet's avatar
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet authored
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  15. 20 Mar, 2013 3 commits
    • NeilBrown's avatar
      md/raid5: ensure sync and DISCARD don't happen at the same time. · f8dfcffd
      NeilBrown authored
      A number of problems can occur due to races between
      resync/recovery and discard.
      - if sync_request calls handle_stripe() while a discard is
        happening on the stripe, it might call handle_stripe_clean_event
        before all of the individual discard requests have completed
        (so some devices are still locked, but not all).
        Since commit ca64cae9
           md/raid5: Make sure we clear R5_Discard when discard is finished.
        this will cause R5_Discard to be cleared for the parity device,
        so handle_stripe_clean_event() will not be called when the other
        devices do become unlocked, so their ->written will not be cleared.
        This ultimately leads to a WARN_ON in init_stripe and a lock-up.
      - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
        time for resync, it can lead to s->uptodate being less than disks
        in handle_parity_checks5(), which triggers a BUG (because it is
       - keep R5_Discard on the parity device until all other devices have
         completed their discard request
       - make sure we don't try to have a 'discard' and a 'sync' action at
         the same time.
         This involves a new stripe flag to we know when a 'discard' is
         happening, and the use of R5_Overlap on the parity disk so when a
         discard is wanted while a sync is active, so we know to wake up
         the discard at the appropriate time.
      Discard support for RAID5 was added in 3.7, so this is suitable for
      any -stable kernel since 3.7.
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Tested-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Jonathan Brassow's avatar
      MD RAID5: Avoid accessing gendisk or queue structs when not available · e3620a3a
      Jonathan Brassow authored
      MD RAID5:  Fix kernel oops when RAID4/5/6 is used via device-mapper
      Commit a9add5d9
       (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
      However, when device-mapper is used to create RAID4/5/6 arrays, the
      mddev->gendisk and mddev->queue fields are not setup.  Therefore, calling
      things like trace_block_bio_remap will cause a kernel oops.  This patch
      conditionalizes those calls on whether the proper fields exist to make
      the calls.  (Device-mapper will call trace_block_bio_remap on its own.)
      This patch is suitable for the 3.8.y stable kernel.
      Cc: stable@vger.kernel.org (v3.8+)
      Signed-off-by: default avatarJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: schedule_construction should abort if nothing to do. · ce7d363a
      NeilBrown authored
      Since commit 1ed850f3
          md/raid5: make sure to_read and to_write never go negative.
      It has been possible for handle_stripe_dirtying to be called
      when there isn't actually any work to do.
      It then calls schedule_reconstruction() which will set R5_LOCKED
      on the parity block(s) even when nothing else is happening.
      This then causes problems in do_release_stripe().
      So add checks to schedule_reconstruction() so that if it doesn't
      find anything to do, it just aborts.
      This bug was introduced in v3.7, so the patch is suitable
      for -stable kernels since then.
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: default avatarmajianpeng <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  16. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
              list_for_each_entry(pos, head, member)
      The hlist ones were greedy and wanted an extra parameter:
              hlist_for_each_entry(tpos, pos, head, member)
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      Besides the semantic patch, there was some manual work required:
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      -T b;
          <+... when != b
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      - b,
      c) S
      for_each_busy_worker(a, c,
      - b,
      d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      -(a, b)
      + sk_for_each_from(a) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d, e) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  17. 27 Feb, 2013 1 commit
    • NeilBrown's avatar
      md: remove CONFIG_MULTICORE_RAID456 · 51acbcec
      NeilBrown authored
      This doesn't seem to actually help and we have an alternate
      multi-threading approach waiting in the wings, so just get
      rid of this config option and associated code.
      As a bonus, we remove one use of CONFIG_EXPERIMENTAL
      Cc: Dan Williams <djbw@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  18. 14 Jan, 2013 1 commit
    • Tejun Heo's avatar
      block: add missing block_bio_complete() tracepoint · 3a366e61
      Tejun Heo authored
      bio completion didn't kick block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This makes
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skip generating blktrace
      completion events.
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o queue associated.  Verifying and accessing the assocaited queue
        belongs to TP probes.
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      v2: With this change, block_bio_complete TP could be invoked on sg
          commands which have bio's with %NULL bi_bdev.  Update TP
          assignment code to check whether bio->bi_bdev is %NULL before
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Original-patch-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  19. 17 Dec, 2012 1 commit
  20. 13 Dec, 2012 1 commit