1. 10 Nov, 2014 7 commits
  2. 04 Nov, 2014 1 commit
  3. 01 Aug, 2014 3 commits
  4. 11 Jun, 2014 1 commit
    • dm thin: update discard_granularity to reflect the thin-pool blocksize · 09869de5
      Lukas Czerner authored
      DM thinp already checks whether the discard_granularity of the data
      device is a factor of the thin-pool block size.  But when using the
      dm-thin-pool's discard passdown support, DM thinp was not selecting the
      max of the underlying data device's discard_granularity and the
      thin-pool's block size.
      Update set_discard_limits() to set discard_granularity to the max of
      these values.  This enables blkdev_issue_discard() to properly align the
      discards that are sent to the DM thin device on a full block boundary.
      As such each discard will now cover an entire DM thin-pool block and the
      block will be reclaimed.
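A minimal sketch of the selection rule described above, assuming both values are expressed in bytes. The helper name and parameters are hypothetical and are not taken from set_discard_limits():

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): pick the discard
 * granularity as the larger of the data device's granularity and the
 * thin-pool block size, so every discard covers whole pool blocks.
 */
static uint32_t sk_discard_granularity(uint32_t data_dev_granularity_bytes,
                                       uint32_t pool_block_size_bytes)
{
    return data_dev_granularity_bytes > pool_block_size_bytes
        ? data_dev_granularity_bytes
        : pool_block_size_bytes;
}
```

With a 4 KiB device granularity and a 64 KiB pool block, the sketch selects 64 KiB.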
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  5. 03 Jun, 2014 2 commits
  6. 20 May, 2014 1 commit
    • dm thin: add 'no_space_timeout' dm-thin-pool module param · 80c57893
      Mike Snitzer authored
      Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode
      holding IO forever") introduced a fixed 60 second timeout.  Users may
      want to either disable or modify this timeout.
      Allow the out-of-data-space timeout to be configured using the
      'no_space_timeout' dm-thin-pool module param.  Setting it to 0 will
      disable the timeout, resulting in IO being queued until more data space
      is added to the thin-pool.
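The policy can be modelled as below. This is only an illustrative sketch: the struct and function are hypothetical, and the real target arms a timer rather than polling elapsed time.

```c
#include <stdbool.h>

/* Hypothetical model of the no_space_timeout policy; not kernel code. */
struct sk_no_space_policy {
    unsigned timeout_secs;   /* 0 disables the timeout entirely */
};

/* Returns true once IO that has been queued for waited_secs should stop waiting. */
static bool sk_no_space_timed_out(const struct sk_no_space_policy *p,
                                  unsigned waited_secs)
{
    if (p->timeout_secs == 0)
        return false;        /* keep queueing until data space is added */
    return waited_secs >= p->timeout_secs;
}
```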
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # 3.14+
  7. 14 May, 2014 2 commits
  8. 29 Apr, 2014 1 commit
  9. 08 Apr, 2014 2 commits
    • dm thin: fix rcu_read_lock being held in code that can sleep · b10ebd34
      Joe Thornber authored
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      introduced the use of an rculist for all active thin devices.  The use
      of rcu_read_lock() in process_deferred_bios() can result in a BUG if a
      dm_bio_prison_cell must be allocated as a side-effect of bio_detain():
       BUG: sleeping function called from invalid context at mm/mempool.c:203
       in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
       3 locks held by kworker/u8:0/6:
         #0:  ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #1:  ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550
         #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0
      We can't process deferred bios with the rcu lock held, since
      dm_bio_prison_cell allocation may block if the bio-prison's cell mempool
      is exhausted.
      To fix:
      - Introduce a refcount and completion field to each thin_c
      - Add thin_get/put methods for adjusting the refcount.  If the refcount
        hits zero then the completion is triggered.
      - Initialise refcount to 1 when creating thin_c
      - When iterating the active_thins list we thin_get() whilst the rcu
        lock is held.
      - After the rcu lock is dropped we process the deferred bios for that
        thin.
      - When destroying a thin_c we thin_put() and then wait for the
        completion -- to avoid a race between the worker thread iterating
        from that thin_c and destroying the thin_c.
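The get/put lifetime scheme in the list above can be sketched in user space with C11 atomics. The names are hypothetical, and an atomic flag stands in for the kernel completion that the destroy path waits on:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for thin_c; only the lifetime fields are modelled. */
struct sk_thin {
    atomic_int refcount;
    atomic_bool can_destroy;   /* stands in for the completion */
};

static void sk_thin_init(struct sk_thin *t)
{
    atomic_init(&t->refcount, 1);          /* creator holds the first reference */
    atomic_init(&t->can_destroy, false);
}

static void sk_thin_get(struct sk_thin *t)
{
    atomic_fetch_add(&t->refcount, 1);
}

static void sk_thin_put(struct sk_thin *t)
{
    /* fetch_sub returns the previous value: 1 means the last reference was dropped */
    if (atomic_fetch_sub(&t->refcount, 1) == 1)
        atomic_store(&t->can_destroy, true);
}
```

In this model the worker takes a reference while it is still inside the read-side section, does the blocking work afterwards, and the destroy path drops the creator's reference and then waits for the flag.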
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: irqsave must always be used with the pool->lock spinlock · 5e3283e2
      Joe Thornber authored
      Commit c140e1c4 ("dm thin: use per thin device deferred bio lists")
      incorrectly stopped disabling irqs when taking the pool's spinlock.
      Irqs must be disabled when taking the pool's spinlock otherwise a thread
      could spin_lock(), then get interrupted to service thin_endio() in
      interrupt context, which would then deadlock in spin_lock_irqsave().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  10. 04 Apr, 2014 1 commit
    • dm thin: sort the per thin deferred bios using an rb_tree · 67324ea1
      Mike Snitzer authored
      A thin-pool will allocate blocks using FIFO order for all thin devices
      which share the thin-pool.  Because of this simplistic allocation the
      thin-pool's space can become fragmented quite easily; especially when
      multiple threads are requesting blocks in parallel.
      Sort each thin device's deferred_bio_list based on logical sector to
      help reduce fragmentation of the thin-pool's ondisk layout.
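As a rough illustration of the ordering only, the sketch below sorts an array of pending IOs by start sector with qsort(). The commit itself uses an rb_tree; the struct and function names here are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a deferred bio; only the start sector matters here. */
struct sk_deferred_io {
    uint64_t sector;
};

static int sk_cmp_sector(const void *a, const void *b)
{
    uint64_t sa = ((const struct sk_deferred_io *)a)->sector;
    uint64_t sb = ((const struct sk_deferred_io *)b)->sector;

    /* Compare without subtraction to avoid truncating a 64-bit difference. */
    return (sa > sb) - (sa < sb);
}

/* Order pending IOs by ascending logical sector before they are processed. */
static void sk_sort_deferred(struct sk_deferred_io *ios, size_t n)
{
    qsort(ios, n, sizeof(*ios), sk_cmp_sector);
}
```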
      The following tables illustrate the realized gains/potential offered by
      sorting each thin device's deferred_bio_list.  An "io size"-sized random
      read of the device would result in "seeks/io" fragments being read, with
      an average "distance/seek" between each fragment.
      Data was written to a single thin device using multiple threads via
      iozone (8 threads, 64K for both the block_size and io_size).
      unsorted:
           io size   seeks/io distance/seek
                4k    0.000   0b
               16k    0.013   11m
               64k    0.065   11m
              256k    0.274   10m
                1m    1.109   10m
                4m    4.411   10m
               16m    17.097  11m
               64m    60.055  13m
              256m    148.798 25m
                1g    809.929 21m
      sorted:
           io size   seeks/io distance/seek
                4k    0.000   0b
               16k    0.000   1g
               64k    0.001   1g
              256k    0.003   1g
                1m    0.011   1g
                4m    0.045   1g
               16m    0.181   1g
               64m    0.747   1011m
              256m    3.299   1g
                1g    14.373  1g
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  11. 31 Mar, 2014 2 commits
  12. 28 Mar, 2014 1 commit
  13. 05 Mar, 2014 4 commits
    • dm thin: fix noflush suspend IO queueing · 738211f7
      Joe Thornber authored
      i) by the time DM core calls the postsuspend hook the dm_noflush flag
      has been cleared.  So the old thin_postsuspend did nothing.  We need to
      use the presuspend hook instead.
      ii) There was a race between bios leaving DM core and arriving in the
      deferred queue.
      thin_presuspend now sets a 'requeue' flag causing all bios destined for
      that thin to be requeued back to DM core.  Then it requeues all held IO,
      and all IO on the deferred queue (destined for that thin).  Finally
      postsuspend clears the 'requeue' flag.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix deadlock in __requeue_bio_list · 18adc577
      Joe Thornber authored
      The spin lock in requeue_io() was held for too long, allowing deadlock.
      Don't worry, due to other issues addressed in the following "dm thin:
      fix noflush suspend IO queueing" commit, this code was never called.
      Fix this by taking the spin lock for a much shorter period of time.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: fix out of data space handling · 3e1a0699
      Joe Thornber authored
      Ideally a thin pool would never run out of data space; the low water
      mark would trigger userland to extend the pool before we completely run
      out of space.  However, many small random IOs to unprovisioned space can
      consume data space at an alarming rate.  Adjust your low water mark if
      you're frequently seeing "out-of-data-space" mode.
      Before this fix, if data space ran out the pool would be put in
      PM_READ_ONLY mode which also aborted the pool's current metadata
      transaction (data loss for any changes in the transaction).  This had a
      side-effect of needlessly compromising data consistency.  And retry of
      queued unserviceable bios, once the data pool was resized, could
      initiate changes to potentially inconsistent pool metadata.
      Now when the pool's data space is exhausted transition to a new pool
      mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data
      may not be allocated.  This allows users to remove thin volumes or
      discard data to recover data space.
      The pool is no longer put in PM_READ_ONLY mode in response to the pool
      running out of data space.  And PM_READ_ONLY mode no longer aborts the
      pool's current metadata transaction.  Also, set_pool_mode() will now
      notify userspace when the pool mode is changed.
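The mode semantics described in this message can be summarised as a small permission table. This is a sketch with hypothetical names; it models only the three modes discussed above and only what the text states about them.

```c
#include <stdbool.h>

/* Hypothetical mirror of the pool modes discussed above; not the kernel enum. */
enum sk_pool_mode {
    SK_WRITE,               /* metadata may change, data may be allocated */
    SK_OUT_OF_DATA_SPACE,   /* metadata may change, no new data blocks */
    SK_READ_ONLY            /* metadata may not change */
};

static bool sk_may_change_metadata(enum sk_pool_mode m)
{
    return m == SK_WRITE || m == SK_OUT_OF_DATA_SPACE;
}

static bool sk_may_allocate_data(enum sk_pool_mode m)
{
    return m == SK_WRITE;
}
```

Removing a thin volume or discarding data only needs metadata changes, which is why those operations remain possible once data space is exhausted.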
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: ensure user takes action to validate data and metadata consistency · 07f2b6e0
      Mike Snitzer authored
      If a thin metadata operation fails the current transaction will abort,
      which can cause data loss for IO layers up the stack (e.g. filesystems).
      As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's
      superblock which:
      1) requires the user verify the thin metadata is consistent (e.g. use
         thin_check, etc)
      2) suggests the user verify the thin data is consistent (e.g. use fsck)
      The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
      to run thin_repair.
      On metadata operation failure: abort current metadata transaction, set
      pool in read-only mode, and now set the needs_check flag.
      As part of this change, constraints are introduced or relaxed:
      * don't allow a pool to transition to write mode if needs_check is set
      * don't allow data or metadata space to be resized if needs_check is set
      * if a thin pool's metadata space is exhausted: the kernel will now
        force the user to take the pool offline for repair before the kernel
        will allow the metadata space to be extended.
      Also, update Documentation to include information about when the thin
      provisioning target commits metadata, how it handles metadata failures
      and running out of space.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
  14. 04 Mar, 2014 1 commit
    • dm thin: synchronize the pool mode during suspend · cdc2b415
      Mike Snitzer authored
      Commit b5330655 ("dm thin: handle metadata failures more consistently")
      increased potential for the pool's mode to be changed in response to
      metadata operation failures.
      When the pool mode is changed it isn't synchronized with the mode in
      pool_features stored in the target's context (ti->private) that is used
      as the basis for (re)establishing the pool mode during resume via
      It is important that we synchronize the pool mode when it is changed
      otherwise the pool may experience an unexpected mode transition on the
      next resume (especially if there was no new table load).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  15. 27 Feb, 2014 1 commit
    • dm thin: allow metadata space larger than supported to go unused · 7d48935e
      Mike Snitzer authored
      It was always intended that a user could provide a thin metadata device
      that is larger than the max supported by the on-disk format.  The extra
      space would just go unused.
      Unfortunately that never worked.  If the user attempted to use a larger
      metadata device on creation they would get an error like the following:
       device-mapper: space map common: space map too large
       device-mapper: transaction manager: couldn't create metadata space map
       device-mapper: thin metadata: tm_create_with_sm failed
       device-mapper: table: 252:17: thin-pool: Error creating metadata object
       device-mapper: ioctl: error adding target to table
      Fix this by allowing the initial metadata space map creation to cap its
      size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS).
      get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via
      THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at
      THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported).
      Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for
      the sizeof the disk_bitmap_header.  So the supported maximum metadata
      size is a bit smaller (reduced from 33423360 to 33292800 sectors).
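The two sector counts quoted above can be reproduced with a little arithmetic. The constants below are assumptions chosen because they yield exactly those figures (255 bitmaps of 2^14 entries, 4 KiB metadata blocks of 8 sectors each, and 64 entries per bitmap given up to the header); the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed constants; they reproduce the figures quoted in the message. */
enum {
    SK_BITMAPS            = 255,
    SK_ENTRIES_PER_BITMAP = 1 << 14,    /* 16384 */
    SK_HEADER_ENTRIES     = 64,         /* room taken by the bitmap header */
    SK_SECTORS_PER_BLOCK  = 4096 >> 9   /* 8 sectors per 4 KiB block */
};

/* Old maximum: the header is ignored. 255 * 16384 * 8 = 33423360 */
static uint64_t sk_max_sectors_before(void)
{
    return (uint64_t)SK_BITMAPS * SK_ENTRIES_PER_BITMAP * SK_SECTORS_PER_BLOCK;
}

/* New maximum: the header is accounted for. 255 * 16320 * 8 = 33292800 */
static uint64_t sk_max_sectors_after(void)
{
    return (uint64_t)SK_BITMAPS * (SK_ENTRIES_PER_BITMAP - SK_HEADER_ENTRIES)
           * SK_SECTORS_PER_BLOCK;
}

/* A larger device is accepted; the excess simply goes unused. */
static uint64_t sk_usable_metadata_sectors(uint64_t dev_sectors)
{
    uint64_t max = sk_max_sectors_after();
    return dev_sectors > max ? max : dev_sectors;
}
```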
      Lastly, remove the "excess space will not be used" warning message from
      get_metadata_dev_size(); it resulted in printing the warning multiple
      times.  Factor out warn_if_metadata_device_too_big(), call it from
      pool_ctr() and maybe_resize_metadata_dev().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  16. 24 Feb, 2014 1 commit
    • dm thin: fix the error path for the thin device constructor · 1acacc07
      Mike Snitzer authored
      dm_pool_close_thin_device() must be called if dm_set_target_max_io_len()
      fails in thin_ctr().  Otherwise __pool_destroy() will fail because the
      pool will still have an open thin device:
       device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open
       device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed.
      Also, must establish error code if failing thin_ctr() because the pool
      is in fail_io mode.
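The shape of the fix is the usual goto-based unwind: anything opened before a failing step is closed on the way out, and the error code is always set before jumping. The sketch below uses hypothetical stand-ins for the open, close, and max-io-len steps rather than the real dm functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the constructor state; counts open thin devices. */
struct sk_ctr_state {
    int open_devices;
};

static int sk_open_thin(struct sk_ctr_state *s)
{
    s->open_devices++;
    return 0;
}

static void sk_close_thin(struct sk_ctr_state *s)
{
    s->open_devices--;
}

/* max_io_len_ok simulates whether the later setup step succeeds. */
static int sk_thin_ctr(struct sk_ctr_state *s, bool max_io_len_ok)
{
    int r = sk_open_thin(s);
    if (r)
        return r;

    if (!max_io_len_ok) {
        r = -EINVAL;          /* always establish an error code before unwinding */
        goto bad_close;
    }
    return 0;

bad_close:
    sk_close_thin(s);         /* otherwise the pool still sees an open device */
    return r;
}
```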
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
  17. 17 Feb, 2014 1 commit
  18. 16 Jan, 2014 1 commit
  19. 07 Jan, 2014 7 commits
    • dm thin: fix set_pool_mode exposed pool operation races · 8b64e881
      Mike Snitzer authored
      The pool mode must not be switched until after the corresponding pool
      process_* methods have been established.  Otherwise, because
      set_pool_mode() isn't interlocked with the IO path for performance
      reasons, the IO path can end up executing process_* operations that
      don't match the mode.  This patch eliminates problems like the following
      (as seen on really fast PCIe SSD storage when transitioning the pool's
      mode from PM_READ_ONLY to PM_WRITE):
      kernel: device-mapper: thin: 253:2: reached low water mark for data device: sending event.
      kernel: device-mapper: thin: 253:2: no free data space available.
      kernel: device-mapper: thin: 253:2: switching pool to read-only mode
      kernel: device-mapper: thin: 253:2: switching pool to write mode
      kernel: ------------[ cut here ]------------
      kernel: WARNING: CPU: 11 PID: 7564 at drivers/md/dm-thin.c:995 handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]()
      kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
      kernel: 00000000000003e3 ffff880308831cc8 ffffffff8152ebcb 00000000000003e3
      kernel: 0000000000000000 ffff880308831d08 ffffffff8104c46c ffff88032502a800
      kernel: ffff880036409000 ffff88030ec7ce00 0000000000000001 00000000ffffffc3
      kernel: Call Trace:
      kernel: [<ffffffff8152ebcb>] dump_stack+0x49/0x5e
      kernel: [<ffffffff8104c46c>] warn_slowpath_common+0x8c/0xc0
      kernel: [<ffffffff8104c4ba>] warn_slowpath_null+0x1a/0x20
      kernel: [<ffffffffa001e2c6>] handle_unserviceable_bio+0x146/0x160 [dm_thin_pool]
      kernel: [<ffffffffa001f276>] process_bio_read_only+0x136/0x180 [dm_thin_pool]
      kernel: [<ffffffffa0020b75>] process_deferred_bios+0xc5/0x230 [dm_thin_pool]
      kernel: [<ffffffffa0020d31>] do_worker+0x51/0x60 [dm_thin_pool]
      kernel: [<ffffffff81067823>] process_one_work+0x183/0x490
      kernel: [<ffffffff81068c70>] worker_thread+0x120/0x3a0
      kernel: [<ffffffff81068b50>] ? manage_workers+0x160/0x160
      kernel: [<ffffffff8106e86e>] kthread+0xce/0xf0
      kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
      kernel: [<ffffffff8153b3ec>] ret_from_fork+0x7c/0xb0
      kernel: [<ffffffff8106e7a0>] ? kthread_freezable_should_stop+0x70/0x70
      kernel: ---[ end trace 3f00528e08ffa55c ]---
      kernel: device-mapper: thin: pool mode is PM_WRITE not PM_READ_ONLY like expected!?
      dm-thin.c:995 was the WARN_ON_ONCE(get_pool_mode(pool) != PM_READ_ONLY);
      at the top of handle_unserviceable_bio().  And as the additional
      debugging I had conveys: the pool mode was _not_ PM_READ_ONLY like
      expected, it was already PM_WRITE, yet pool->process_bio was still set
      to process_bio_read_only().
      Also, while fixing this up, reduce logging of redundant pool mode
      transitions by checking new_mode is different from old_mode.
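A user-space sketch of the ordering, using C11 atomics and hypothetical names: the handler is published before the mode, so code that reads the new mode and then loads the handler does not find a stale one. It also skips redundant transitions, as described above. This illustrates the ordering only and is not the kernel's locking scheme.

```c
#include <stdatomic.h>

enum sk_mode { SK_PM_READ_ONLY, SK_PM_WRITE };

typedef int (*sk_process_fn)(void);

/* Each handler reports the mode it was written for. */
static int sk_process_bio_read_only(void) { return SK_PM_READ_ONLY; }
static int sk_process_bio_write(void)     { return SK_PM_WRITE; }

struct sk_pool {
    _Atomic(sk_process_fn) process_bio;
    atomic_int mode;
};

static void sk_pool_init(struct sk_pool *p)
{
    atomic_init(&p->process_bio, sk_process_bio_read_only);
    atomic_init(&p->mode, SK_PM_READ_ONLY);
}

static void sk_set_pool_mode(struct sk_pool *p, int new_mode)
{
    if (atomic_load(&p->mode) == new_mode)
        return;                           /* nothing to do, nothing to log */

    /* 1. establish the handler that matches the new mode */
    atomic_store(&p->process_bio,
                 new_mode == SK_PM_WRITE ? sk_process_bio_write
                                         : sk_process_bio_read_only);
    /* 2. only then switch the mode itself */
    atomic_store(&p->mode, new_mode);
}
```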
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: eliminate the no_free_space flag · 6d16202b
      Mike Snitzer authored
      The pool's error_if_no_space flag can easily serve the same purpose that
      no_free_space did, namely: control whether handle_unserviceable_bio()
      will error a bio or requeue it.
      This is cleaner since error_if_no_space is established when the pool's
      features are processed during table load.  So it avoids managing the
      no_free_space flag by taking the pool's spinlock.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: add error_if_no_space feature · 787a996c
      Mike Snitzer authored
      If the pool runs out of data or metadata space, the pool can either
      queue or error the IO destined to the data device.  The default is to
      queue the IO until more space is added.
      An admin may now configure the pool to error IO when no space is
      available by setting the 'error_if_no_space' feature when loading the
      thin-pool table.
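The resulting choice is a simple two-way decision. A minimal sketch with hypothetical names, modelling only the queue-or-error behaviour described above:

```c
#include <stdbool.h>

/* Hypothetical outcome for an IO that cannot be serviced for lack of space. */
enum sk_disposition {
    SK_QUEUE_IO,   /* default: hold the IO until more space is added */
    SK_ERROR_IO    /* 'error_if_no_space' set: fail the IO immediately */
};

static enum sk_disposition sk_no_space_disposition(bool error_if_no_space)
{
    return error_if_no_space ? SK_ERROR_IO : SK_QUEUE_IO;
}
```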
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm thin: requeue bios to DM core if no_free_space and in read-only mode · 8c0f0e8c
      Mike Snitzer authored
      Now that we switch the pool to read-only mode when the data device runs
      out of space it causes active writers to get IO errors once we resume
      after resizing the data device.
      If no_free_space is set, save bios to the 'retry_on_resume_list' and
      requeue them on resume (once the data or metadata device may have been
      resized).
      With this patch the resize_io test passes again (on slower storage):
       dmtest run --suite thin-provisioning -n /resize_io/
      Later patches fix some subtle races associated with the pool mode
      transitions done as part of the pool's -ENOSPC handling.  These races
      are exposed on fast storage (e.g. PCIe SSD).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm thin: cleanup and improve no space handling · 399caddf
      Mike Snitzer authored
      Factor out_of_data_space() out of alloc_data_block().  Eliminate the use
      of 'no_free_space' as a latch in alloc_data_block() -- this is no longer
      needed now that we switch to read-only mode when we run out of data or
      metadata space.  In a later patch, the 'no_free_space' flag will be
      eliminated entirely (in favor of checking metadata rather than relying
      on a transient flag).
      Move no metadata space handling into metadata_operation_failed().  Set
      no_free_space when metadata space is exhausted too.  This is useful,
      because it offers consistency, for the following patch that will requeue
      data IOs if no_free_space.
      Also, rename no_space() to retry_bios_on_resume().
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
    • dm thin: handle metadata failures more consistently · b5330655
      Joe Thornber authored
      Introduce metadata_operation_failed() wrappers, around set_pool_mode(),
      to assist with improving the consistency of how metadata failures are
      handled.  Logging is improved and metadata operation failures trigger
      read-only mode immediately.
      Also, eliminate redundant set_pool_mode() calls in the two
      alloc_data_block() callers' error paths.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>