1. 07 Jan, 2014 2 commits
    • dm thin: fix discard support to a previously shared block · 19fa1a67
      Joe Thornber authored
      If a snapshot is created and later deleted, the origin dm_thin_device's
      snapshotted_time will have been updated to reflect the snapshot's
      creation time.  The 'shared' flag in the dm_thin_lookup_result struct
      returned from dm_thin_find_block() is an approximation based on
      snapshotted_time -- this is done to avoid O(n), or worse, time
      complexity.  In this case, the shared flag would be true.
      
      But because the 'shared' flag reflects an approximation, a block can be
      incorrectly assumed to be shared (e.g. false positive for 'shared'
      because the snapshot no longer exists).  This could result in discards
      issued to a thin device not being passed down to the pool's underlying
      data device.
      
      To fix this, after a mapping is removed we use dm_pool_block_is_used()
      to double check that the thin block is really still in use.  If the
      reference count for the block is now zero, the discard is allowed to be
      passed down.
      
      Also add a 'definitely_not_shared' member to the dm_thin_new_mapping
      structure.  It reflects that the 'shared' flag in the response from
      dm_thin_find_block() can only be treated as definitive if it is
      false.
      
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1043527
      
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
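      The decision described in commit 19fa1a67 can be sketched in plain
      userspace C.  This is only an illustration under stated assumptions:
      the ref_count array, block_is_used() and
      remove_mapping_and_check_passdown() are hypothetical stand-ins, not
      the kernel's code.  Only the idea comes from the commit message: a
      false 'shared' flag is definitive, while a true one must be confirmed
      against the real reference count.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_BLOCKS 8 /* hypothetical pool size for this sketch */

/* Toy reference counts, standing in for the pool's data space map. */
static uint32_t ref_count[NR_BLOCKS];

/* Hypothetical analogue of dm_pool_block_is_used(). */
static bool block_is_used(uint32_t b)
{
    return ref_count[b] > 0;
}

/*
 * Drop one mapping to block b, then decide whether the discard may be
 * passed down.  approx_shared plays the role of the 'shared' flag, which
 * may be a false positive; only a false value is definitive.
 */
static bool remove_mapping_and_check_passdown(uint32_t b, bool approx_shared)
{
    bool definitely_not_shared = !approx_shared;

    if (ref_count[b] > 0)
        ref_count[b]--;

    if (definitely_not_shared)
        return true;

    /* The flag may be stale, so consult the real reference count. */
    return !block_is_used(b);
}
```

      With a stale flag and a single remaining reference the discard is
      passed down; with a genuinely shared block (two references) it is
      not.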
    • dm thin: initialize dm_thin_new_mapping returned by get_next_mapping · 16961b04
      Mike Snitzer authored
      
      
      As additional members are added to the dm_thin_new_mapping structure
      care should be taken to make sure they get initialized before use.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
  2. 10 Dec, 2013 5 commits
    • dm thin: allow pool in read-only mode to transition to read-write mode · 9b7aaa64
      Joe Thornber authored
      
      
      A thin-pool may be in read-only mode because the pool's data or metadata
      space was exhausted.  To allow for recovery, by adding more space to the
      pool, we must allow a pool to transition from PM_READ_ONLY to PM_WRITE
      mode.  Otherwise, running out of space will render the pool permanently
      read-only.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: re-establish read-only state when switching to fail mode · 5383ef3a
      Joe Thornber authored
      
      
      If the thin-pool transitioned to fail mode and the thin-pool's table
      were then reloaded for some reason, the new table's default pool mode
      would be read-write, though it would transition to fail mode during
      resume.
      
      When the pool mode transitions directly from PM_WRITE to PM_FAIL we need
      to re-establish the intermediate read-only state in both the metadata
      and persistent-data block manager (as is usually done with the normal
      pool mode transition sequence: PM_WRITE -> PM_READ_ONLY -> PM_FAIL).
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
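      The transition rule in commit 5383ef3a can be modelled as a small
      state machine.  The sketch below is a toy under stated assumptions:
      the mode names come from the commit message, but struct toy_pool, its
      two flags and set_pool_mode() are hypothetical and do not reproduce
      the kernel's function.  It shows that entering PM_FAIL establishes
      the same read-only state that PM_READ_ONLY would, so a direct
      PM_WRITE -> PM_FAIL transition does not skip it.

```c
#include <stdbool.h>

enum pool_mode { PM_WRITE, PM_READ_ONLY, PM_FAIL };

/* Hypothetical model of the state touched by a mode change. */
struct toy_pool {
    enum pool_mode mode;
    bool metadata_read_only;      /* stand-in for the metadata state */
    bool block_manager_read_only; /* stand-in for the block manager state */
};

static void set_pool_mode(struct toy_pool *p, enum pool_mode new_mode)
{
    switch (new_mode) {
    case PM_FAIL:
    case PM_READ_ONLY:
        /*
         * Both modes (re-)establish the read-only state, so it is in
         * place even if PM_READ_ONLY was never visited on the way.
         */
        p->metadata_read_only = true;
        p->block_manager_read_only = true;
        break;
    case PM_WRITE:
        p->metadata_read_only = false;
        p->block_manager_read_only = false;
        break;
    }
    p->mode = new_mode;
}
```

      The PM_WRITE case also mirrors the recovery path of commit 9b7aaa64,
      where a read-only pool returns to read-write mode.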
    • dm thin: always fallback the pool mode if commit fails · 020cc3b5
      Joe Thornber authored
      
      
      Rename commit_or_fallback() to commit().  Now all previous calls to
      commit() will trigger the pool mode to fall back if the commit fails.
      
      Also, check the error returned from commit() in alloc_data_block().
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: switch to read-only mode if metadata space is exhausted · 4a02b34e
      Mike Snitzer authored
      
      
      Switch the thin pool to read-only mode in alloc_data_block() if
      dm_pool_alloc_data_block() fails because the pool's metadata space is
      exhausted.
      
      Differentiate between data and metadata space in messages about no
      free space available.
      
      This issue was noticed with the device-mapper-test-suite using:
      dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
      
      The quantity of errors logged in this case must be reduced.
      
      before patch:
      
      device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      <snip ... these repeat for a _very_ long while ... >
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: 253:4: commit failed: error = -28
      device-mapper: thin: 253:4: switching pool to read-only mode
      
      after patch:
      
      device-mapper: thin: 253:4: reached low water mark for metadata device: sending event.
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: 253:4: no free metadata space available.
      device-mapper: thin: 253:4: switching pool to read-only mode
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: switch to read only mode if a mapping insert fails · fafc7a81
      Joe Thornber authored
      
      
      Switch the thin pool to read-only mode when dm_thin_insert_block() fails
      since there is little reason to expect the cause of the failure to be
      resolved without further action by user space.
      
      This issue was noticed with the device-mapper-test-suite using:
      dmtest run --suite thin-provisioning -n /exhausting_metadata_space_causes_fail_mode/
      
      The quantity of errors logged in this case must be reduced.
      
      before patch:
      
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: dm_thin_insert_block() failed
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map metadata: unable to allocate new metadata block
      <snip ... these repeat for a long while ... >
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: space map common: dm_tm_shadow_block() failed
      device-mapper: thin: 253:4: no free metadata space available.
      device-mapper: thin: 253:4: switching pool to read-only mode
      
      after patch:
      
      device-mapper: space map metadata: unable to allocate new metadata block
      device-mapper: thin: 253:4: dm_thin_insert_block() failed: error = -28
      device-mapper: thin: 253:4: switching pool to read-only mode
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
  3. 23 Sep, 2013 1 commit
    • dm thin: do not expose non-zero discard limits if discards disabled · b60ab990
      Mike Snitzer authored
      
      
      Fix issue where the block layer would stack the discard limits of the
      pool's data device even if the "ignore_discard" pool feature was
      specified.
      
      The pool and thin device(s) still had discards disabled because the
      QUEUE_FLAG_DISCARD request_queue flag wasn't set.  But to avoid user
      confusion when "ignore_discard" is used: both the pool device and the
      thin device(s) have zeroes for all discard limits.
      
      Also, always set discard_zeroes_data_unsupported in targets because they
      should never advertise the 'discard_zeroes_data' capability (even if the
      pool's data device supports it).
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
  4. 06 Sep, 2013 3 commits
  5. 23 Aug, 2013 1 commit
  6. 19 May, 2013 1 commit
    • dm thin: fix metadata dev resize detection · 610bba8b
      Alasdair G Kergon authored
      Fix detection of the need to resize the dm thin metadata device.
      
      The code incorrectly tried to extend the metadata device when it
      didn't need to due to a merging error with patch 24347e95
      ("dm thin: detect metadata device resizing").
      
        device-mapper: transaction manager: couldn't open metadata space map
        device-mapper: thin metadata: tm_open_with_sm failed
        device-mapper: thin: aborting transaction failed
        device-mapper: thin: switching pool to failure mode
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  7. 10 May, 2013 4 commits
  8. 20 Mar, 2013 2 commits
    • dm thin: fix non power of two discard granularity calc · 58051b94
      Joe Thornber authored
      Fix a discard granularity calculation to work for non power of 2 block sizes.
      
      In order for thinp to pass down discard bios to the underlying data
      device, the data device must have a discard granularity that is a
      factor of the thinp block size.  Originally this check was done by
      using bitops since the block_size was known to be a power of two.
      
      Introduced by commit f13945d7
      ("dm thin: support a non power of 2 discard_granularity").
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
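      To illustrate why a bit-mask test stops working for the factor check
      in commit 58051b94, the sketch below compares two hypothetical
      helpers.  Neither is claimed to be the kernel's expression:
      is_factor_pow2_only() is one possible mask-based form that is only
      valid when both values are powers of two, and is_factor() is a plain
      modulo test that works for any block size.  Sizes are assumed to be
      in 512-byte sectors.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask-based test (hypothetical): correct only when both block_size and
 * granularity are powers of two.
 */
static bool is_factor_pow2_only(uint32_t block_size, uint32_t granularity)
{
    return (block_size & (granularity - 1)) == 0;
}

/* General test: granularity divides block_size exactly. */
static bool is_factor(uint32_t block_size, uint32_t granularity)
{
    return granularity != 0 && (block_size % granularity) == 0;
}
```

      For a 384-sector block and a 192-sector granularity, the modulo test
      accepts the pair (384 = 2 * 192) while the mask-based test rejects
      it, because 384 & 191 is 128 rather than 0.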
    • dm thin: fix discard corruption · f046f89a
      Joe Thornber authored
      
      
      Fix a bug in dm_btree_remove that could leave leaf values with incorrect
      reference counts.  The effect of this was that removal of a shared block
      could result in the space maps thinking the block was no longer used.
      More concretely, if you have a thin device and a snapshot of it, sending
      a discard to a shared region of the thin could corrupt the snapshot.
      
      Thinp uses a 2-level nested btree to store its mappings.  The first
      level is indexed by thin device, and the second level by logical
      block.
      
      Often when we're removing an entry in this mapping tree we need to
      rebalance nodes, which can involve shadowing them, possibly creating a
      copy if the block is shared.  If we do create a copy then children of
      that node need to have their reference counts incremented.  In this
      way reference counts percolate down the tree as shared trees diverge.
      
      The rebalance functions were incrementing the children at the
      appropriate time, but they were always assuming the children were
      internal nodes.  This meant the leaf values (in our case packed
      block/flags entries) were not being incremented.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  9. 01 Mar, 2013 7 commits
    • dm thin: remove cells from stack · 025b9685
      Joe Thornber authored
      
      
      This patch takes advantage of the new bio-prison interface where the
      memory is now passed in rather than using a mempool in bio-prison.
      This allows the map function to avoid performing potentially-blocking
      allocations that could lead to deadlocks: We want to avoid the cell
      allocation that is done in bio_detain.
      
      (The potential for mempool deadlocks still remains in other functions
      that use bio_detain.)
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm bio prison: pass cell memory in · 6beca5eb
      Joe Thornber authored
      
      
      Change the dm_bio_prison interface so that instead of allocating memory
      internally, dm_bio_detain is supplied with a pre-allocated cell each
      time it is called.
      
      This enables a subsequent patch to move the allocation of the struct
      dm_bio_prison_cell outside the thin target's mapping function so it can
      no longer block there.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm kcopyd: introduce configurable throttling · df5d2e90
      Mikulas Patocka authored
      
      
      This patch allows the administrator to reduce the rate at which kcopyd
      issues I/O.
      
      Each module that uses kcopyd acquires a throttle parameter that can be
      set in /sys/module/*/parameters.
      
      We maintain a history of kcopyd usage by each module in the variables
      io_period and total_period in struct dm_kcopyd_throttle. The actual
      kcopyd activity is calculated as a percentage of time equal to
      "(100 * io_period / total_period)".  This is compared with the user-defined
      throttle percentage threshold and if it is exceeded, we sleep.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
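      The throttle test from commit df5d2e90 reduces to one integer
      formula.  A minimal sketch follows, assuming a simplified struct:
      the field names io_period and total_period come from the commit
      message, while struct toy_throttle, activity_percent() and
      should_sleep() are hypothetical and ignore how the kernel samples
      and ages these counters.

```c
#include <stdbool.h>

/* Simplified, hypothetical view of the per-module throttle state. */
struct toy_throttle {
    unsigned throttle;     /* user-defined percentage threshold, 0..100 */
    unsigned io_period;    /* time spent with kcopyd I/O in flight */
    unsigned total_period; /* total time observed */
};

/* Activity as a percentage: 100 * io_period / total_period. */
static unsigned activity_percent(const struct toy_throttle *t)
{
    if (t->total_period == 0)
        return 0; /* nothing observed yet, avoid dividing by zero */
    return 100 * t->io_period / t->total_period;
}

/* Sleep only when measured activity exceeds the threshold. */
static bool should_sleep(const struct toy_throttle *t)
{
    return activity_percent(t) > t->throttle;
}
```

      With io_period = 30 and total_period = 40 the activity is 75%, so a
      threshold of 50 makes the caller sleep and a threshold of 80 does
      not.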
    • dm: rename request variables to bios · 55a62eef
      Alasdair G Kergon authored
      
      
      Use 'bio' in the name of variables and functions that deal with
      bios rather than 'request' to avoid confusion with the normal
      block layer use of 'request'.
      
      No functional changes.
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: use block_size_is_power_of_two · 58f77a21
      Mike Snitzer authored
      
      
      Use block_size_is_power_of_two() rather than checking
      sectors_per_block_shift directly.  Also introduce local pool variable in
      get_bio_block() to eliminate redundant tc->pool dereferences.
      
      No functional change.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: support a non power of 2 discard_granularity · f13945d7
      Mike Snitzer authored
      Support a non-power-of-2 discard granularity in dm-thin, now that the
      block layer supports this (via 8dd2cb7e "block: discard granularity
      might not be power of 2" and 59771079 "blk: avoid divide-by-zero with
      zero discard granularity").
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm: fix truncated status strings · fd7c092e
      Mikulas Patocka authored
      
      
      Avoid returning a truncated table or status string instead of setting
      the DM_BUFFER_FULL_FLAG when the last target of a table fills the
      buffer.
      
      When processing a table or status request, the function retrieve_status
      calls ti->type->status. If ti->type->status returns non-zero,
      retrieve_status assumes that the buffer overflowed and sets
      DM_BUFFER_FULL_FLAG.
      
      However, targets don't return non-zero values from their status method
      on overflow. Most targets always return zero.
      
      If a buffer overflow happens in a target that is not the last in the
      table, it gets noticed during the next iteration of the loop in
      retrieve_status; but if a buffer overflow happens in the last target, it
      goes unnoticed and erroneously truncated data is returned.
      
      In the current code, the targets behave in the following way:
      * dm-crypt returns -ENOMEM if there is not enough space to store the
        key, but it returns 0 on all other overflows.
      * dm-thin returns errors from the status method if a disk error happened.
        This is incorrect because retrieve_status doesn't check the error
        code, it assumes that all non-zero values mean buffer overflow.
      * all the other targets always return 0.
      
      This patch changes the ti->type->status function to return void (because
      most targets don't use the return code). Overflow is detected in
      retrieve_status: if the status method fills up the remaining space
      completely, it is assumed that buffer overflow happened.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
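      The overflow rule adopted by commit fd7c092e (a status method returns
      void, and output that fills the remaining space completely is treated
      as overflow) can be sketched in userspace C.  toy_status() and
      toy_retrieve_status() are hypothetical names that only model the
      idea; they are not the kernel's retrieve_status.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical status method: returns void and silently truncates its
 * output to maxlen - 1 characters plus the terminating NUL.
 */
static void toy_status(const char *text, char *result, size_t maxlen)
{
    snprintf(result, maxlen, "%s", text);
}

/*
 * Returns true when the buffer must be reported as full.  The caller
 * cannot distinguish an exact fit from truncation, so both are treated
 * as overflow.
 */
static bool toy_retrieve_status(const char *text, char *buf, size_t remaining)
{
    if (remaining == 0)
        return true;

    toy_status(text, buf, remaining);
    return strlen(buf) + 1 >= remaining;
}
```

      The conservative choice costs at most one spurious "buffer full"
      report on an exact fit, after which user space would be expected to
      retry with a larger buffer.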
  10. 31 Jan, 2013 1 commit
    • dm thin: fix queue limits stacking · 0f640dca
      Mike Snitzer authored
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      which can lead to incorrect limits being set.  The fix here simply
      deletes the thin_io_hints() hook which leaves the existing stacking
      infrastructure to set the limits correctly.
      
      When a thin-pool uses an MD device for the data device a thin device
      from the thin-pool must respect MD's constraints about disallowing a bio
      from spanning multiple chunks.  Otherwise we can see problems.  If the raid0
      chunksize is 1152K and thin-pool chunksize is 256K I see the following
      md/raid0 error (with extra debug tracing added to thin_endio) when
      mkfs.xfs is executed against the thin device:
      
      md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
      device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0
      
      This extra DM debugging shows that the failing bio is spanning across
      the first and second logical 1152K chunk (sector 2080 + 255 takes the
      bio beyond the first chunk's boundary of sector 2304).  So the bio
      splitting that DM is doing clearly isn't respecting the MD limits.
      
      max_hw_sectors_kb is 127 for both the thin-pool and thin device
      (queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
      precision).  So this explains why bi_size is 130560.
      
      But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
      that it doesn't have a .merge function (for bio_add_page to consult
      indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
      device that has a compulsory merge_bvec_fn.  This scenario is exactly
      why DM must resort to sending single PAGE_SIZE bios to the underlying
      layer. Some additional context for this is available in the header for
      commit 8cbeb67a ("dm: avoid unsupported spanning of md stripe
      boundaries").
      
      Long story short, the reason a thin device doesn't properly get
      configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
      thin_io_hints() is blindly copying the queue limits from the thin-pool
      device directly to the thin device's queue limits.
      
      Fix this by eliminating thin_io_hints.  Doing so is safe because the
      block layer's queue limits stacking already enables the upper level thin
      device to inherit the thin-pool device's discard and minimum_io_size and
      optimal_io_size limits that get set in pool_io_hints.  But avoiding the
      queue limits copy allows the thin and thin-pool limits to be different
      where it is important, namely max_hw_sectors_kb.
      Reported-by: Daniel Browning <db@kavod.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
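      The numbers quoted in commit 0f640dca can be checked with a few
      lines of arithmetic.  The helpers below are hypothetical and assume
      512-byte sectors.  They reproduce why 255 sectors shows up as 127 in
      max_hw_sectors_kb, why bi_size is 130560, and why a 255-sector bio
      starting at sector 2080 crosses the 1152K (2304-sector) chunk
      boundary.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Integer conversion, which rounds 127.5 KiB down to 127. */
static uint32_t sectors_to_kb(uint32_t sectors)
{
    return sectors * SECTOR_SIZE / 1024;
}

static uint32_t kb_to_sectors(uint32_t kb)
{
    return kb * 1024 / SECTOR_SIZE;
}

/* True if [start, start + len) touches more than one chunk. */
static bool spans_chunk_boundary(uint64_t start, uint32_t len,
                                 uint32_t chunk_sectors)
{
    if (len == 0)
        return false;
    return start / chunk_sectors != (start + len - 1) / chunk_sectors;
}
```

      Division is used rather than a mask because the 1152K chunk size in
      the report is not a power of two.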
  11. 21 Dec, 2012 10 commits
    • dm: remove map_info · 7de3ee57
      Mikulas Patocka authored
      
      
      This patch removes map_info from bio-based device mapper targets.
      map_info is still used for request-based targets.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: dont use map_context · 59c3d2c6
      Mikulas Patocka authored
      
      
      This patch removes endio_hook_pool from dm-thin and uses per-bio data instead.
      
      This patch removes any use of map_info in preparation for the next patch
      that removes map_info from bio-based device mapper.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm kcopyd: add WRITE SAME support to dm_kcopyd_zero · 70d6c400
      Mike Snitzer authored
      
      
      Add WRITE SAME support to dm-io and make it accessible to
      dm_kcopyd_zero().  dm_kcopyd_zero() provides an asynchronous interface
      whereas the blkdev_issue_write_same() interface is synchronous.
      
      WRITE SAME is a SCSI command that can be leveraged for more efficient
      zeroing of a specified logical extent of a device which supports it.
      Only a single zeroed logical block is transferred to the target for each
      WRITE SAME and the target then writes that same block across the
      specified extent.
      
      The dm thin target uses this.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: use DMERR_LIMIT for errors · c397741c
      Mike Snitzer authored
      
      
      Throttle all errors logged from the IO path by dm thin.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: cleanup dead code · 2aab3850
      Joe Thornber authored
      
      
      Remove unused @data_block parameter from cell_defer.
      Change thin_bio_map to use many returns rather than setting a variable.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: rename cell_defer_except to cell_defer_no_holder · f286ba0e
      Joe Thornber authored
      
      
      Rename cell_defer_except() to cell_defer_no_holder() which describes
      its function more clearly.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: emit ignore_discard in status when discards disabled · 018debea
      Mike Snitzer authored
      
      
      If "ignore_discard" is specified when creating the thin pool device then
      discard support is disabled for that device.  The pool device's status
      should reflect this fact rather than stating "no_discard_passdown"
      (which implies discards are enabled but passdown is disabled).
      Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: wake worker when discard is prepared · 563af186
      Joe Thornber authored
      
      
      When discards are prepared it is best to directly wake the worker that
      will process them.  The worker will be woken anyway, via periodic
      commit, but there is no reason to not wake_worker here.
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: fix race between simultaneous io and discards to same block · e8088073
      Joe Thornber authored
      
      
      There is a race when discard bios and non-discard bios are issued
      simultaneously to the same block.
      
      Discard support is expensive for all thin devices precisely because you
      have to be careful to quiesce the area you're discarding.  DM thin must
      handle this conflicting IO pattern (simultaneous non-discard vs discard)
      even though a sane application shouldn't be issuing such IO.
      
      The race manifests as follows:
      
      1. A non-discard bio is mapped in thin_bio_map.
         This doesn't lock out parallel activity to the same block.
      
      2. A discard bio is issued to the same block as the non-discard bio.
      
      3. The discard bio is locked in a dm_bio_prison_cell in process_discard
         to lock out parallel activity against the same block.
      
      4. The non-discard bio's mapping continues and its all_io_entry is
         incremented so the bio is accounted for in the thin pool's all_io_ds
         which is a dm_deferred_set used to track time locality of non-discard IO.
      
      5. The non-discard bio is finally locked in a dm_bio_prison_cell in
         process_bio.
      
      The race can result in deadlock, leaving the block layer hanging waiting
      for completion of a discard bio that never completes, e.g.:
      
      INFO: task ruby:15354 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
       ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
       ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
       ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
      Call Trace:
       [<ffffffff814e5a19>] schedule+0x29/0x70
       [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
       [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
       [<ffffffff814e589e>] wait_for_common+0x11e/0x190
       [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
       [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
       [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
       [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
       [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
       [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
       [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
       [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
       [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
       [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b
      
      The thinp-test-suite's test_discard_random_sectors reliably hits this
      deadlock on fast SSD storage.
      
      The fix for this race is that the all_io_entry for a bio must be
      incremented whilst the dm_bio_prison_cell is held for the bio's
      associated virtual and physical blocks.  That cell locking wasn't
      occurring early enough in thin_bio_map.  This patch fixes this.
      
      Care is taken to always call the new function inc_all_io_entry() with
      the relevant cells locked, but they are generally unlocked before
      calling issue() to try to avoid holding the cells locked across
      generic_submit_request.
      
      Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
      longer the only thread that will do so.  Because of this we must be sure
      to use cell_defer_except() to release all non-holder entries, that
      were added by the other thread, because they must be deferred.
      
      This patch depends on "dm thin: replace dm_cell_release_singleton with
      cell_defer_except".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@vger.kernel.org
    • dm thin: replace dm_cell_release_singleton with cell_defer_except · b7ca9c92
      Joe Thornber authored
      
      
      Change existing users of the function dm_cell_release_singleton to share
      cell_defer_except instead, and then remove the now-unused function.
      
      Everywhere that calls dm_cell_release_singleton, the bio in question
      is the holder of the cell.
      
      If there are no non-holder entries in the cell then cell_defer_except
      behaves exactly like dm_cell_release_singleton.  Conversely, if there
      *are* non-holder entries then dm_cell_release_singleton must not be used
      because those entries would need to be deferred.
      
      Consequently, it is safe to replace use of dm_cell_release_singleton
      with cell_defer_except.
      
      This patch is a pre-requisite for "dm thin: fix race between
      simultaneous io and discards to same block".
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
  12. 12 Oct, 2012 3 commits