- 13 Nov, 2014 2 commits
-
-
Mike Snitzer authored
Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mikulas Patocka authored
As long as struct thin_c is in the list, anyone can grab a reference of it. Consequently, we must wait for the reference count to drop to zero *after* we remove the structure from the list, not before. Signed-off-by:
Mikulas Patocka <mpatocka@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
- 10 Nov, 2014 14 commits
-
-
Joe Thornber authored
Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Also refactor some other bio_list erroring helpers. Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Eliminate redundant should_error_unserviceable_bio check and error loop. Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Sort the cells in logical block order before processing each cell in process_thin_deferred_cells(). This significantly improves the ondisk layout on rotational storage, whereby improving read performance. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
This use of direct submission in process_shared_bio() reduces latency for submitting bios in the shared cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
This use of direct submission in process_prepared_mapping() reduces latency for submitting bios in a cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. But this direct submission exposes the potential for a race between releasing a cell and incrementing deferred set. Fix this by introducing dm_cell_visit_release() and refactoring inc_remap_and_issue_cell() accordingly. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
This avoids dropping the cell, so increases the probability that other bios will collect within the cell, rather than being passed individually to the worker. Also add required process_cell and process_discard_cell error handling wrappers and set associated pool-mode function pointers accordingly. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Purely cleanup of duplicated code, no functional change. Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
When processing a discard bio, if the block is already quiesced do the discard immediately rather than adding the mapping to a list for the next iteration of the worker thread. Discarding a fully provisioned 100G thin volume with 64k block size goes from 860s to 95s with this change. Clearly there's something wrong with the worker architecture, more investigation needed. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Introduce thin_merge so that any additional constraints from the data volume may be taken into account when determing the maximum number of sectors that can be issued relative to the specified logical offset. This is particularly important if/when the data volume is layered ontop of a more sophisticated device (e.g. dm-raid or some other DM target). Reviewed-by:
Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
Allows for filesystems to submit bios that are a factor of the thinp blocksize, improving dm-thinp efficiency (particularly when the data volume is RAID). Also set io_min to max_sectors_kb if it is a factor of the thinp blocksize. Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Throttle IO based on the time it's taking the worker to do one loop. There were reports of hung task timeouts occuring and it was observed that the excessively long avgqu-sz (as reported by iostat) was contributing to these hung tasks. Throttling definitely helps dm-thinp perform better under heavy IO load (without being detremental by being overzealous). It reduces avgqu-sz drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And avgqu-sz stays at or below 30K even with dirty_ratio=20, dirty_background_ratio=10. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Prefetch metadata at the start of the worker thread and then again every 128th bio processed from the deferred list. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Previously it was using a fixed sized hash table. There are times when very many concurrent cells are held (such as when processing a very large discard). When this happens the hash table performance becomes very poor. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
- 04 Nov, 2014 1 commit
-
-
Joe Thornber authored
Avoids normal IO racing with discard. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
-
- 01 Aug, 2014 3 commits
-
-
Mike Snitzer authored
Before, if the block layer's limit stacking didn't establish an optimal_io_size that was compatible with the thin-pool's data block size we'd set optimal_io_size to the data block size and minimum_io_size to 0 (which the block layer adjusts to be physical_block_size). Update pool_io_hints() to set both minimum_io_size and optimal_io_size to the thin-pool's data block size. This fixes an issue reported where mkfs.xfs would create more XFS Allocation Groups on thinp volumes than on a normal linear LV of comparable size, see: https://bugzilla.redhat.com/show_bug.cgi?id=1003227 Reported-by:
Chris Murphy <lists@colorremedies.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Track the size of any external origin. Previously the external origin's size had to be a multiple of the thin-pool's block size, that is no longer a requirement. In addition, snapshots that are larger than the external origin are now supported. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Previously we used separate boolean values to track quiescing and copying actions. By switching to an atomic_t we can support blocks that need a partial copy and partial zero. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
- 11 Jun, 2014 1 commit
-
-
Lukas Czerner authored
DM thinp already checks whether the discard_granularity of the data device is a factor of the thin-pool block size. But when using the dm-thin-pool's discard passdown support, DM thinp was not selecting the max of the underlying data device's discard_granularity and the thin-pool's block size. Update set_discard_limits() to set discard_granularity to the max of these values. This enables blkdev_issue_discard() to properly align the discards that are sent to the DM thin device on a full block boundary. As such each discard will now cover an entire DM thin-pool block and the block will be reclaimed. Reported-by:
Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by:
Lukas Czerner <lczerner@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
-
- 03 Jun, 2014 2 commits
-
-
Mike Snitzer authored
Update the DM thin provisioning target's allocation failure error to be consistent with commit a9d6ceb8 ("[SCSI] return ENOSPC on thin provisioning failure"). The DM thin target now returns -ENOSPC rather than -EIO when block allocation fails due to the pool being out of data space (and the 'error_if_no_space' thin-pool feature is enabled). Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-By:
Joe Thornber <ejt@redhat.com>
-
Joe Thornber authored
Factor out a pool_work interface that noflush_work makes use of to wait for and complete work items (in terms of a proper completion struct). Allows discontinuing the use of a custom completion in terms of atomic_t and wait_event. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
- 20 May, 2014 1 commit
-
-
Mike Snitzer authored
Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") introduced a fixed 60 second timeout. Users may want to either disable or modify this timeout. Allow the out-of-data-space timeout to be configured using the 'no_space_timeout' dm-thin-pool module param. Setting it to 0 will disable the timeout, resulting in IO being queued until more data space is added to the thin-pool. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
-
- 14 May, 2014 2 commits
-
-
Joe Thornber authored
If the pool runs out of data space, dm-thin can be configured to either error IOs that would trigger provisioning, or hold those IOs until the pool is resized. Unfortunately, holding IOs until the pool is resized can result in a cascade of tasks hitting the hung_task_timeout, which may render the system unavailable. Add a fixed timeout so IOs can only be held for a maximum of 60 seconds. If LVM is going to resize a thin-pool that is out of data space it needs to be prompt about it. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
-
Joe Thornber authored
Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced a regression in the metadata commit() method by returning an error if the pool is in PM_OUT_OF_DATA_SPACE mode. This oversight caused a thin device to return errors even if the default queue_if_no_space ENOSPC handling mode is used. Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode. Reported-by: qindehua@163.com Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.14+
-
- 29 Apr, 2014 1 commit
-
-
Mike Snitzer authored
Use INIT_WORK_ONSTACK to silence "ODEBUG: object is on stack, but not annotated". Reported-by:
Zdeněk Kabeláč <zkabelac@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
- 08 Apr, 2014 2 commits
-
-
Joe Thornber authored
Commit c140e1c4 ("dm thin: use per thin device deferred bio lists") introduced the use of an rculist for all active thin devices. The use of rcu_read_lock() in process_deferred_bios() can result in a BUG if a dm_bio_prison_cell must be allocated as a side-effect of bio_detain(): BUG: sleeping function called from invalid context at mm/mempool.c:203 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0 3 locks held by kworker/u8:0/6: #0: ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #1: ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #2: (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0 We can't process deferred bios with the rcu lock held, since dm_bio_prison_cell allocation may block if the bio-prison's cell mempool is exhausted. To fix: - Introduce a refcount and completion field to each thin_c - Add thin_get/put methods for adjusting the refcount. If the refcount hits zero then the completion is triggered. - Initialise refcount to 1 when creating thin_c - When iterating the active_thins list we thin_get() whilst the rcu lock is held. - After the rcu lock is dropped we process the deferred bios for that thin. - When destroying a thin_c we thin_put() and then wait for the completion -- to avoid a race between the worker thread iterating from that thin_c and destroying the thin_c. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Commit c140e1c4 ("dm thin: use per thin device deferred bio lists") incorrectly stopped disabling irqs when taking the pool's spinlock. Irqs must be disabled when taking the pool's spinlock otherwise a thread could spin_lock(), then get interrupted to service thin_endio() in interrupt context, which would then deadlock in spin_lock_irqsave(). Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
- 04 Apr, 2014 1 commit
-
-
Mike Snitzer authored
A thin-pool will allocate blocks using FIFO order for all thin devices which share the thin-pool. Because of this simplistic allocation the thin-pool's space can become fragmented quite easily; especially when multiple threads are requesting blocks in parallel. Sort each thin device's deferred_bio_list based on logical sector to help reduce fragmentation of the thin-pool's ondisk layout. The following tables illustrate the realized gains/potential offered by sorting each thin device's deferred_bio_list. An "io size"-sized random read of the device would result in "seeks/io" fragments being read, with an average "distance/seek" between each fragment. Data was written to a single thin device using multiple threads via iozone (8 threads, 64K for both the block_size and io_size). unsorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.013 11m 64k 0.065 11m 256k 0.274 10m 1m 1.109 10m 4m 4.411 10m 16m 17.097 11m 64m 60.055 13m 256m 148.798 25m 1g 809.929 21m sorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.000 1g 64k 0.001 1g 256k 0.003 1g 1m 0.011 1g 4m 0.045 1g 16m 0.181 1g 64m 0.747 1011m 256m 3.299 1g 1g 14.373 1g Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
- 31 Mar, 2014 2 commits
-
-
Mike Snitzer authored
The thin-pool previously only had a single deferred_bios list that would collect bios for all thin devices in the pool. Split this per-pool deferred_bios list out to per-thin deferred_bios_list -- doing so enables increased parallelism when processing deferred bios. And now that each thin device has it's own deferred_bios_list we can sort all bios in the list using logical sector. The requeue code in error handling path is also cleaner as a side-effect. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
Mike Snitzer authored
The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode. This is more explicit/clear/efficient than inferring whether or not the pool is congested by checking if retry_on_resume_list is empty. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
- 28 Mar, 2014 1 commit
-
-
Mike Snitzer authored
If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
-
- 05 Mar, 2014 4 commits
-
-
Joe Thornber authored
i) by the time DM core calls the postsuspend hook the dm_noflush flag has been cleared. So the old thin_postsuspend did nothing. We need to use the presuspend hook instead. ii) There was a race between bios leaving DM core and arriving in the deferred queue. thin_presuspend now sets a 'requeue' flag causing all bios destined for that thin to be requeued back to DM core. Then it requeues all held IO, and all IO on the deferred queue (destined for that thin). Finally postsuspend clears the 'requeue' flag. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
The spin lock in requeue_io() was held for too long, allowing deadlock. Don't worry, due to other issues addressed in the following "dm thin: fix noflush suspend IO queueing" commit, this code was never called. Fix this by taking the spin lock for a much shorter period of time. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Joe Thornber authored
Ideally a thin pool would never run out of data space; the low water mark would trigger userland to extend the pool before we completely run out of space. However, many small random IOs to unprovisioned space can consume data space at an alarming rate. Adjust your low water mark if you're frequently seeing "out-of-data-space" mode. Before this fix, if data space ran out the pool would be put in PM_READ_ONLY mode which also aborted the pool's current metadata transaction (data loss for any changes in the transaction). This had a side-effect of needlessly compromising data consistency. And retry of queued unserviceable bios, once the data pool was resized, could initiate changes to potentially inconsistent pool metadata. Now when the pool's data space is exhausted transition to a new pool mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data may not be allocated. This allows users to remove thin volumes or discard data to recover data space. The pool is no longer put in PM_READ_ONLY mode in response to the pool running out of data space. And PM_READ_ONLY mode no longer aborts the pool's current metadata transaction. Also, set_pool_mode() will now notify userspace when the pool mode is changed. Signed-off-by:
Joe Thornber <ejt@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com>
-
Mike Snitzer authored
If a thin metadata operation fails the current transaction will abort, whereby causing potential for IO layers up the stack (e.g. filesystems) to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's superblock which: 1) requires the user verify the thin metadata is consistent (e.g. use thin_check, etc) 2) suggests the user verify the thin data is consistent (e.g. use fsck) The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is to run thin_repair. On metadata operation failure: abort current metadata transaction, set pool in read-only mode, and now set the needs_check flag. As part of this change, constraints are introduced or relaxed: * don't allow a pool to transition to write mode if needs_check is set * don't allow data or metadata space to be resized if needs_check is set * if a thin pool's metadata space is exhausted: the kernel will now force the user to take the pool offline for repair before the kernel will allow the metadata space to be extended. Also, update Documentation to include information about when the thin provisioning target commits metadata, how it handles metadata failures and running out of space. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Joe Thornber <ejt@redhat.com>
-
- 04 Mar, 2014 1 commit
-
-
Mike Snitzer authored
Commit b5330655 ("dm thin: handle metadata failures more consistently") increased potential for the pool's mode to be changed in response to metadata operation failures. When the pool mode is changed it isn't synchronized with the mode in pool_features stored in the target's context (ti->private) that is used as the basis for (re)establishing the pool mode during resume via bind_control_target. It is important that we synchronize the pool mode when it is changed otherwise the pool may experience and unexpected mode transition on the next resume (especially if there was no new table load). Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
- 27 Feb, 2014 1 commit
-
-
Mike Snitzer authored
It was always intended that a user could provide a thin metadata device that is larger than the max supported by the on-disk format. The extra space would just go unused. Unfortunately that never worked. If the user attempted to use a larger metadata device on creation they would get an error like the following: device-mapper: space map common: space map too large device-mapper: transaction manager: couldn't create metadata space map device-mapper: thin metadata: tm_create_with_sm failed device-mapper: table: 252:17: thin-pool: Error creating metadata object device-mapper: ioctl: error adding target to table Fix this by allowing the initial metadata space map creation to cap its size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS). get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported). Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Lastly, remove the "excess space will not be used" warning message from get_metadata_dev_size(); it resulted in printing the warning multiple times. Factor out warn_if_metadata_device_too_big(), call it from pool_ctr() and maybe_resize_metadata_dev(). Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com>
-
- 24 Feb, 2014 1 commit
-
-
Mike Snitzer authored
dm_pool_close_thin_device() must be called if dm_set_target_max_io_len() fails in thin_ctr(). Otherwise __pool_destroy() will fail because the pool will still have an open thin device: device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed. Also, must establish error code if failing thin_ctr() because the pool is in fail_io mode. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Acked-by:
Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org
-