1. 17 Jan, 2017 10 commits
  2. 12 Jan, 2017 1 commit
  3. 02 Jan, 2017 2 commits
  4. 25 Dec, 2016 1 commit
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner authored
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
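A minimal user-space sketch of the point made in this message, assuming the time value is stored as a scalar nanosecond count. The names ktime_model_t, ktime_set_model and MODEL_NSEC_PER_SEC are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model only: a time value stored as signed 64-bit nanoseconds. */
typedef int64_t ktime_model_t;

#define MODEL_NSEC_PER_SEC 1000000000LL

/* Combine a seconds part and a nanoseconds part into one scalar value. */
static ktime_model_t ktime_set_model(int64_t secs, unsigned long nsecs)
{
    return secs * MODEL_NSEC_PER_SEC + (int64_t)nsecs;
}

void ktime_example(void)
{
    /* With a real seconds part the helper does useful work. */
    assert(ktime_set_model(2, 5) == 2000000005LL);

    /* With secs == 0 the helper returns the nanosecond value unchanged ... */
    ktime_model_t via_helper = ktime_set_model(0, 500000);
    /* ... so a plain assignment expresses the same thing. */
    ktime_model_t via_assign = 500000;
    assert(via_helper == via_assign);
}
```

Under that assumption, a call with a zero Seconds argument carries no information beyond the nanosecond value, which is why an assignment can replace it.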
  5. 24 Dec, 2016 1 commit
  6. 23 Dec, 2016 1 commit
  7. 19 Dec, 2016 2 commits
    • block: check partition alignment · 633395b6
      Stefan Haberland authored
      Partitions that are not aligned to the blocksize of a device may cause
      invalid I/O requests because the blocklayer cares only about alignment
      within the partition when building requests on partitions.
      For example, take a partition at an offset of 512 bytes on a device with
      a 4k blocksize.
      When reading/writing one 4k block of the partition, this maps to
      reading/writing at an offset of 512 bytes on the device, leading to
      unaligned requests for the device, which in turn may cause unexpected
      behavior of the device driver.
      For DASD devices we have to translate the block number into a cylinder,
      head, record format. The unaligned requests lead to a wrong calculation
      and therefore to misdirected I/O. In a "good" case this leads to I/O
      errors because the underlying hardware detects the wrong addressing.
      In a worst case scenario this might destroy data on the device.
      To prevent partitions that are not aligned to the physical blocksize
      of a device, check for the alignment in blkpg_ioctl().
      Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
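The alignment problem described in this message can be sketched in user space as follows. Both helpers are hypothetical and are not the code added to blkpg_ioctl(); the block size is passed in as a parameter, so the sketch takes no position on which device block size the kernel actually consults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when a partition's start offset (in bytes)
 * is a multiple of the device block size.
 */
static bool partition_start_is_aligned(uint64_t start_bytes, uint32_t block_size)
{
    assert(block_size != 0);
    return start_bytes % block_size == 0;
}

/* Device byte offset of the n-th block counted from the partition start. */
static uint64_t device_offset_of_block(uint64_t start_bytes, uint64_t block_nr,
                                       uint32_t block_size)
{
    return start_bytes + block_nr * block_size;
}

void partition_alignment_example(void)
{
    /* A partition starting at 512 bytes on a 4k device would be rejected. */
    assert(!partition_start_is_aligned(512, 4096));

    /* Every 4k block of that partition is off by 512 bytes on the device. */
    assert(device_offset_of_block(512, 0, 4096) % 4096 == 512);
    assert(device_offset_of_block(512, 7, 4096) % 4096 == 512);

    /* A partition starting at 1 MiB would be accepted. */
    assert(partition_start_is_aligned(1048576, 4096));
}
```

Checking once at partition creation is enough in this model, because an aligned start keeps every block of the partition aligned on the device.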
    • block: allow WRITE_SAME commands with the SG_IO ioctl · 25cdb645
      Mauricio Faria de Oliveira authored
      The WRITE_SAME commands are not present in the blk_default_cmd_filter
      write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
      is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
      [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]
      The problem can be reproduced with the sg_write_same command:
        # sg_write_same --num 1 --xferlen 512 /dev/sda
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
          Write same: pass through os error: Operation not permitted
      For comparison, the WRITE_VERIFY command does not observe this problem,
      since it is in that list:
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
      So, this patch adds the WRITE_SAME commands to the list, in order
      for the SG_IO ioctl to finish successfully:
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
      where QEMU employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).
      In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
      which are translated to write-same calls in the guest kernel, and then into
      SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
        [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
        [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
        [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
        [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
        [...] blk_update_request: I/O error, dev sda, sector 17096824
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks
       (see 'disk' -> 'device')
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com>
      Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
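The filtering behaviour described in this message can be modelled with a per-opcode bitmap. This is a simplified sketch: struct cmd_filter_model and its helpers are hypothetical and only loosely mirror the write_ok list, while the opcode values are the standard SCSI ones (0x41 also appears in the CDB of the log above).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Standard SCSI operation codes. */
#define OP_WRITE_VERIFY_10 0x2e
#define OP_WRITE_SAME_10   0x41
#define OP_WRITE_SAME_16   0x93

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define FILTER_WORDS  ((256 + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical, simplified filter: one bit per opcode allowed for writers. */
struct cmd_filter_model {
    unsigned long write_ok[FILTER_WORDS];
};

static void filter_allow(struct cmd_filter_model *f, unsigned char opcode)
{
    f->write_ok[opcode / BITS_PER_WORD] |= 1UL << (opcode % BITS_PER_WORD);
}

static bool filter_allows(const struct cmd_filter_model *f, unsigned char opcode)
{
    return (f->write_ok[opcode / BITS_PER_WORD] >> (opcode % BITS_PER_WORD)) & 1UL;
}

/* Privileged callers bypass the list; everyone else needs the opcode listed. */
static int verify_command_model(const struct cmd_filter_model *f,
                                unsigned char opcode, bool has_rawio)
{
    if (has_rawio || filter_allows(f, opcode))
        return 0;
    return -EPERM;
}

void cmd_filter_example(void)
{
    struct cmd_filter_model f = { { 0 } };

    /* Before the patch: WRITE VERIFY is listed, WRITE SAME is not. */
    filter_allow(&f, OP_WRITE_VERIFY_10);
    assert(verify_command_model(&f, OP_WRITE_VERIFY_10, false) == 0);
    assert(verify_command_model(&f, OP_WRITE_SAME_10, false) == -EPERM);
    assert(verify_command_model(&f, OP_WRITE_SAME_10, true) == 0);

    /* After the patch: the WRITE SAME opcodes are on the list as well. */
    filter_allow(&f, OP_WRITE_SAME_10);
    filter_allow(&f, OP_WRITE_SAME_16);
    assert(verify_command_model(&f, OP_WRITE_SAME_10, false) == 0);
    assert(verify_command_model(&f, OP_WRITE_SAME_16, false) == 0);
}
```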
  8. 14 Dec, 2016 2 commits
    • blk-mq: Fix failed allocation path when mapping queues · d1b1cea1
      Gabriel Krisman Bertazi authored
      In blk_mq_map_swqueue, there is a memory optimization that frees the
      tags of a queue that has gone unmapped.  Later, if that hctx is remapped
      after another topology change, the tags need to be reallocated.
      If this allocation fails, a simple WARN_ON triggers, but the block layer
      ends up with an active hctx without any corresponding set of tags.
      Then, any incoming IO to that hctx can trigger an Oops.
      I can reproduce it consistently by running IO, flipping CPUs on and off
      and eventually injecting a memory allocation failure in that path.
      In the fix below, if the system experiences a failed allocation of any
      hctx's tags, we remap all the ctxs of that queue to the hctx_0, which
      should always keep its tags.  There is a minor performance hit, since
      our mapping just got worse after the error path, but this is
      the simplest solution to handle this error path.  The performance hit
      will disappear after another successful remap.
      I considered dropping the memory optimization altogether, but it
      seemed a bad trade-off to handle this very specific error case.
      This should apply cleanly on top of Jens' for-next branch.
      The Oops is the one below:
      SP (3fff935ce4d0) is in userspace
      1:mon> e
      cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110]
          pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180
          lr: c000000000575328: __bt_get+0x48/0xd0
          sp: c000000fe99eb390
         msr: 900000010280b033
         dar: 28
       dsisr: 40000000
        current = 0xc000000fe9966800
        paca    = 0xc000000007e80300   softe: 0        irq_happened: 0x01
          pid   = 11035, comm = aio-stress
      Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609
      (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016
      1:mon> s
      [c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0
      [c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0
      [c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100
      [c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220
      [c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0
      [c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540
      [c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230
      [c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200
      [c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0
      [c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0
      [c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0
      [c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180
      [c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0
      [c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540
      [c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130
      [c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430
      [c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450
      [c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730
      [c000000fe99ebe30] c000000000009404 system_call+0x38/0xec
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Reviewed-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
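The fallback described in this message can be sketched with a small user-space model. struct mq_model and remap_model are hypothetical and do not match the kernel structures. The sketch reads "all the ctxs of that queue" as every software queue whose hardware queue failed to get tags, which is an interpretation of the message rather than a quote of the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CTX  4   /* software queues in this model */
#define NR_HCTX 2   /* hardware queues in this model */

/* Hypothetical model: which hctx owns tags, and where each ctx is mapped. */
struct mq_model {
    bool hctx_has_tags[NR_HCTX];
    int  ctx_to_hctx[NR_CTX];
};

/*
 * 'wanted' is the preferred mapping; 'alloc_ok[h]' says whether allocating
 * tags for hctx h succeeds. hctx 0 is assumed to always keep its tags.
 */
static void remap_model(struct mq_model *q, const int wanted[NR_CTX],
                        const bool alloc_ok[NR_HCTX])
{
    for (size_t c = 0; c < NR_CTX; c++) {
        int h = wanted[c];

        /* A previously unmapped hctx had its tags freed: try to get them back. */
        if (!q->hctx_has_tags[h] && alloc_ok[h])
            q->hctx_has_tags[h] = true;

        /* Never leave a ctx pointing at an hctx without tags. */
        if (!q->hctx_has_tags[h])
            h = 0;

        q->ctx_to_hctx[c] = h;
    }
}

void remap_example(void)
{
    struct mq_model q = { { true, false }, { 0, 0, 0, 0 } };
    const int wanted[NR_CTX] = { 0, 0, 1, 1 };
    const bool fail_hctx1[NR_HCTX] = { true, false };
    const bool all_ok[NR_HCTX] = { true, true };

    /* Allocation for hctx 1 fails: its ctxs land on hctx 0 instead. */
    remap_model(&q, wanted, fail_hctx1);
    for (size_t c = 0; c < NR_CTX; c++)
        assert(q.ctx_to_hctx[c] == 0);

    /* A later successful remap restores the preferred mapping. */
    remap_model(&q, wanted, all_ok);
    assert(q.ctx_to_hctx[2] == 1 && q.ctx_to_hctx[3] == 1);
    assert(q.hctx_has_tags[1]);
}
```

The second call in the example corresponds to the remark that the performance hit disappears after another successful remap.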
    • blk-mq: Avoid memory reclaim when remapping queues · 36e1f3d1
      Gabriel Krisman Bertazi authored
      While stressing memory and IO and changing SMT settings at the same time,
      we were able to consistently trigger deadlocks in the mm system, which
      froze the entire machine.
      I think that under memory stress conditions, the large allocations
      performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
      waiting on the block layer remapping completion, thus deadlocking the
      system.  The trace below was collected after the machine stalled,
      waiting for the hotplug event completion.
      The simplest fix for this is to make allocations in this path
      non-reclaimable, with GFP_NOIO.  With this patch, we couldn't hit the
      issue anymore.
      This should apply on top of Jens's for-next branch cleanly.
      Changes since v1:
        - Use GFP_NOIO instead of GFP_NOWAIT.
       Call Trace:
      [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
      [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
      [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
      [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
      [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
      [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
      [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
      [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
      [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
      [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
      [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
      [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
      [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
      [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
      [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
      [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
      [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
      [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
      [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
      [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
      [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
      [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
      [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
      [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
      [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
      [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
      [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
      [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
      [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
      [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
      [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
      [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
      [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
      [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
      [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
      [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
      [c000000f0160be30] [c000000000009204] system_call+0x38/0xec
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Douglas Miller <dougmill@linux.vnet.ibm.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
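A rough model of how the allocation flags named in this message differ. The bit values and the M_ names are illustrative only and are not the kernel encoding; the compositions follow the kernel's documented meaning of GFP_KERNEL, GFP_NOIO and GFP_NOWAIT. The predicate is a simplification that assumes reclaim can only wait on block I/O or filesystem locks when the caller allowed I/O or filesystem activity.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real flag encoding differs. */
#define M_DIRECT_RECLAIM 0x1u  /* caller may enter direct reclaim */
#define M_KSWAPD_RECLAIM 0x2u  /* caller may wake background reclaim */
#define M_IO             0x4u  /* reclaim may start physical I/O */
#define M_FS             0x8u  /* reclaim may call into filesystem code */

#define M_NOWAIT (M_KSWAPD_RECLAIM)
#define M_NOIO   (M_DIRECT_RECLAIM | M_KSWAPD_RECLAIM)
#define M_KERNEL (M_NOIO | M_IO | M_FS)

/* May the caller block in direct reclaim at all? */
static bool may_direct_reclaim(unsigned int flags)
{
    return (flags & M_DIRECT_RECLAIM) != 0;
}

/* Simplified: may reclaim on behalf of this caller start I/O or enter fs code? */
static bool may_recurse_into_io(unsigned int flags)
{
    return may_direct_reclaim(flags) && (flags & (M_IO | M_FS)) != 0;
}

void gfp_model_example(void)
{
    /* A default allocation may reclaim through the filesystem, as in the trace. */
    assert(may_recurse_into_io(M_KERNEL));

    /* The v1 choice never enters direct reclaim. */
    assert(!may_direct_reclaim(M_NOWAIT));

    /* The final choice may still reclaim, but without I/O or fs callbacks. */
    assert(may_direct_reclaim(M_NOIO));
    assert(!may_recurse_into_io(M_NOIO));
}
```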
  9. 13 Dec, 2016 1 commit
    • mm: don't cap request size based on read-ahead setting · 9491ae4a
      Jens Axboe authored
      We ran into a funky issue, where someone doing 256K buffered reads saw
      128K requests at the device level.  Turns out it is read-ahead capping
      the request size, since we use 128K as the default setting.  This
      doesn't make a lot of sense - if someone is issuing 256K reads, they
      should see 256K reads, regardless of the read-ahead setting, if the
      underlying device can support a 256K read in a single command.
      This patch introduces a bdi hint, io_pages.  This is the soft max IO
      size for the lower level; I've hooked it up to the bdev settings here.
      Read-ahead is modified to issue the maximum of the user request size,
      and the read-ahead max size, but capped to the max request size on the
      device side.  The latter is done to avoid reading ahead too much, if the
      application asks for a huge read.  With this patch, the kernel behaves
      like the application expects.
      Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
      Signed-off-by: Jens Axb...
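The sizing rule described in this message can be sketched as a small function. ra_window_pages is a hypothetical name, and the sketch makes one assumption the message leaves open: the read-ahead setting is kept as a floor, and only a larger user request can grow the window, up to the device's soft limit (io_pages).

```c
#include <assert.h>

/*
 * Hypothetical helper: pick the window size in pages.
 *   req_pages: size of the user request
 *   ra_pages:  configured read-ahead maximum
 *   io_pages:  soft maximum IO size of the underlying device
 */
static unsigned long ra_window_pages(unsigned long req_pages,
                                     unsigned long ra_pages,
                                     unsigned long io_pages)
{
    /* Cap the user request at what the device handles in one command. */
    unsigned long capped_req = req_pages < io_pages ? req_pages : io_pages;

    /* Use the larger of the capped request and the read-ahead setting. */
    return capped_req > ra_pages ? capped_req : ra_pages;
}

void ra_window_example(void)
{
    /* 4k pages assumed: 256K read = 64 pages, 128K read-ahead = 32 pages. */
    /* The device limit of 320 pages is an arbitrary value for illustration. */
    assert(ra_window_pages(64, 32, 320) == 64);

    /* A huge read is capped at the device limit. */
    assert(ra_window_pages(4096, 32, 320) == 320);

    /* A small read still gets the configured read-ahead window. */
    assert(ra_window_pages(8, 32, 320) == 32);
}
```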
  10. 09 Dec, 2016 7 commits
  11. 07 Dec, 2016 1 commit
  12. 05 Dec, 2016 2 commits
    • Jens Axboe's avatar
    • block: fix unintended fallthrough in generic_make_request_checks() · 58886785
      Nicolai Stange authored
      Since commit e73c23ff ("block: add async variant of
      blkdev_issue_zeroout") messages like the following show up:
        EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
                        logical offset 0 with max blocks 1 with error 95
        EXT4-fs (dm-1): This should not happen!! Data will be lost
      Due to the following fallthrough introduced with
      commit 2d253440 ("block: Define zoned block device operations"),
      generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
      if the block device supports "write same" *and* is a zoned one:
        switch (bio_op(bio)) {
        case REQ_OP_WRITE_SAME:
              if (!bdev_write_same(bio->bi_bdev))
                      goto not_supported;
        case REQ_OP_ZONE_REPORT:
        case REQ_OP_ZONE_RESET:
                      if (!bdev_is_zoned(bio->bi_bdev))
                              goto not_supported;
      Thus, although the bio setup as done by __blkdev_issue_write_same() from
      commit e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      would succeed, its actual submission would not, resulting in the
      EOPNOTSUPP == 95.
      Fix this by removing the fallthrough which, due to the lack of an explicit
      comment, seems to be unintended anyway.
      Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
      Fixes: 2d253440 ("block: Define zoned block device operations")
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
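The effect of the fallthrough can be reproduced in a user-space model. The enum, struct bdev_model and both functions are hypothetical stand-ins for the kernel code quoted in the message; only the control flow is meant to match.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum op_model { OP_WRITE_SAME, OP_ZONE_REPORT, OP_ZONE_RESET, OP_OTHER };

/* Hypothetical device description: supported features only. */
struct bdev_model {
    bool write_same;
    bool zoned;
};

/* Shape of the bug: OP_WRITE_SAME continues into the zoned check. */
static int checks_with_fallthrough(enum op_model op, const struct bdev_model *b)
{
    switch (op) {
    case OP_WRITE_SAME:
        if (!b->write_same)
            return -EOPNOTSUPP;
        /* fall through - the missing break is the bug */
    case OP_ZONE_REPORT:
    case OP_ZONE_RESET:
        if (!b->zoned)
            return -EOPNOTSUPP;
        break;
    default:
        break;
    }
    return 0;
}

/* Shape of the fix: each group of cases ends with its own break. */
static int checks_fixed(enum op_model op, const struct bdev_model *b)
{
    switch (op) {
    case OP_WRITE_SAME:
        if (!b->write_same)
            return -EOPNOTSUPP;
        break;
    case OP_ZONE_REPORT:
    case OP_ZONE_RESET:
        if (!b->zoned)
            return -EOPNOTSUPP;
        break;
    default:
        break;
    }
    return 0;
}

void fallthrough_example(void)
{
    /* A regular (non-zoned) disk that supports write same. */
    const struct bdev_model disk = { true, false };

    assert(checks_with_fallthrough(OP_WRITE_SAME, &disk) == -EOPNOTSUPP);
    assert(checks_fixed(OP_WRITE_SAME, &disk) == 0);

    /* Zone operations are still rejected on a non-zoned disk. */
    assert(checks_fixed(OP_ZONE_RESET, &disk) == -EOPNOTSUPP);
}
```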
  13. 03 Dec, 2016 1 commit
  14. 01 Dec, 2016 4 commits
    • block: factor out req_set_nomerge · e0c72300
      Ritesh Harjani authored
      Factor out the common code for setting the REQ_NOMERGE flag, which is
      used in several places, into a helper, req_set_nomerge().
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
      Get rid of the inline.
      Signed-off-by: Jens Axboe <axboe@fb.com>
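A sketch of what such a helper might look like in a user-space model. The message only states that the helper sets REQ_NOMERGE; clearing the queue's cached merge candidate is an assumption of this sketch, and all type and macro names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_REQ_NOMERGE (1u << 0)  /* illustrative bit, not the kernel value */

struct request_model {
    unsigned int cmd_flags;
};

struct rq_queue_model {
    struct request_model *last_merge;  /* cached merge candidate (assumed) */
};

/* Mark a request as non-mergeable and stop offering it as a merge hint. */
static void req_set_nomerge_model(struct rq_queue_model *q,
                                  struct request_model *req)
{
    req->cmd_flags |= MODEL_REQ_NOMERGE;
    if (req == q->last_merge)
        q->last_merge = NULL;
}

void nomerge_example(void)
{
    struct request_model a = { 0 }, b = { 0 };
    struct rq_queue_model q = { &a };

    /* Flagging an unrelated request leaves the cached candidate alone. */
    req_set_nomerge_model(&q, &b);
    assert((b.cmd_flags & MODEL_REQ_NOMERGE) != 0);
    assert(q.last_merge == &a);

    /* Flagging the cached candidate also drops it from the queue. */
    req_set_nomerge_model(&q, &a);
    assert((a.cmd_flags & MODEL_REQ_NOMERGE) != 0);
    assert(q.last_merge == NULL);
}
```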
    • block: add support for REQ_OP_WRITE_ZEROES · a6f0788e
      Chaitanya Kulkarni authored
      This adds a new block layer operation to zero out a range of
      LBAs. This allows implementing zeroing for devices that don't use
      either discard with a predictable zero pattern or WRITE SAME of zeroes.
      The prominent example of that is NVMe with the Write Zeroes command,
      but in the future, this should also help with improving the way
      zeroing discards work. For this operation, a suitable entry is exported
      in sysfs, which indicates the maximum number of bytes allowed in one
      write zeroes operation by the device.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
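How a caller might use the per-operation byte limit mentioned in this message can be sketched as follows. write_zeroes_op_count is a hypothetical helper, and treating a limit of 0 as "operation not supported" is an assumption of this sketch, not something the message states.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of write zeroes operations needed to cover
 * len_bytes when one operation may cover at most max_bytes.
 * Returns -1 when max_bytes is 0 (assumed here to mean "not supported").
 */
static int64_t write_zeroes_op_count(uint64_t len_bytes, uint64_t max_bytes)
{
    if (max_bytes == 0)
        return -1;

    /* Full chunks plus one partial chunk for any remainder. */
    return (int64_t)(len_bytes / max_bytes) + (len_bytes % max_bytes != 0);
}

void write_zeroes_example(void)
{
    /* 10 MiB with a 4 MiB limit: two full operations and one partial one. */
    assert(write_zeroes_op_count(10u << 20, 4u << 20) == 3);

    /* An exact multiple needs no extra operation. */
    assert(write_zeroes_op_count(8u << 20, 4u << 20) == 2);

    /* No limit exported: the caller has to fall back to another method. */
    assert(write_zeroes_op_count(4096, 0) == -1);
}
```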
    • block: add async variant of blkdev_issue_zeroout · e73c23ff
      Chaitanya Kulkarni authored
      Similar to __blkdev_issue_discard this variant allows submitting
      the final bio asynchronously and chaining multiple ranges
      into a single completion.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Check partition alignment on zoned block devices · b02d8aae
      Damien Le Moal authored
      Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
      a zoned block device. However, the first and last zones reported for a
      partition make sense only if the partition start sector and size are aligned
      on the device zone size. The same applies for zone reset. Resetting the first
      or the last zone of a partition straddling zones may impact neighboring
      partitions. Finally, if a partition start sector is not at the beginning of a
      sequential zone, it will be impossible to write to the first sectors of the
      partition on a host-managed device.
      Avoid all these problems and incoherencies by ignoring partitions that are not
      zone aligned.
      Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
      correct disk zoning type (host-aware, host-managed or none) but
      bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
      size is unknown). So test this as a way to ensure that a zoned block device is
      being handled as such. As a result, for host-aware devices, unaligned zone
      partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
      disk will be treated as a regular block device (as it should). If zoned block
      device support is enabled, only aligned partitions will be accepted.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
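The acceptance rule described in this message can be sketched as a predicate over sector counts. zoned_partition_ok is a hypothetical helper: a zone size of 0 stands for "zone size unknown", as when zoned support is compiled out, and the partition is then treated like one on a regular block device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a partition only if its start sector and its
 * size are both multiples of the zone size (all values in sectors).
 */
static bool zoned_partition_ok(uint64_t start, uint64_t nr_sects,
                               uint64_t zone_sectors)
{
    /* Zone size unknown: handle the disk as a regular block device. */
    if (zone_sectors == 0)
        return true;

    return start % zone_sectors == 0 && nr_sects % zone_sectors == 0;
}

void zoned_partition_example(void)
{
    /* 524288 sectors (256 MiB of 512-byte sectors) is an illustrative zone size. */
    const uint64_t zone = 524288;

    /* Starts on a zone boundary and spans whole zones: accepted. */
    assert(zoned_partition_ok(2 * zone, 10 * zone, zone));

    /* Starts in the middle of a zone: ignored. */
    assert(!zoned_partition_ok(2048, 10 * zone, zone));

    /* Ends in the middle of a zone: ignored. */
    assert(!zoned_partition_ok(2 * zone, 10 * zone + 8, zone));

    /* Zone size reported as 0: the unaligned partition is accepted. */
    assert(zoned_partition_ok(2048, 1000000, 0));
}
```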
  15. 30 Nov, 2016 1 commit
    • block: add bio_iov_iter_get_pages() · 38161995
      Kent Overstreet authored
      This is a helper that pins down a range from an iov_iter and adds it to
      a bio without requiring a separate memory allocation for the page array.
      It will be used for upcoming direct I/O implementations for block devices
      and iomap based file systems.
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [hch: ported to the iov_iter interface, renamed and added comments.
            All blame should be directed to me and all fame should go to Kent
            after this!]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      (cherry picked from commit 9cd56d916aa481ce8f56d9c5302a6ed90c2e0b5f)
  16. 29 Nov, 2016 1 commit
  17. 28 Nov, 2016 2 commits