1. 22 Sep, 2016 1 commit
  2. 10 Jul, 2016 1 commit
  3. 04 Jul, 2016 2 commits
    • Vegard Nossum's avatar
      ext4: don't call ext4_should_journal_data() on the journal inode · 6a7fd522
      Vegard Nossum authored
      If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
      to call ext4_should_journal_data() before superblock options and flags
      are fully set up.  In that case, the iput() on the journal inode can
      end up causing a BUG().
      
      Work around this problem by reordering the tests so we only call
      ext4_should_journal_data() after we know it's not the journal inode.
      
      Fixes: 2d859db3 ("ext4: fix data corruption in inodes with journalled data")
      Fixes: 2b405bfa
      
       ("ext4: fix data=journal fast mount/umount hang")
      Cc: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      6a7fd522
    • Jan Kara's avatar
      ext4: fix deadlock during page writeback · 646caa9c
      Jan Kara authored
      Commit 06bd3c36
      
       (ext4: fix data exposure after a crash) uncovered a
      deadlock in ext4_writepages() which was previously much harder to hit.
      After this commit xfstest generic/130 reproduces the deadlock on small
      filesystems.
      
      The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
      and marks current inode handle as synchronous. That subsequently results
      in ext4_journal_stop() called from ext4_writepages() to block waiting for
      transaction commit while still holding page locks, reference to io_end,
      and some prepared bio in mpd structure each of which can possibly block
      transaction commit from completing and thus results in deadlock.
      
      Fix the problem by releasing page locks, io_end reference, and
      submitting prepared bio before calling ext4_journal_stop().
      
      [ Changed to defer the call to ext4_journal_stop() only if the handle
        is synchronous.  --tytso ]
      Reported-and-tested-by: default avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      646caa9c
  4. 03 Jul, 2016 1 commit
  5. 07 Jun, 2016 2 commits
  6. 13 May, 2016 3 commits
    • Jan Kara's avatar
      ext4: pre-zero allocated blocks for DAX IO · 12735f88
      Jan Kara authored
      
      
      Currently ext4 treats DAX IO the same way as direct IO. I.e., it
      allocates unwritten extents before IO is done and converts unwritten
      extents afterwards. However this way DAX IO can race with page fault to
      the same area:
      
      ext4_ext_direct_IO()				dax_fault()
        dax_io()
          get_block() - allocates unwritten extent
          copy_from_iter_pmem()
      						  get_block() - converts
      						    unwritten block to
      						    written and zeroes it
      						    out
        ext4_convert_unwritten_extents()
      
      So data written with DAX IO gets lost. Similarly dax_new_buf() called
      from dax_io() can overwrite data that has been already written to the
      block via mmap.
      
      Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
      use them for DAX mmap. The downside of this solution is that every
      allocating write writes each block twice (once zeros, once data). Fixing
      the race with locking is possible as well however we would need to
      lock-out faults for the whole range written to by DAX IO. And that is
      not easy to do without locking-out faults for the whole file which seems
      too aggressive.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      12735f88
    • Jan Kara's avatar
      ext4: refactor direct IO code · 914f82a3
      Jan Kara authored
      
      
      Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
      and ext4_ind_direct_IO(). However the extent based function calls into
      the indirect based one for some cases and for example it is not able to
      handle file extending. Previously it was not also properly handling
      retries in case of ENOSPC errors. With DAX things would get even more
      contrieved so just refactor the direct IO code and instead of indirect /
      extent split do the split to read vs writes.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      914f82a3
    • Jan Kara's avatar
      ext4: handle transient ENOSPC properly for DAX · 7cb476f8
      Jan Kara authored
      
      
      ext4_dax_get_blocks() was accidentally omitted fixing get blocks
      handlers to properly handle transient ENOSPC errors. Fix it now to use
      ext4_get_blocks_trans() helper which takes care of these errors.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      7cb476f8
  7. 01 May, 2016 1 commit
  8. 26 Apr, 2016 2 commits
    • Daeho Jeong's avatar
      ext4: fix races between changing inode journal mode and ext4_writepages · c8585c6f
      Daeho Jeong authored
      
      
      In ext4, there is a race condition between changing inode journal mode
      and ext4_writepages(). While ext4_writepages() is executed on a
      non-journalled mode inode, the inode's journal mode could be enabled
      by ioctl() and then, some pages dirtied after switching the journal
      mode will be still exposed to ext4_writepages() in non-journaled mode.
      To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
      Kara's suggestion because we don't want to waste ext4_inode_info's
      space for this extra rare case.
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      c8585c6f
    • Daeho Jeong's avatar
      ext4: handle unwritten or delalloc buffers before enabling data journaling · 4c546592
      Daeho Jeong authored
      
      
      We already allocate delalloc blocks before changing the inode mode into
      "per-file data journal" mode to prevent delalloc blocks from remaining
      not allocated, but another issue concerned with "BH_Unwritten" status
      still exists. For example, by fallocate(), several buffers' status
      change into "BH_Unwritten", but these buffers cannot be processed by
      ext4_alloc_da_blocks(). So, they still remain in unwritten status after
      per-file data journaling is enabled and they cannot be changed into
      written status any more and, if they are journaled and eventually
      checkpointed, these unwritten buffer will cause a kernel panic by the
      below BUG_ON() function of submit_bh_wbc() when they are submitted
      during checkpointing.
      
      static int submit_bh_wbc(int rw, struct buffer_head *bh,...
      {
              ...
              BUG_ON(buffer_unwritten(bh));
      
      Moreover, when "dioread_nolock" option is enabled, the status of a
      buffer is changed into "BH_Unwritten" after write_begin() completes and
      the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
      if a buffer's status is changed into unwrutten but the buffer's I/O is
      not submitted and completed, it can cause the same problem after
      enabling per-file data journaling. You can easily generate this bug by
      executing the following command.
      
      ./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269
      
      To resolve these problems and define a boundary between the previous
      mode and per-file data journaling mode, we need to flush and wait all
      the I/O of buffers of a file before enabling per-file data journaling
      of the file.
      Signed-off-by: default avatarDaeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      4c546592
  9. 24 Apr, 2016 3 commits
    • Jan Kara's avatar
      ext4: do not ask jbd2 to write data for delalloc buffers · ee0876bc
      Jan Kara authored
      
      
      Currently we ask jbd2 to write all dirty allocated buffers before
      committing a transaction when doing writeback of delay allocated blocks.
      However this is unnecessary since we move all pages to writeback state
      before dropping a transaction handle and then submit all the necessary
      IO. We still need the transaction commit to wait for all the outstanding
      writeback before flushing disk caches during transaction commit to avoid
      data exposure issues though. Use the new jbd2 capability and ask it to
      only wait for outstanding writeback during transaction commit when
      writing back data in ext4_writepages().
      Tested-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ee0876bc
    • Jan Kara's avatar
      ext4: remove EXT4_STATE_ORDERED_MODE · 3957ef53
      Jan Kara authored
      
      
      This flag is just duplicating what ext4_should_order_data() tells you
      and is used in a single place. Furthermore it doesn't reflect changes to
      inode data journalling flag so it may be possibly misleading. Just
      remove it.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      3957ef53
    • Jan Kara's avatar
      ext4: fix data exposure after a crash · 06bd3c36
      Jan Kara authored
      Huang has reported that in his powerfail testing he is seeing stale
      block contents in some of recently allocated blocks although he mounts
      ext4 in data=ordered mode. After some investigation I have found out
      that indeed when delayed allocation is used, we don't add inode to
      transaction's list of inodes needing flushing before commit. Originally
      we were doing that but commit f3b59291 removed the logic with a
      flawed argument that it is not needed.
      
      The problem is that although for delayed allocated blocks we write their
      contents immediately after allocating them, there is no guarantee that
      the IO scheduler or device doesn't reorder things and thus transaction
      allocating blocks and attaching them to inode can reach stable storage
      before actual block contents. Actually whenever we attach freshly
      allocated blocks to inode using a written extent, we should add inode to
      transaction's ordered inode list to make sure we properly wait for block
      contents to be written before committing the transaction. So that is
      what we do in this patch. This also handles other cases where stale data
      exposure was possible - like filling hole via mmap in
      data=ordered,nodelalloc mode.
      
      The only exception to the above rule are extending direct IO writes where
      blkdev_direct_IO() waits for IO to complete before increasing i_size and
      thus stale data exposure is not possible. For now we don't complicate
      the code with optimizing this special case since the overhead is pretty
      low. In case this is observed to be a performance problem we can always
      handle it using a special flag to ext4_map_blocks().
      
      CC: stable@vger.kernel.org
      Fixes: f3b59291
      
      Reported-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Tested-by: default avatar"HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      06bd3c36
  10. 04 Apr, 2016 2 commits
    • Kirill A. Shutemov's avatar
      mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Kirill A. Shutemov authored
      
      
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea1754a0
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      
      
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  11. 01 Apr, 2016 1 commit
    • Jan Kara's avatar
      ext4: retry block allocation for failed DIO and DAX writes · e84dfbe2
      Jan Kara authored
      
      
      Currently if block allocation for DIO or DAX write fails due to ENOSPC,
      we just returned it to userspace. However these ENOSPC errors can be
      transient because the transaction freeing blocks has not yet committed.
      This demonstrates as failures of generic/102 test when the filesystem is
      mounted with 'dax' mount option.
      
      Fix the problem by properly retrying the allocation in case of ENOSPC
      error in get blocks functions used for direct IO.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>                                                                               
      e84dfbe2
  12. 13 Mar, 2016 1 commit
    • Eryu Guan's avatar
      ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() · 5e1021f2
      Eryu Guan authored
      ext4_reserve_inode_write() in ext4_mark_inode_dirty() could fail on
      error (e.g. EIO) and iloc.bh can be NULL in this case. But the error is
      ignored in the following "if" condition and ext4_expand_extra_isize()
      might be called with NULL iloc.bh set, which triggers NULL pointer
      dereference.
      
      This is uncovered by commit 8b4953e1
      
       ("ext4: reserve code points for
      the project quota feature"), which enlarges the ext4_inode size, and
      run the following script on new kernel but with old mke2fs:
      
        #/bin/bash
        mnt=/mnt/ext4
        devname=ext4-error
        dev=/dev/mapper/$devname
        fsimg=/home/fs.img
      
        trap cleanup 0 1 2 3 9 15
      
        cleanup()
        {
                umount $mnt >/dev/null 2>&1
                dmsetup remove $devname
                losetup -d $backend_dev
                rm -f $fsimg
                exit 0
        }
      
        rm -f $fsimg
        fallocate -l 1g $fsimg
        backend_dev=`losetup -f --show $fsimg`
        devsize=`blockdev --getsz $backend_dev`
      
        good_tab="0 $devsize linear $backend_dev 0"
        error_tab="0 $devsize error $backend_dev 0"
      
        dmsetup create $devname --table "$good_tab"
      
        mkfs -t ext4 $dev
        mount -t ext4 -o errors=continue,strictatime $dev $mnt
      
        dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
        echo 3 > /proc/sys/vm/drop_caches
        ls -l $mnt
        exit 0
      
      [ Patch changed to simplify the function a tiny bit. -- Ted ]
      Signed-off-by: default avatarEryu Guan <guaneryu@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      5e1021f2
  13. 10 Mar, 2016 3 commits
    • Jan Kara's avatar
      ext4: more efficient SEEK_DATA implementation · 2d90c160
      Jan Kara authored
      
      
      Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
      ext4_seek_data() iterates hole block-by-block. Fix the problem by using
      returned hole size from ext4_map_blocks() and thus skip the hole in one
      go.
      
      Update also SEEK_HOLE implementation to follow the same pattern as
      SEEK_DATA to make future maintenance easier.
      
      Furthermore we add cond_resched() to both ext4_seek_data() and
      ext4_seek_hole() to avoid softlockups in case evil user creates huge
      fragmented file and we have to go through lots of extents.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2d90c160
    • Jan Kara's avatar
      ext4: cleanup handling of bh->b_state in DAX mmap · e3fb8eb1
      Jan Kara authored
      
      
      ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
      ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code
      always passes on-stack buffer_head but clean this up to make code more
      uniform.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      e3fb8eb1
    • Jan Kara's avatar
      ext4: return hole from ext4_map_blocks() · facab4d9
      Jan Kara authored
      
      
      Currently, ext4_map_blocks() just returns 0 when it finds a hole and
      allocation is not requested. However we have all the information
      available to tell how large the hole actually is and there are callers
      of ext4_map_blocks() which would save some block-by-block hole iteration
      if they knew this information. So fill in struct ext4_map_blocks even
      for holes with the information we have. We keep returning 0 for holes to
      maintain backward compatibility of the function.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      facab4d9
  14. 09 Mar, 2016 4 commits
    • Jan Kara's avatar
      ext4: remove i_ioend_count · 600be30a
      Jan Kara authored
      
      
      Remove counter of pending io ends as it is unused.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      600be30a
    • Jan Kara's avatar
      ext4: simplify io_end handling for AIO DIO · 109811c2
      Jan Kara authored
      
      
      When mapping blocks for direct IO, we allocate io_end structure before
      mapping blocks and store pointer to it in the inode. This creates a
      requirement that any AIO DIO using io_end must be protected by i_mutex.
      This created problems in the past with dioread_nolock mode which was
      corrupting io_end pointers. Also io_end is allocated unnecessarily in
      case where we don't need to convert any extents (which is a common case
      for example when overwriting file).
      
      We fix the problem by allocating io_end only once we return unwritten
      extent from block mapping function for AIO DIO (so we can save some
      pointless io_end allocations) and we pass pointer to it in bh->b_private
      which generic DIO code later passes to our end IO callback. That way we
      remove any need for global pointer to io_end structure and thus fix the
      races.
      
      The downside of this change is that the checking for unwritten IO in
      flight in ext4_extents_can_be_merged() is more racy since we now
      increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
      i_data_sem. However the check has been racy already before because
      ext4_writepages() already increment i_unwritten after dropping
      i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
      case.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      109811c2
    • Jan Kara's avatar
      ext4: move trans handling and completion deferal out of _ext4_get_block · efe70c29
      Jan Kara authored
      
      
      There is no need to handle starting of a transaction and deferal of DIO
      completion in _ext4_get_block() function. We can move this out to get
      block functions for direct IO that need it. That way we can add stricter
      checks verifying things work as we expect.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      efe70c29
    • Jan Kara's avatar
      ext4: rename and split get blocks functions · 705965bd
      Jan Kara authored
      
      
      Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
      describe what it does. Also split out get blocks functions for direct
      IO. Later we move functionality from _ext4_get_blocks() there. There's no
      functional change in this patch.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      705965bd
  15. 28 Feb, 2016 1 commit
  16. 27 Feb, 2016 2 commits
    • Ross Zwisler's avatar
      dax: move writeback calls into the filesystems · 7f6d5b52
      Ross Zwisler authored
      
      
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f6d5b52
    • Ross Zwisler's avatar
      ext2, ext4: only set S_DAX for regular inodes · 0a6cf913
      Ross Zwisler authored
      
      
      When S_DAX is set on an inode we assume that if there are pages attached
      to the mapping (mapping->nrpages != 0), those pages are clean zero pages
      that were used to service reads from holes.  Any dirty data associated
      with the inode should be in the form of DAX exceptional entries
      (mapping->nrexceptional) that is written back via
      dax_writeback_mapping_range().
      
      With the current code, though, this isn't always true.  For example,
      ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
      data stored as dirty page cache entries.  For these types of inodes,
      having S_DAX set doesn't really make sense since their I/O doesn't
      actually happen through the DAX code path.
      
      Instead, only allow S_DAX to be set for regular inodes for ext2 and
      ext4.  This allows us to have strict DAX vs non-DAX paths in the
      writeback code.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a6cf913
  17. 19 Feb, 2016 2 commits
    • Jan Kara's avatar
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara authored
      
      
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      74dae427
    • Jan Kara's avatar
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara authored
      
      
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: default avatarNikolay Borisov <kernel@kyup.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ed8ad838
  18. 08 Feb, 2016 1 commit
  19. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      
      
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  20. 08 Jan, 2016 1 commit
  21. 09 Dec, 2015 1 commit
    • Al Viro's avatar
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro authored
      
      
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  22. 07 Dec, 2015 4 commits
    • Jan Kara's avatar
      ext4: use pre-zeroed blocks for DAX page faults · ba5843f5
      Jan Kara authored
      
      
      Make DAX fault path use pre-zeroed blocks to avoid races with extent
      conversion and zeroing when two page faults to the same block happen.
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      ba5843f5
    • Jan Kara's avatar
      ext4: implement allocation of pre-zeroed blocks · c86d8db3
      Jan Kara authored
      
      
      DAX page fault path needs to get blocks that are pre-zeroed to avoid
      races when two concurrent page faults happen in the same block of a
      file. Implement support for this in ext4_map_blocks().
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      c86d8db3
    • Jan Kara's avatar
      ext4: provide ext4_issue_zeroout() · 53085fac
      Jan Kara authored
      
      
      Create new function ext4_issue_zeroout() to zeroout contiguous (both
      logically and physically) part of inode data. We will need to issue
      zeroout when extent structure is not readily available and this function
      will allow us to do it without making up fake extent structures.
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      53085fac
    • Jan Kara's avatar
      ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag · 2dcba478
      Jan Kara authored
      
      
      When dioread_nolock mode is enabled, we grab i_data_sem in
      ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
      not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
      holding i_data_sem over overwrite direct IO isn't needed these days. We
      have exclusion against truncate / hole punching because we increase
      i_dio_count under i_mutex in ext4_ext_direct_IO() so once
      ext4_file_write_iter() verifies blocks are allocated & written, they are
      guaranteed to stay so during the whole direct IO even after we drop
      i_mutex.
      
      So we can just remove this locking abuse and the no longer necessary
      EXT4_GET_BLOCKS_NO_LOCK flag.
      Signed-off-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      2dcba478