1. 15 Mar, 2011 1 commit
  2. 12 Feb, 2011 1 commit
    • Eric Sandeen's avatar
      ext4: serialize unaligned asynchronous DIO · e9e3bcec
      Eric Sandeen authored
      ext4 has a data corruption case when doing non-block-aligned
      asynchronous direct IO into a sparse file, as demonstrated
      by xfstest 240.
      The root cause is that while ext4 preallocates space in the
      hole, mappings of that space still look "new" and 
      dio_zero_block() will zero out the unwritten portions.  When
      more than one AIO thread is going, they both find this "new"
      block and race to zero out their portion; this is uncoordinated
      and causes data corruption.
      Dave Chinner fixed this for xfs by simply serializing all
      unaligned asynchronous direct IO.  I've done the same here.
      The difference is that we only wait on conversions, not all IO.
      This is a very big hammer, and I'm not very pleased with
      stuffing this into ext4_file_write().  But since ext4 is
      DIO_LOCKING, we need to serialize it at this high level.
      I tried to move this into ext4_ext_direct_IO, but by then
      we have the i_mutex already, and we will wait on the
      work queue to do conversions - which must also take the
      i_mutex.  So that won't work.
      This was originally exposed by qemu-kvm installing to
      a raw disk image with a normal sector-63 alignment.  I've
      tested a backport of this patch with qemu, and it does
      avoid the corruption.  It is also quite a lot slower
      (14 min for package installs, vs. 8 min for well-aligned)
      but I'll take slow correctness over fast corruption any day.
      Mingming suggested that we can track outstanding
      conversions, and wait on those so that non-sparse
      files won't be affected, and I've implemented that here;
      unaligned AIO to nonsparse files won't take a perf hit.
      [tytso@mit.edu: Keep the mutex as a hashed array instead
       of bloating the ext4 inode]
      [tytso@mit.edu: Fix up namespace issues so that global
       variables are protected with an "ext4_" prefix.]
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
  3. 03 Feb, 2011 3 commits
  4. 12 Jan, 2011 1 commit
    • Jan Kara's avatar
      quota: Fix deadlock during path resolution · f00c9e44
      Jan Kara authored
      As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
      is prone to deadlocks. We hold s_umount semaphore for reading during the
      path resolution and resolution itself may need to acquire the semaphore
      for writing when e. g. autofs mountpoint is passed.
      Solve the problem by performing the resolution before we get hold of the
      superblock (and thus s_umount semaphore). The whole thing is complicated
      by the fact that some filesystems (OCFS2) ignore the path argument. So to
      distinguish between filesystem which want the path and which do not we
      introduce new .quota_on_meta callback which does not get the path. OCFS2
      then uses this callback instead of old .quota_on.
      CC: Al Viro <viro@ZenIV.linux.org.uk>
      CC: Christoph Hellwig <hch@lst.de>
      CC: Ted Ts'o <tytso@mit.edu>
      CC: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
  5. 10 Jan, 2011 5 commits
  6. 07 Jan, 2011 1 commit
    • Nick Piggin's avatar
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin authored
      RCU free the struct inode. This will allow:
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
  7. 20 Dec, 2010 2 commits
  8. 16 Dec, 2010 3 commits
  9. 14 Dec, 2010 1 commit
    • Theodore Ts'o's avatar
      ext4: Turn off multiple page-io submission by default · 1449032b
      Theodore Ts'o authored
      Jon Nelson has found a test case which causes postgresql to fail with
      the error:
      psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581
      Under memory pressure, it looks like part of a file can end up getting
      replaced by zero's.  Until we can figure out the cause, we'll roll
      back the change and use block_write_full_page() instead of
      ext4_bio_write_page().  The new, more efficient writing function can
      be used via the mount option mblk_io_submit, so we can test and fix
      the new page I/O code.
      To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
      memory such that the system just at the end of triggering writeback
      before running the following sql script:
      create temporary table foo as select x as a, ARRAY[x] as b FROM
      generate_series(1, 10000000 ) AS x;
      create index foo_a_idx on foo (a);
      create index foo_b_idx on foo USING GIN (b);
      If the temporary table is created on a hard drive partition which is
      encrypted using dm_crypt, then under memory pressure, approximately
      30-40% of the time, pgsql will issue the above failure.
      This patch should fix this problem, and the problem will come back if
      the file system is mounted with the mblk_io_submit mount option.
      Reported-by: default avatarJon Nelson <jnelson@jamponi.net>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
  10. 20 Nov, 2010 1 commit
  11. 19 Nov, 2010 1 commit
    • Darrick J. Wong's avatar
      ext4: ext4_fill_super shouldn't return 0 on corruption · 5a9ae68a
      Darrick J. Wong authored
      At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
      out of that function returns ret.  However, the generic_check_addressable
      clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
      a group checksum error) returns 0 even though the mount should fail.  This
      causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
      A simple fix is to avoid using ret for the generic_check_addressable check,
      which was last changed in commit 30ca22c7
      Signed-off-by: default avatarDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
  12. 18 Nov, 2010 1 commit
  13. 13 Nov, 2010 2 commits
    • Tejun Heo's avatar
      block: clean up blkdev_get() wrappers and their users · d4d77629
      Tejun Heo authored
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
      The new blkdev_get_*() functions are with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    • Tejun Heo's avatar
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo authored
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      * blkdev_get/put() are the primary open and close functions.
      * bd_claim/release() deal with exclusive open.
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      * bd_link/unlink_disk_holder() to create and remove holder/slave
      * open_by_devnum() wraps bdget() + blkdev_get().
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      * open_by_devnum() is modified to take @holder argument and pass it to
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarNeil Brown <neilb@suse.de>
      Acked-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: default avatarMike Snitzer <snitzer@redhat.com>
      Acked-by: default avatarPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
  14. 08 Nov, 2010 3 commits
    • Theodore Ts'o's avatar
      ext4: Add new ext4 inode tracepoints · 7ff9c073
      Theodore Ts'o authored
      Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Dmitry Monakhov's avatar
      ext4: do not try to grab the s_umount semaphore in ext4_quota_off · 87009d86
      Dmitry Monakhov authored
      It's not needed to sync the filesystem, and it fixes a lock_dep complaint.
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
    • Theodore Ts'o's avatar
      ext4: handle writeback of inodes which are being freed · f7ad6d2e
      Theodore Ts'o authored
      The following BUG can occur when an inode which is getting freed when
      it still has dirty pages outstanding, and it gets deleted (in this
      because it was the target of a rename).  In ordered mode, we need to
      make sure the data pages are written just in case we crash before the
      rename (or unlink) is committed.  If the inode is being freed then
      when we try to igrab the inode, we end up tripping the BUG_ON at
      To solve this problem, we need to keep track of the number of io
      callbacks which are pending, and avoid destroying the inode until they
      have all been completed.  That way we don't have to bump the inode
      count to keep the inode from being destroyed; an approach which
      doesn't work because the count could have already been dropped down to
      zero before the inode writeback has started (at which point we're not
      allowed to bump the count back up to 1, since it's already started
      getting freed).
      Thanks to Dave Chinner for suggesting this approach, which is also
      used by XFS.
        kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
        Call Trace:
         [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
         [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
         [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
         [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
         [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
         [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
         [<ffffffff81087910>] do_writepages+0x1c/0x25
         [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
         [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
         [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
         [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
         [<ffffffff810c14a3>] evict+0x22/0x92
         [<ffffffff810c1a3d>] iput+0x212/0x249
         [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
         [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
         [<ffffffff810be613>] dput+0x13a/0x147
         [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
         [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
         [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
         [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
         [<ffffffff810b99c6>] sys_rename+0x16/0x18
         [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b
      Reported-by: default avatarNick Bowler <nbowler@elliptictech.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Tested-by: default avatarNick Bowler <nbowler@elliptictech.com>
  15. 03 Nov, 2010 1 commit
  16. 02 Nov, 2010 2 commits
  17. 29 Oct, 2010 1 commit
  18. 28 Oct, 2010 10 commits
    • Nicolas Kaiser's avatar
    • Theodore Ts'o's avatar
      ext4: make various ext4 functions be static · 1f109d5a
      Theodore Ts'o authored
      These functions have no need to be exported beyond file context.
      No functions needed to be moved for this commit; just some function
      declarations changed to be static and removed from header files.
      (A similar patch was submitted by Eric Sandeen, but I wanted to handle
      code movement in separate patches to make sure code changes didn't
      accidentally get dropped.)
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Theodore Ts'o's avatar
      ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() · 5dabfc78
      Theodore Ts'o authored
      This is a cleanup to avoid namespace leaks out of fs/ext4
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Theodore Ts'o's avatar
      ext4: fix kernel oops if the journal superblock has a non-zero j_errno · 7f93cff9
      Theodore Ts'o authored
      Commit 84061e07
       fixed an accounting bug only to introduce the
      possibility of a kernel OOPS if the journal has a non-zero j_errno
      field indicating that the file system had detected a fs inconsistency.
      After the journal replay, if the journal superblock indicates that the
      file system has an error, this indication is transfered to the file
      system and then ext4_commit_super() is called to write this to the
      But since the percpu counters are now initialized after the journal
      replay, the call to ext4_commit_super() will cause a kernel oops since
      it needs to use the percpu counters the ext4 superblock structure.
      The fix is to skip setting the ext4 free block and free inode fields
      if the percpu counter has not been set.
      Thanks to Ken Sumrall for reporting and analyzing the root causes of
      this bug.
      Addresses-Google-Bug: #3054080
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Lukas Czerner's avatar
      ext4: add batched_discard into ext4 feature list · 27ee40df
      Lukas Czerner authored
      Should be applied on the top of "lazy inode table initialization"
      and "batched discard support" patch-sets.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Lukas Czerner's avatar
      ext4: Add batched discard support for ext4 · 7360d173
      Lukas Czerner authored
      Walk through allocation groups and trim all free extents. It can be
      invoked through FITRIM ioctl on the file system. The main idea is to
      provide a way to trim the whole file system if needed, since some SSD's
      may suffer from performance loss after the whole device was filled (it
      does not mean that fs is full!).
      It search for free extents in allocation groups specified by Byte range
      start -> start+len. When the free extent is within this range, blocks
      are marked as used and then trimmed. Afterwards these blocks are marked
      as free in per-group bitmap.
      Since fstrim is a long operation it is good to have an ability to
      interrupt it by a signal. This was added by Dmitry Monakhov.
      Thanks Dimitry.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Theodore Ts'o's avatar
      ext4: use bio layer instead of buffer layer in mpage_da_submit_io · bd2d0210
      Theodore Ts'o authored
      Call the block I/O layer directly instad of going through the buffer
      layer.  This should give us much better performance and scalability,
      as well as lowering our CPU utilization when doing buffered writeback.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Maciej Żenczykowski's avatar
      ext4: don't update sb journal_devnum when RO dev · c41303ce
      Maciej Żenczykowski authored
      An ext4 filesystem on a read-only device, with an external journal
      which is at a different device number then recorded in the superblock
      will fail to honor the read-only setting of the device and trigger
      a superblock update (write).
      For example:
        - ext4 on a software raid which is in read-only mode
        - external journal on a read-write device which has changed device num
        - attempt to mount with -o journal_dev=<new_number>
        - hits BUG_ON(mddev->ro = 1) in md.c
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarMaciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Lukas Czerner's avatar
      ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Lukas Czerner authored
      User-space should have the opportunity to check what features doest ext4
      support in each particular copy. This adds easy interface by creating new
      "features" directory in sys/fs/ext4/. In that directory files
      advertising feature names can be created.
      Add lazy_itable_init to the feature list.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
    • Lukas Czerner's avatar
      ext4: add support for lazy inode table initialization · bfff6873
      Lukas Czerner authored
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, has not been corrupted However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      Hence, it is important for the inode tables to be initialized as soon
      as possble.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      This is done via a new new kernel thread called ext4lazyinit, which is
      created on demand and destroyed, when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system. When the
      first filesystem with inititable mount option is mounted, ext4lazyinit
      thread is created, then the filesystem can register its request in the
      request list.
      This thread then walks through the list of requests picking up
      scheduled requests and invoking ext4_init_inode_table(). Next schedule
      time for the request is computed by multiplying the time it took to
      zero out last inode table with wait multiplier, which can be set with
      the (init_itable=n) mount option (default is 10).  We are doing
      this so we do not take the whole I/O bandwidth. When the thread is no
      longer necessary (request list is empty) it frees the appropriate
      structures and exits (and can be created later later by another
      We do not disturb regular inode allocations in any way, it just do not
      care whether the inode table is, or is not zeroed. But when zeroing, we
      have to skip used inodes, obviously. Also we should prevent new inode
      allocations from the group, while zeroing is on the way. For that we
      take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
      in the ext4_claim_inode, so when we are unlucky and allocator hits the
      group which is currently being zeroed, it just has to wait.
      This can be suppresed using the mount option no_init_itable.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>