1. 07 Aug, 2014 1 commit
  2. 26 May, 2014 2 commits
  3. 13 May, 2014 1 commit
  4. 21 Apr, 2014 1 commit
  5. 01 Apr, 2014 5 commits
    • ext4: add cross rename support · bd42998a
      Miklos Szeredi authored
      
      
      Implement RENAME_EXCHANGE flag in renameat2 syscall.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      bd42998a
    • ext4: rename: split out helper functions · bd1af145
      Miklos Szeredi authored
      
      
      Cross rename (exchange source and dest) will need to call some of these
      helpers for both source and dest, while overwriting rename currently only
      calls them for one or the other.  This also makes the code easier to
      follow.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      bd1af145
    • ext4: rename: move EMLINK check up · 0d7d5d67
      Miklos Szeredi authored
      
      
      Move checking i_nlink from after ext4_get_first_dir_block() to before.  The
      check doesn't rely on the result of that function and the function only
      fails on fs corruption, so the order shouldn't matter.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      0d7d5d67
    • ext4: rename: create ext4_renament structure for local vars · c0d268c3
      Miklos Szeredi authored
      
      
      Need to split up ext4_rename() into helpers but there are too many local
      variables involved, so create a new structure.  This also, apparently,
      makes the generated code size slightly smaller.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      c0d268c3
    • vfs: add RENAME_NOREPLACE flag · 0a7c3937
      Miklos Szeredi authored
      
      
      If this flag is specified and the target of the rename exists then the
      rename syscall fails with EEXIST.
      
      The VFS does the existence checking, so it is trivial to enable for most
      local filesystems.  This patch only enables it in ext4.
      
      For network filesystems the VFS check is not enough as there may be a race
      between a remote create and the rename, so these filesystems need to handle
      this flag in their ->rename() implementations to ensure atomicity.
      
      Andy writes about why this is useful:
      
      "The trivial answer: to eliminate the race condition from 'mv -i'.
      
      Another answer: there's a common pattern to atomically create a file
      with contents: open a temporary file, write to it, optionally fsync
      it, close it, then link(2) it to the final name, then unlink the
      temporary file.
      
      The reason to use link(2) is because it won't silently clobber the destination.
      
      This is annoying:
       - It requires an extra system call that shouldn't be necessary.
       - It doesn't work on (IMO sensible) filesystems that don't support
      hard links (e.g. vfat).
       - It's not atomic -- there's an intermediate state where both files exist.
       - It's ugly.
      
      The new rename flag will make this totally sensible.
      
      To be fair, on new enough kernels, you can also use O_TMPFILE and
      linkat to achieve the same thing even more cleanly."
      
      Suggested-by: Andy Lutomirski <luto@amacapital.net> 
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reviewed-by: J. Bruce Fields <bfields@redhat.com>
      0a7c3937
  6. 26 Jan, 2014 1 commit
  7. 06 Jan, 2014 1 commit
  8. 15 Oct, 2013 1 commit
  9. 17 Aug, 2013 2 commits
    • ext4: allocate delayed allocation blocks before rename · 0e202704
      Theodore Ts'o authored
      
      
      When ext4_rename() overwrites an already existing file, call
      ext4_alloc_da_blocks() before starting the journal handle which
      actually does the rename, instead of doing this afterwards.  This
      improves the likelihood that the contents will survive a crash if an
      application replaces a file using the sequence:
      
      1)  write replacement contents to foo.new
      2)  <omit fsync of foo.new>
      3)  rename foo.new to foo
      
      It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
      doing a file integrity sync; this means if foo.new is a very large
      file, it may not be completely flushed out to disk.
      
      However, for files smaller than a megabyte or so, any dirty pages
      should be flushed out before we do the rename operation, and so at the
      next journal commit, the CACHE FLUSH command will make sure all of
      these pages are safely on the disk platter.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0e202704
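For contrast with the sequence above, which omits the fsync, here is a sketch of the replace-via-rename pattern with the fsync kept in. The helper name `replace_file` and its parameters are hypothetical. It does not fsync the containing directory, which an application needing the rename itself to be durable would also have to consider.

```c
#include <fcntl.h>
#include <stdio.h>   /* rename() */
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with `len` bytes from `buf`,
 * staging the data in `tmp_path` (which should be on the same
 * filesystem). Returns 0 on success, -1 on error.
 */
static int replace_file(const char *path, const char *tmp_path,
			const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	/* 1) write replacement contents to the temporary file */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0)
			goto fail;
		p += n;
		len -= (size_t)n;
	}

	/* 2) flush the data to disk before the rename makes it visible */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}

	/* 3) atomically replace the old file */
	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp_path);
	return -1;
}
```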
    • ext4: start handle at the last possible moment when renaming files · 5b61de75
      Theodore Ts'o authored
      
      
      In ext4_rename(), don't start the journal handle until the
      directory entries have been successfully looked up.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5b61de75
  10. 21 Jul, 2013 1 commit
    • ext4: fix a BUG when opening a file with O_TMPFILE flag · e94bd349
      Zheng Liu authored
      
      
      When we try to open a file with O_TMPFILE flag, we will trigger a bug.
      The root cause is that in ext4_orphan_add() we check ->i_nlink == 0 and
      this check always fails because we set ->i_nlink = 1 in
      inode_init_always().  We can use the following program to trigger it:
      
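      #define _GNU_SOURCE	/* for O_TMPFILE */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      
      /* Usage: pass a directory on the ext4 filesystem as argv[1]. */
      int main(int argc, char *argv[])
      {
      	int fd;
      
      	/* Create an unnamed temporary file inside the given directory. */
      	fd = open(argv[1], O_TMPFILE, 0666);
      	if (fd < 0) {
      		perror("open ");
      		return -1;
      	}
      	close(fd);
      	return 0;
      }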
      
      The oops message looks like this:
      
      kernel BUG at fs/ext4/namei.c:2572!
      invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Modules linked in: dlci bridge stp hidp cmtp kernelcapi l2tp_ppp l2tp_netlink l2tp_core sctp libcrc32c rfcomm tun fuse nfnetlink can_raw ipt_ULOG can_bcm x25 scsi_transport_iscsi ipx p8023 p8022 appletalk phonet psnap vmw_vsock_vmci_transport af_key vmw_vmci rose vsock atm can netrom ax25 af_rxrpc irda pppoe pppox ppp_generic slhc bluetooth nfc rfkill rds caif_socket caif crc_ccitt af_802154 llc2 llc snd_hda_codec_realtek snd_hda_intel snd_hda_codec serio_raw snd_pcm pcspkr edac_core snd_page_alloc snd_timer snd soundcore r8169 mii sr_mod cdrom pata_atiixp radeon backlight drm_kms_helper ttm
      CPU: 1 PID: 1812571 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #12
      Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
      task: ffff88007dfe69a0 ti: ffff88010f7b6000 task.ti: ffff88010f7b6000
      RIP: 0010:[<ffffffff8125ce69>]  [<ffffffff8125ce69>] ext4_orphan_add+0x299/0x2b0
      RSP: 0018:ffff88010f7b7cf8  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff8800966d3020 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88007dfe70b8 RDI: 0000000000000001
      RBP: ffff88010f7b7d40 R08: ffff880126a3c4e0 R09: ffff88010f7b7ca0
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801271fd668
      R13: ffff8800966d2f78 R14: ffff88011d7089f0 R15: ffff88007dfe69a0
      FS:  00007f70441a3740(0000) GS:ffff88012a800000(0000) knlGS:00000000f77c96c0
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000002834000 CR3: 0000000107964000 CR4: 00000000000007e0
      DR0: 0000000000780000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Stack:
       0000000000002000 00000020810b6dde 0000000000000000 ffff88011d46db00
       ffff8800966d3020 ffff88011d7089f0 ffff88009c7f4c10 ffff88010f7b7f2c
       ffff88007dfe69a0 ffff88010f7b7da8 ffffffff8125cfac ffff880100000004
      Call Trace:
       [<ffffffff8125cfac>] ext4_tmpfile+0x12c/0x180
       [<ffffffff811cba78>] path_openat+0x238/0x700
       [<ffffffff8100afc4>] ? native_sched_clock+0x24/0x80
       [<ffffffff811cc647>] do_filp_open+0x47/0xa0
       [<ffffffff811db73f>] ? __alloc_fd+0xaf/0x200
       [<ffffffff811ba2e4>] do_sys_open+0x124/0x210
       [<ffffffff81010725>] ? syscall_trace_enter+0x25/0x290
       [<ffffffff811ba3ee>] SyS_open+0x1e/0x20
       [<ffffffff816ca8d4>] tracesys+0xdd/0xe2
       [<ffffffff81001001>] ? start_thread_common.constprop.6+0x1/0xa0
      Code: 04 00 00 00 89 04 24 31 c0 e8 c4 77 04 00 e9 43 fe ff ff 66 25 00 d0 66 3d 00 80 0f 84 0e fe ff ff 83 7b 48 00 0f 84 04 fe ff ff <0f> 0b 49 8b 8c 24 50 07 00 00 e9 88 fe ff ff 0f 1f 84 00 00 00
      
      Here we couldn't call clear_nlink() directly because in d_tmpfile() we
      will call inode_dec_link_count() to decrease ->i_nlink.  So this commit
      tries to call d_tmpfile() before ext4_orphan_add() to fix this problem.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Tested-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      e94bd349
  11. 03 Jul, 2013 1 commit
  12. 01 Jul, 2013 1 commit
  13. 19 Apr, 2013 2 commits
    • ext4: fix readdir error in the case of inline_data+dir_index · 8af0f082
      Tao Ma authored
      
      
      Zach reported that when inline data is enabled, we don't
      distinguish between the offsets of '.' and '..'.  A getdents
      call will fail if the user only wants to get '.', and what's worse,
      if a conversion happens while the user is calling getdents
      repeatedly, the same entry may be returned twice.
      
      In theory, a dir block would also fail if it is converted to a
      hashed-index based dir, since f_pos will become a hash value, not the
      real one, but that doesn't happen.  A deeper investigation shows that
      we use a hash based solution even for a normal dir if the dir_index
      feature is enabled.
      
      So this patch just adds a new htree_inlinedir_to_tree for inline dirs,
      and if we find that the hash index is supported, we handle it the same
      way as a dir block.
      Reported-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      8af0f082
    • ext4: move quota initialization out of inode allocation transaction · eb9cc7e1
      Jan Kara authored
      
      
      Inode allocation transaction is pretty heavy (246 credits with quotas
      and extents before previous patch, still around 200 after it).  This is
      mostly due to credits required for allocation of quota structures
      (credits there are heavily overestimated but it's difficult to make
      better estimates if we don't want to wire non-trivial assumptions about
      quota format into filesystem).
      
      So move quota initialization out of allocation transaction. That way
      transaction for quota structure allocation will be started only if we
      need to look up quota structure on disk (rare) and furthermore it will
      be started for each quota type separately, not for all of them at once.
      This reduces maximum transaction size to 34 in most cases and to 73 in
      the worst case.
      
      [ Modified by tytso to clean up the cleanup paths for error handling.
        Also use a separate call to ext4_std_error() for each failure so it
        is easier for someone who is debugging a problem in this function to
        determine which function call failed. ]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      eb9cc7e1
  14. 10 Apr, 2013 1 commit
  15. 23 Feb, 2013 1 commit
  16. 15 Feb, 2013 2 commits
    • ext4: use ERR_PTR() abstraction for ext4_append() · 0f70b406
      Theodore Ts'o authored
      
      
      Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
      pointer to an integer for the error code, as a code cleanup.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      0f70b406
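The ERR_PTR()/IS_ERR() idiom mentioned above can be illustrated outside the kernel. The following is a userspace re-creation for illustration only (the kernel's own definitions live in include/linux/err.h); `alloc_block` is a hypothetical stand-in for a function such as ext4_append() and is not taken from the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest errno value that can be encoded in a pointer (as in the kernel). */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

/* Recover the negative errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

/* True if ptr falls in the top MAX_ERRNO addresses, i.e. encodes an error. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical function: returns either a valid buffer or an encoded
 * error, instead of taking a separate "int *err" out-parameter.
 */
static void *alloc_block(size_t size)
{
	void *p;

	if (size == 0)
		return ERR_PTR(-EINVAL);
	p = malloc(size);
	return p ? p : ERR_PTR(-ENOMEM);
}
```

A caller checks `IS_ERR(p)` and, if it is true, uses `PTR_ERR(p)` to get the error code, so a single return value carries both the result and the failure reason.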
    • ext4: refactor code to read directory blocks into ext4_read_dirblock() · dc6982ff
      Theodore Ts'o authored
      
      
      The code to read in directory blocks and verify their metadata
      checksums was replicated in ten different places across
      fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
      those replicated sites.  In some cases, ext4_error() was called with a
      trailing newline.  In others, in particular in empty_dir(), it was
      possible to call ext4_dirent_csum_verify() on an index block, which
      would trigger false warnings requesting the system administrator to
      run e2fsck.
      
      By refactoring the code, we make the code more readable, as well as
      shrinking the compiled object file by over 700 bytes and 50 lines of
      code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      dc6982ff
  17. 09 Feb, 2013 5 commits
    • ext4: start handle at the last possible moment when creating inodes · 1139575a
      Theodore Ts'o authored
      
      
      In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
      until the inode has been successfully allocated.  In order to do this,
      we need to start the handle in the ext4_new_inode().  So create a new
      variant of this function, ext4_new_inode_start_handle(), so the handle
      can be created at the last possible minute, before we need to modify
      the inode allocation bitmap block.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      1139575a
    • ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir() · 64044abf
      Theodore Ts'o authored
      
      
      The ext4_unlink() and ext4_rmdir() don't actually release the blocks
      associated with the file/directory.  This gets done in a separate jbd2
      handle called via ext4_evict_inode().  Thus, we don't need to reserve
      lots of journal credits for the truncate.
      
      Note that using too many journal credits is non-optimal because it can
      lead to the journal transaction getting closed too early, before it is
      strictly necessary.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      64044abf
    • ext4: start handle at the last possible moment in ext4_rmdir() · 8dcfaad2
      Theodore Ts'o authored
      
      
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      8dcfaad2
    • ext4: start handle at the last possible moment in ext4_unlink() · 931b6864
      Theodore Ts'o authored
      
      
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      931b6864
    • ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o authored
      
      
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      9924a92a
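The figures in the message above (a filter of "interval > 5" corresponding to 20ms, and 311 jiffies being over 1.2 seconds) are consistent with a tick rate of 250 jiffies per second. The message does not state the tick rate, so HZ=250 is an assumption in this small sketch, and the helper name is hypothetical.

```c
/* Assumption: the traced kernel ran with HZ=250 (4 ms per jiffy). */
#define ASSUMED_HZ 250UL

/* Convert a jbd2_handle_stats "interval" value (in jiffies) to milliseconds. */
static unsigned long handle_interval_to_ms(unsigned long jiffies)
{
	return jiffies * 1000UL / ASSUMED_HZ;
}
```

Under that assumption, an interval of 5 maps to 20 ms and the example handle's 311 jiffies map to 1244 ms. On a kernel built with a different HZ, the filter threshold would need to be scaled accordingly.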
  18. 29 Jan, 2013 4 commits
  19. 21 Jan, 2013 1 commit
  20. 07 Jan, 2013 2 commits
  21. 27 Dec, 2012 1 commit
    • ext4: avoid hang when mounting non-journal filesystems with orphan list · 0e9a9a1a
      Theodore Ts'o authored
      
      
      When trying to mount a file system which does not contain a journal,
      but which does have an orphan list containing an inode which needs to
      be truncated, the mount call will hang forever in
      ext4_orphan_cleanup() because ext4_orphan_del() will return
      immediately without removing the inode from the orphan list, leading
      to an uninterruptible loop in kernel code which will busy out one of
      the CPUs on the system.
      
      This can be trivially reproduced by trying to mount the file system
      found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
      source tree.  If a malicious user were to put this on a USB stick, and
      mount it on a Linux desktop which has automatic mounts enabled, this
      could be considered a potential denial of service attack.  (Not a big
      deal in practice, but professional paranoids worry about such things,
      and have even been known to allocate CVE numbers for such problems.)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
      0e9a9a1a
  22. 10 Dec, 2012 3 commits
    • ext4: Remove CONFIG_EXT4_FS_XATTR · 939da108
      Tao Ma authored
      
      
      Ted has sent out an RFC about removing this feature. Eric and Jan
      confirmed that both RedHat and SUSE enable this feature in all their
      products.  David also said that "As far as I know, it's enabled in all
      Android kernels that use ext4."  So it seems OK for us.
      
      What's more, the inline data implementation depends on xattr,
      and to be frank, I haven't run any tests with inline data enabled and
      xattr disabled.  So I think we should add inline data and remove this
      config option in the same release.
      
      [ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
        isn't much in the grand scheme of things.  Since no one seems to be
        testing this configuration except for some automated compile farms, on
        balance we are better off removing this config option, so that it is
        effectively always enabled. -- tytso ]
      
      Cc: David Brown <davidb@codeaurora.org>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      939da108
    • ext4: let ext4_rename handle inline dir · 32f7f22c
      Tao Ma authored
      
      
      In case we rename a directory, ext4_rename has to read the dir block
      and change its dotdot's information.  The old ext4_rename encapsulated
      the dir_block read into itself.  So this patch adds a new function
      ext4_get_first_dir_block() which gets the dir buffer information so
      the ext4_rename can handle it properly.  As it will also change the
      parent inode number, we return the parent_de so that ext4_rename() can
      handle it more easily.
      
      ext4_find_entry is also changed so that the caller (rename) can tell
      whether the found entry is an inlined one or not and journal the
      corresponding buffer head.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      32f7f22c
    • ext4: let empty_dir handle inline dir · 61f86638
      Tao Ma authored
      
      
      empty_dir is used when deleting a dir.  So it should handle inline dir
      properly.
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      61f86638