1. 04 Jun, 2011 1 commit
  2. 23 May, 2011 7 commits
  3. 13 May, 2011 3 commits
      btrfs: quasi-round-robin for chunk allocation · 73c5de00
      Arne Jansen authored
      In a multi device setup, the chunk allocator currently always allocates
      chunks on the devices in the same order. This leads to a very uneven
      distribution, especially with RAID1 or RAID10 and an uneven number of
      devices.
      This patch always sorts the devices before allocating, and allocates the
      stripes on the devices with the most available space, as long as there
      is enough space available. In a low space situation, it first tries to
      maximize striping.
      The patch also simplifies the allocator and reduces the checks for
      corner cases.
      The simplification is done by several means. First, it defines the
      properties of each RAID type upfront. These properties are used afterwards
      instead of differentiating cases in several places.
      Second, the old allocator defined a minimum stripe size for each block
      group type, tried to find a large enough chunk, and if this fails just
      allocates a smaller one. This is now done in one step. The largest possible
      chunk (up to max_chunk_size) is searched and allocated.
      Because we now have only one pass, the allocation of the map (struct
      map_lookup) is moved down to the point where the number of stripes is
      already known. This way we avoid reallocation of the map.
      We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
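      The sort-then-allocate idea described in this commit can be sketched in
      userspace C. This is a minimal illustration under stated assumptions, not
      the kernel code: struct dev_info, cmp_avail_desc() and alloc_chunk() are
      hypothetical names, and sizes are abstract units rather than bytes
      rounded to STRIPE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-device record; not the kernel's structure. */
struct dev_info {
	int id;
	uint64_t avail;	/* space still unallocated on this device */
};

/* qsort comparator: device with the most available space first. */
static int cmp_avail_desc(const void *a, const void *b)
{
	const struct dev_info *x = a, *y = b;

	if (x->avail > y->avail)
		return -1;
	if (x->avail < y->avail)
		return 1;
	return 0;
}

/*
 * Sort the devices, then take one stripe of stripe_size from each of the
 * num_stripes devices with the most available space. Returns 0 on success,
 * -1 if fewer than num_stripes devices can hold a stripe of that size.
 */
static int alloc_chunk(struct dev_info *devs, size_t ndevs,
		       size_t num_stripes, uint64_t stripe_size)
{
	size_t i;

	assert(num_stripes > 0);
	if (ndevs < num_stripes)
		return -1;
	qsort(devs, ndevs, sizeof(*devs), cmp_avail_desc);
	/* After sorting, index num_stripes - 1 is the tightest device used. */
	if (devs[num_stripes - 1].avail < stripe_size)
		return -1;
	for (i = 0; i < num_stripes; i++)
		devs[i].avail -= stripe_size;
	return 0;
}
```

      With three equally sized devices and two stripes per chunk, repeated
      calls spread the stripes over all three devices, whereas a fixed device
      order would fill the first two devices and leave the third unused.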
      btrfs: heed alloc_start · a9c9bf68
      Arne Jansen authored
      Currently alloc_start is disregarded if the requested
      chunk size is bigger than (device size - alloc_start),
      but smaller than the device size.
      The only situation where I can see this making sense
      is when a chunk equal to the size of the device has been
      requested. This was possible because the allocator failed to
      take alloc_start into account when calculating the requested
      chunk size. As this patch fixes that, the workaround
      is no longer necessary.
      btrfs: move btrfs_cmp_device_free_bytes to super.c · bcd53741
      Arne Jansen authored
      This function won't be used here anymore, so move it to super.c, where it is
      used for the df calculation.
  4. 12 May, 2011 1 commit
      btrfs: scrub · a2de733c
      Arne Jansen authored
      This adds an initial implementation for scrub. It works in a fairly
      straightforward way. User space issues an ioctl for each device in the
      fs. For each device, it enumerates the allocated device chunks. For
      each chunk, the contained extents are enumerated and the data checksums
      fetched. The extents are read sequentially and the checksums verified.
      If an error occurs (checksum or EIO), a good copy is searched for. If
      one is found, the bad copy will be rewritten.
      All enumerations happen from the commit roots. During a transaction
      commit, the scrubs get paused and afterwards continue from the new
      commit roots.
      This commit is based on the series originally posted to linux-btrfs
      with some improvements that resulted from comments from David Sterba,
      Ilya Dryomov and Jan Schmidt.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
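      The verify-and-repair step of the scrub description can be sketched as
      follows. This is a toy model under stated assumptions: copies live in
      memory instead of on devices, BLOCK_SIZE is arbitrary, and toy_csum()
      is a stand-in (FNV-1a), not the checksum btrfs uses. The names
      toy_csum() and scrub_block() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8	/* toy block size for this sketch */

/* Toy checksum (FNV-1a); a stand-in only, not what btrfs computes. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Check every copy of one block against the expected checksum. Each bad
 * copy is overwritten from the first good copy found. Returns the number
 * of copies repaired, or -1 if no copy matches the checksum.
 */
static int scrub_block(uint8_t copies[][BLOCK_SIZE], size_t ncopies,
		       uint32_t expected)
{
	size_t i, good = ncopies;
	int repaired = 0;

	assert(ncopies > 0);
	for (i = 0; i < ncopies; i++) {
		if (toy_csum(copies[i], BLOCK_SIZE) == expected) {
			good = i;
			break;
		}
	}
	if (good == ncopies)
		return -1;	/* no good copy to repair from */
	for (i = 0; i < ncopies; i++) {
		if (toy_csum(copies[i], BLOCK_SIZE) != expected) {
			memcpy(copies[i], copies[good], BLOCK_SIZE);
			repaired++;
		}
	}
	return repaired;
}
```

      The sketch leaves out everything that makes the real scrub hard:
      enumerating chunks and extents from the commit roots, sequential reads
      per device, and pausing around transaction commits.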
  5. 06 May, 2011 1 commit
  6. 02 May, 2011 3 commits
  7. 20 Apr, 2011 1 commit
      Btrfs: do some plugging in the submit_bio threads · 211588ad
      Chris Mason authored
      The Btrfs submit bio threads have a small number of
      threads responsible for pushing down bios we've collected
      for a large number of devices.
      Since we do all the bios for a single device at once,
      we want to make sure we unplug and send down the bios
      for each device as we're done processing them.
      The new plugging API removed the btrfs code to
      unplug while processing bios, this adds it back with
      the new API.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 28 Mar, 2011 3 commits
      Btrfs: fix __btrfs_map_block on 32 bit machines · d9d04879
      Chris Mason authored
      Recent changes for discard support didn't compile,
      this fixes them not to try and % 64 bit numbers.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP · fce3bb9a
      Li Dongyang authored
      btrfs_map_block() will only return a single stripe length, but we want the
      full extent to be mapped to each disk when we are trimming the extent,
      so we add a length to btrfs_bio_stripe and fill it in if we are mapping for REQ_DISCARD.
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      liubo authored
      Tracepoints can provide insight into why btrfs hits bugs and can be a great
      help for debugging, e.g.:
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      Here is what I have added:
      1) ordered_extent:
      These provide critical information to understand how ordered_extents are
      2) extent_map:
      extent_map is used in both read and write cases, and it is useful for tracking
      how btrfs specific IO is running.
      3) writepage:
      Pages are critical resources and produce a lot of corner cases during writeback,
      so it is valuable to know how a page is written to disk.
      4) inode:
      These can show where and when an inode is created and when an inode is evicted.
      5) sync:
      These show sync arguments.
      6) transaction:
      In a transaction-based filesystem, it is useful to know the generation and
      who does the commit.
      7) back reference and cow:
      Btrfs natively supports back references; these tracepoints help in
      understanding btrfs's COW mechanism.
      8) chunk:
      A chunk is a link between a physical offset and a logical offset, and represents space
      information in btrfs; these tracepoints help in tracing space usage.
      9) reserved_extent:
      These can show how btrfs uses its space.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  9. 10 Mar, 2011 1 commit
  10. 16 Feb, 2011 2 commits
  11. 14 Feb, 2011 1 commit
  12. 01 Feb, 2011 1 commit
  13. 16 Jan, 2011 6 commits
      btrfs: Require CAP_SYS_ADMIN for filesystem rebalance · 6f88a440
      Ben Hutchings authored
      Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
      filesystem and may run uninterruptibly for a long time.  This does not
      seem to be something that an unprivileged user should be able to do.
      Reported-by: Aron Xu <happyaron.xu@gmail.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      btrfs: mount failure return value fix · 20b45077
      Dave Young authored
      I happened to pass a swap partition as the root partition on the kernel command line;
      the kernel then panicked and told me "Cannot open root device".
      That is not correct: it is in fact a filesystem type mismatch, not 'no device'.
      Eventually I found that the btrfs mount failed with -EIO; it should be -EINVAL.
      The logic in init/do_mounts.c:
              for (p = fs_names; *p; p += strlen(p)+1) {
                      int err = do_mount_root(name, p, flags, root_mount_data);
                      switch (err) {
                              case 0:
                                      goto out;
                              case -EACCES:
                                      flags |= MS_RDONLY;
                                      goto retry;
                              case -EINVAL:
                                      continue;
                      }
                      /* any other error: print "Cannot open root device" and panic */
              }
      So any filesystem type tried after btrfs never gets a chance to mount.
      Fix this by returning -EINVAL instead.
      Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
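      The effect of the return value on the retry loop quoted in this commit
      can be modelled in a few lines. This is a simplified sketch, assuming
      that -EINVAL means "wrong filesystem type, try the next one" and that
      any other error ends the search; the -EACCES read-only retry is left
      out. mount_root_model() and the probe_* functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one filesystem's mount attempt on the root device. */
typedef int (*try_mount_fn)(void);

/*
 * Simplified model of the loop over filesystem types: -EINVAL moves on to
 * the next type, any other error stops the search. Returns the index of
 * the type that mounted, or a negative errno.
 */
static int mount_root_model(const try_mount_fn *fs_types, size_t n)
{
	size_t i;

	assert(fs_types != NULL);
	for (i = 0; i < n; i++) {
		int err = fs_types[i]();

		if (err == 0)
			return (int)i;
		if (err == -EINVAL)
			continue;	/* not this filesystem, try the next */
		return err;		/* treated as "Cannot open root device" */
	}
	return -EINVAL;
}

/* Toy probes: a filesystem that rejects the device, and one that accepts it. */
static int probe_returns_eio(void)    { return -EIO; }
static int probe_returns_einval(void) { return -EINVAL; }
static int probe_succeeds(void)       { return 0; }
```

      In this model, a first entry that returns -EIO hides the working second
      entry, while one that returns -EINVAL lets the second entry mount.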
      btrfs: fix wrong free space information of btrfs · 6d07bcec
      Miao Xie authored
      When we store data with a RAID profile in btrfs on two or more disks of different
      sizes, the df command shows some free space in the filesystem even though the
      user cannot actually write any more data. In other words, df reports the wrong
      free space information for btrfs.
       # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 28.00KB
       	devid    1 size 5.01GB used 2.03GB path /dev/sda9
       	devid    2 size 10.00GB used 2.01GB path /dev/sda10
       # btrfs device scan /dev/sda9 /dev/sda10
       # mount /dev/sda9 /mnt
       # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
         (fill the filesystem)
       # sync
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	5.4G	62%	/mnt
       # btrfs-show
       Label: none  uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
       	Total devices 2 FS bytes used 3.99GB
       	devid    1 size 5.01GB used 5.01GB path /dev/sda9
       	devid    2 size 10.00GB used 4.99GB path /dev/sda10
      This happens because btrfs cannot allocate chunks when one of the paired disks has
      no space left. The free space on the other disks can then never be used and should
      be subtracted from the total space, but btrfs doesn't subtract it from
      the total. This is confusing for the user.
      This patch fixes it by calculating the free space that can be used to allocate
      chunks:
      1. Get the free space of all the devices and align it to the stripe length.
      2. Sort the devices by free space.
      3. Check the free space of each device:
         3.1. If it is not zero, check the number of devices that have
              more free space than this device.
              If that number is beyond the min stripe number, the free
              space can be used and is added to the total free space.
              If that number is below the min stripe number, the free space
              cannot be used and the check ends.
         3.2. If the free space is zero, check the next device (go to 3.1).
      This implementation is essentially a fake chunk allocation.
      After applying this patch, df shows the correct space information:
       # df -TH
       Filesystem	Type	Size	Used	Avail	Use%	Mounted on
       /dev/sda9	btrfs	17G	8.6G	0	100%	/mnt
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
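      The "fake chunk allocation" idea can be illustrated with a small
      simulation. This is a sketch under stated assumptions, not the patch's
      exact procedure: it repeatedly takes one stripe from each of the
      min_stripes devices with the most free space and counts what could be
      allocated. usable_free_space() and cmp_u64_desc() are hypothetical
      names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: largest value first. */
static int cmp_u64_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * Simulate chunk allocation to find how much raw device space can really
 * be used. free_bytes[] is modified in place. Each simulated chunk takes
 * stripe_len from each of the min_stripes devices with the most free
 * space; the loop stops when fewer than min_stripes devices can supply one.
 */
static uint64_t usable_free_space(uint64_t *free_bytes, size_t ndevs,
				  size_t min_stripes, uint64_t stripe_len)
{
	uint64_t total = 0;
	size_t i;

	assert(stripe_len > 0 && min_stripes > 0);
	if (ndevs < min_stripes)
		return 0;
	/* Align each device's free space down to the stripe length. */
	for (i = 0; i < ndevs; i++)
		free_bytes[i] -= free_bytes[i] % stripe_len;
	for (;;) {
		/* Sort so that index min_stripes - 1 is the limiting device. */
		qsort(free_bytes, ndevs, sizeof(*free_bytes), cmp_u64_desc);
		if (free_bytes[min_stripes - 1] < stripe_len)
			break;
		for (i = 0; i < min_stripes; i++)
			free_bytes[i] -= stripe_len;
		total += min_stripes * stripe_len;
	}
	return total;
}
```

      For two devices with 5 and 10 units free and two stripes per chunk, the
      simulation reports 10 usable raw units; the other 5 units on the larger
      device have no partner and do not count as available.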
      btrfs: make the chunk allocator utilize the devices better · b2117a39
      Miao Xie authored
      With this patch, we change how the allocator behaves when it cannot get enough free
      extents of the default size.
      1. Look up a suitable free extent on each device and keep the search result.
         If no suitable free extent is found, keep the max free extent.
      2. If we get enough suitable free extents of the default size, chunk allocation
         succeeds.
      3. If we cannot get enough free extents, but the number of extents of the
         default size is >= min_stripes, we just change the mapping information
         (reduce the number of stripes in the extent map), and chunk allocation
         succeeds.
      4. If the number of extents of the default size is < min_stripes, sort the
         devices by the size of their max free extent, descending.
      5. Use the size of the max free extent on the (num_stripes - 1)th device as the
         stripe size to allocate the device space.
      This way, the chunk allocator can allocate chunks as large as possible when
      the devices are short of space, and makes full use of the devices.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
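      Steps 4 and 5 can be sketched as a sort followed by a single lookup.
      This assumes "(num_stripes - 1)th device" means index num_stripes - 1
      (zero-based) after sorting in descending order, which is one reading of
      the message. fallback_stripe_size() and cmp_extent_desc() are
      hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: largest free extent first. */
static int cmp_extent_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x < y) - (x > y);
}

/*
 * max_free[] holds the largest free extent of each device and is sorted in
 * place. Returns the largest stripe size that num_stripes devices can all
 * provide, or 0 if there are fewer than num_stripes devices.
 */
static uint64_t fallback_stripe_size(uint64_t *max_free, size_t ndevs,
				     size_t num_stripes)
{
	assert(num_stripes > 0);
	if (ndevs < num_stripes)
		return 0;
	qsort(max_free, ndevs, sizeof(*max_free), cmp_extent_desc);
	/* The smallest extent among the top num_stripes devices is the limit. */
	return max_free[num_stripes - 1];
}
```

      Picking the extent at that index gives the largest stripe that every
      one of the chosen devices can still hold.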
      btrfs: restructure find_free_dev_extent() · 7bfc837d
      Miao Xie authored
      - make it return the start position and length of the max free space when it
        cannot find a suitable free space.
      - make it more readable
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      btrfs: fix wrong calculation of stripe size · 1974a3b4
      Miao Xie authored
      There are two tiny problems:
      - One is that when we check whether the chunk size is greater than the max chunk size,
        we should take mirrors into account, but the original code didn't.
      - The other is that btrfs shouldn't use the size of the residual free space as the
        length of a dup chunk when doing chunk allocation. The device
        space that a dup chunk needs is twice as large as the chunk size, so if we use
        the size of the residual free space as the length of a dup chunk, we cannot
        get enough free space. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
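      The arithmetic behind the dup case is short enough to write down. This
      sketch assumes a chunk stored in `copies` copies on one device needs
      copies times its length in free space; max_chunk_len() is a
      hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest chunk length that fits when every byte of the chunk is stored
 * `copies` times in free_space, capped at max_chunk_size.
 */
static uint64_t max_chunk_len(uint64_t free_space, unsigned int copies,
			      uint64_t max_chunk_size)
{
	uint64_t len;

	assert(copies > 0);
	len = free_space / copies;	/* all copies must fit */
	return len < max_chunk_size ? len : max_chunk_size;
}
```

      With 10 units of residual free space, a dup chunk (two copies) can be at
      most 5 units long; using the full 10 as the chunk length would require
      20 units of device space.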
  14. 14 Dec, 2010 1 commit
      Btrfs: account for missing devices in RAID allocation profiles · cd02dca5
      Chris Mason authored
      When we mount in RAID degraded mode without adding a new device to
      replace the failed one, we can end up using the wrong RAID flags for
      allocations.
      This results in strange combinations of block groups (raid1 in a raid10
      filesystem) and corruptions when we try to allocate blocks from single
      spindle chunks on drives that are actually missing.
      The first device has two small 4MB chunks in it that mkfs creates and
      these are usually unused in a raid1 or raid10 setup.  But, in -o degraded,
      the allocator will fall back to these because the mask of desired raid groups
      isn't correct.
      The fix here is to count the missing devices as we build up the list
      of devices in the system.  This count is used when picking the
      raid level to make sure we continue using the same levels that were
      in place before we lost a drive.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
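      How counting missing devices keeps the RAID level stable can be shown
      with a small mask function. This is a sketch under stated assumptions:
      the PROF_* flags and reduce_profile() are hypothetical, and the
      thresholds (more than one device for RAID0/RAID1, at least four for
      RAID10) are chosen for illustration.

```c
#include <assert.h>

/* Hypothetical profile flags for this sketch. */
#define PROF_RAID0	(1u << 0)
#define PROF_RAID1	(1u << 1)
#define PROF_RAID10	(1u << 2)

/*
 * Drop the profiles that the device count cannot support. Missing devices
 * are counted too, so a degraded mount keeps the levels it had before a
 * drive was lost.
 */
static unsigned int reduce_profile(unsigned int wanted,
				   unsigned int rw_devices,
				   unsigned int missing_devices)
{
	unsigned int num_devices = rw_devices + missing_devices;

	assert(num_devices > 0);
	if (num_devices == 1)
		wanted &= ~(PROF_RAID0 | PROF_RAID1);
	if (num_devices < 4)
		wanted &= ~PROF_RAID10;
	return wanted;
}
```

      With one writable device and no missing-device count, a RAID1 request is
      reduced to a single-device profile; counting the one missing device
      keeps RAID1 in place.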
  15. 13 Nov, 2010 3 commits
      block: clean up blkdev_get() wrappers and their users · d4d77629
      Tejun Heo authored
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
      The new blkdev_get_*() functions are with proper docbook comments.
      While at it, add function description to blkdev_get() too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo authored
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      * blkdev_get/put() are the primary open and close functions.
      * bd_claim/release() deal with exclusive open.
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      * open_by_devnum() wraps bdget() + blkdev_get().
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open is for another
      exclusive access.  Reorganize the interface such that,
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if @FMODE_EXCL is specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      btrfs: close_bdev_exclusive() should use the same @flags as the matching open_bdev_exclusive() · 37004c42
      Tejun Heo authored
      In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
      is called with @flags which doesn't match the one used during
      open_bdev_exclusive().  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <chris.mason@oracle.com>
  16. 29 Oct, 2010 2 commits
  17. 10 Sep, 2010 1 commit
  18. 07 Aug, 2010 1 commit
      block: unify flags for struct bio and struct request · 7b6d91da
      Christoph Hellwig authored
      Remove the current bio flags and reuse the request flags for the bio, too.
      This makes it easier to trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superfluous RW in them.
      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  19. 25 May, 2010 1 commit