1. 22 Sep, 2016 1 commit
  2. 07 Aug, 2016 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe authored
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      No intended functional changes in this commit.
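      The layout described above can be pictured with a small sketch. The bit
      widths, op codes and helper names below are hypothetical and chosen only
      for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; the kernel's constants differ. */
#define OP_BITS   3u
#define OP_SHIFT  (32u - OP_BITS)                    /* op code sits in the top bits */
#define FLAG_MASK ((UINT32_C(1) << OP_SHIFT) - 1u)   /* flags sit below the op code */

enum { OP_READ = 0, OP_WRITE = 1, OP_DISCARD = 2 };  /* illustrative op codes */
#define FLAG_SYNC (UINT32_C(1) << 4)                 /* illustrative flag bit */

/* Replace the op code while preserving the flag bits. */
static inline uint32_t opf_set_op(uint32_t opf, uint32_t op)
{
    assert(op < (UINT32_C(1) << OP_BITS));
    return (opf & FLAG_MASK) | (op << OP_SHIFT);
}

/* Extract the op code from the high bits. */
static inline uint32_t opf_op(uint32_t opf)
{
    return opf >> OP_SHIFT;
}

/* Extract the flags from the low bits. */
static inline uint32_t opf_flags(uint32_t opf)
{
    return opf & FLAG_MASK;
}
```

      In this sketch, code that stores a small integer straight into the field
      only touches the flag area and leaves the op code at zero, which is the
      kind of silent breakage the rename turns into a compile error.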
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 26 Jul, 2016 8 commits
  4. 07 Jul, 2016 1 commit
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik authored
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency's sake.  Thanks,
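      A toy model of a migrate helper with an update_size argument might look
      like the following. The struct and function names are made up for this
      sketch and it does not reproduce the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a block reservation; not the kernel's struct btrfs_block_rsv. */
struct toy_block_rsv {
    uint64_t size;      /* how many bytes the reservation is meant to hold */
    uint64_t reserved;  /* how many bytes are currently reserved */
};

/*
 * Move num_bytes of reserved space from src to dst.
 * When update_size is false, dst->size is left alone, so a reservation
 * that should stay at a fixed size does not grow on every migration.
 */
static int toy_rsv_migrate(struct toy_block_rsv *src, struct toy_block_rsv *dst,
                           uint64_t num_bytes, bool update_size)
{
    assert(src && dst);
    if (src->reserved < num_bytes)
        return -ENOSPC;

    src->reserved -= num_bytes;
    dst->reserved += num_bytes;
    if (update_size)
        dst->size += num_bytes;
    return 0;
}
```

      Callers that want a fixed-size reservation (truncate and evict in the
      message above) would pass false for update_size.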
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 25 Jun, 2016 1 commit
    • Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval authored
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      Tested with xfstests and by doing a parallel kernel build:
      	while make tinyconfig && make -j4 && git clean dqfx; do
      along with a bunch of parallel finds in another shell:
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 23 Jun, 2016 1 commit
    • Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik authored
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
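      The idea of remembering the transaction id and bailing out once it
      changes can be sketched as below. This is a toy model with invented
      names (toy_fs, toy_throttle), not the Btrfs code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a filesystem whose transaction id advances on every commit. */
struct toy_fs {
    uint64_t running_transid;  /* id of the transaction currently running */
    unsigned pending_refs;     /* delayed refs still waiting to be flushed */
};

/*
 * Flush up to `budget` pending refs on behalf of a task that started
 * throttling while transaction `start_transid` was running.
 * Returns the number of refs flushed. If the running transaction is no
 * longer the one we started in, the throttling no longer matters and we
 * stop immediately.
 */
static unsigned toy_throttle(struct toy_fs *fs, uint64_t start_transid,
                             unsigned budget)
{
    unsigned done = 0;

    assert(fs);
    while (done < budget && fs->pending_refs > 0) {
        if (fs->running_transid != start_transid)
            break;  /* transaction changed: bail out */
        fs->pending_refs--;
        done++;
    }
    return done;
}
```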
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 17 Jun, 2016 1 commit
  8. 07 Jun, 2016 5 commits
  9. 03 Jun, 2016 1 commit
    • Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason authored
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      Josef made a script to trigger with xfs_io:
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST.
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 25 May, 2016 1 commit
  11. 18 May, 2016 1 commit
  12. 17 May, 2016 1 commit
  13. 13 May, 2016 11 commits
    • Btrfs: add semaphore to synchronize direct IO writes with fsync · 5f9a8a51
      Filipe Manana authored
      Due to the optimization of lockless direct IO writes (the inode's i_mutex
      is not held) introduced in commit 38851cc1 ("Btrfs: implement unlocked
      dio write"), we started having races between such writes with concurrent
      fsync operations that use the fast fsync path. These races were addressed
      in the patches titled "Btrfs: fix race between fsync and lockless direct
      IO writes" and "Btrfs: fix race between fsync and direct IO writes for
      prealloc extents". The races happened because the direct IO path, like
      every other write path, creates extent maps followed by the
      corresponding ordered extents, while the fast fsync path first collected
      ordered extents and then extent maps. This made it possible
      to log file extent items (based on the collected extent maps) without
      waiting for the corresponding ordered extents to complete (get their IO
      done). The two fixes mentioned before added a solution that consists of
      making the direct IO path create first the ordered extents and then the
      extent maps, while the fsync path attempts to collect any new ordered
      extents once it collects the extent maps. This was simple and did not
      require adding any synchronization primitive to any data structure (struct
      btrfs_inode for example) but it makes things more fragile for future
      development endeavours and adds an exceptional approach compared to the
      other write paths.
      This change adds a read-write semaphore to the btrfs inode structure and
      makes the direct IO path create the extent maps and the ordered extents
      while holding read access on that semaphore, while the fast fsync path
      collects extent maps and ordered extents while holding write access on
      that semaphore. The logic for direct IO write path is encapsulated in a
      new helper function that is used both for cow and nocow direct IO writes.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix race between block group relocation and nocow writes · f78c436c
      Filipe Manana authored
      Relocation of a block group waits for all existing tasks flushing
      delalloc, starting direct IO writes and any ordered extents before
      starting the relocation process. However for direct IO writes that end
      up doing nocow (inode either has the flag nodatacow set or the write is
      against a prealloc extent) we have a short time window that allows for a
      race that makes relocation proceed without waiting for the direct IO
      write to complete first, resulting in data loss after the relocation
      finishes. This is illustrated by the following diagram:
                 CPU 1                                     CPU 2
       btrfs_relocate_block_group(bg X)
                                                     direct IO write starts against
                                                     an extent in block group X
                                                     using nocow mode (inode has the
                                                     nodatacow flag or the write is
                                                     for a prealloc extent)
                                                         --> can_nocow_extent() returns 1
         btrfs_inc_block_group_ro(bg X)
           --> turns block group into RO mode
           --> returns and does not know about
               the DIO write happening at CPU 2
               (the task there has not created
                yet an ordered extent)
         relocate_block_group(bg X)
           --> rc->stage == MOVE_DATA_EXTENTS
             --> returns extent that the DIO
                 write is going to write to
               --> reads the extent from disk into
                   pages belonging to the relocation
                   inode and dirties them
                                                         --> creates DIO ordered extent
                                                         --> submits bio against a location
                                                             on disk obtained from an extent
                                                             map before the relocation started
           --> writes all the pages read before
               to disk (belonging to the
               relocation inode)
         relocation finishes
                                                       bio completes and wrote new data
                                                       to the old location of the block
      So fix this by tracking the number of nocow writers for a block group and
      making sure relocation waits for that number to go down to 0 before starting
      to move the extents.
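      A single-threaded toy model of the counting scheme is sketched below. All
      names are invented, and refusing new nocow writers once the block group
      is read-only is an assumption of this sketch rather than something stated
      in the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy block group; field names are illustrative, not the kernel's. */
struct toy_block_group {
    bool read_only;          /* set once relocation has marked the group RO */
    unsigned nocow_writers;  /* in-flight nocow writes targeting this group */
};

/*
 * A nocow write registers itself before using the group.
 * Assumption of this sketch: registration is refused once the group is
 * read-only, so the writer has to take another path.
 */
static bool toy_inc_nocow_writers(struct toy_block_group *bg)
{
    if (bg->read_only)
        return false;
    bg->nocow_writers++;
    return true;
}

/* Called when the nocow write has created its ordered extent. */
static void toy_dec_nocow_writers(struct toy_block_group *bg)
{
    assert(bg->nocow_writers > 0);
    bg->nocow_writers--;
}

/* Relocation may start moving extents only when no nocow writer remains. */
static bool toy_relocation_may_proceed(const struct toy_block_group *bg)
{
    return bg->read_only && bg->nocow_writers == 0;
}
```

      Real code would need atomic counters and a wait queue instead of a
      polling predicate; the sketch only shows the ordering constraint.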
      The same race can also happen with buffered writes in nocow mode since the
      patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
      when relocating", because we are no longer flushing all delalloc which
      served as a synchronization mechanism (due to page locking) and ensured
      the ordered extents for nocow buffered writes were created before we
      called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
      mode existed before that patch (no pages are locked or used during direct
      IO) and that fixed only races with direct IO writes that do cow.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix race between fsync and direct IO writes for prealloc extents · 0b901916
      Filipe Manana authored
      When we do a direct IO write against a preallocated extent (fallocate)
      that does not go beyond the i_size of the inode, we do the write operation
      without holding the inode's i_mutex (an optimization that landed in
      commit 38851cc1 ("Btrfs: implement unlocked dio write")). This allows
      for a very tiny time window where a race can happen with a concurrent
      fsync using the fast code path, as the direct IO write path creates first
      a new extent map (no longer flagged as a prealloc extent) and then it
      creates the ordered extent, while the fast fsync path first collects
      ordered extents and then it collects extent maps. This allows for the
      possibility of the fast fsync path to collect the new extent map without
      collecting the new ordered extent, and therefore logging an extent item
      based on the extent map without waiting for the ordered extent to be
      created and complete. This can result in a situation where after a log
      replay we end up with an extent not marked anymore as prealloc but it was
      only partially written (or not written at all), exposing random, stale or
      garbage data corresponding to the unwritten pages and without any
      checksums in the csum tree covering the extent's range.
      This is an extension of what was done in commit de0ee0ed ("Btrfs: fix
      race between fsync and lockless direct IO writes").
      So fix this by creating first the ordered extent and then the extent
      map, so that if the fast fsync path collects the new extent
      map it also collects the corresponding ordered extent.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix number of transaction units for renames with whiteout · 5062af35
      Filipe Manana authored
      When we do a rename with the whiteout flag, we need to create the whiteout
      inode, which in the worst case requires 5 transaction units (1 inode item,
      1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
      number of transaction units from 11 to 16 if the whiteout flag is set.
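      The arithmetic above can be written out as a short sketch. The helper
      name and the TOY_ constants are invented for illustration; only the
      numbers (11 base units, 5 extra for the whiteout inode) come from the
      message.

```c
#include <assert.h>

/* Illustrative flag bit for this sketch; real code uses RENAME_WHITEOUT. */
#define TOY_RENAME_WHITEOUT (1u << 2)

enum {
    TOY_RENAME_BASE_UNITS = 11,  /* units reserved for a plain rename */
    TOY_WHITEOUT_INODE_ITEM = 1,
    TOY_WHITEOUT_INODE_REF  = 1,
    TOY_WHITEOUT_DIR_ITEMS  = 2,
    TOY_WHITEOUT_XATTR      = 1, /* worst case: selinux enabled */
    TOY_WHITEOUT_UNITS = TOY_WHITEOUT_INODE_ITEM + TOY_WHITEOUT_INODE_REF +
                         TOY_WHITEOUT_DIR_ITEMS + TOY_WHITEOUT_XATTR
};

/* The whiteout inode costs 5 units in the worst case. */
static_assert(TOY_WHITEOUT_UNITS == 5, "whiteout inode worst case is 5 units");

/* Number of transaction units to reserve for a rename with these flags. */
static unsigned toy_rename_trans_units(unsigned flags)
{
    unsigned units = TOY_RENAME_BASE_UNITS;

    if (flags & TOY_RENAME_WHITEOUT)
        units += TOY_WHITEOUT_UNITS;
    return units;
}
```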
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: pin logs earlier when doing a rename exchange operation · 376e5a57
      Filipe Manana authored
      The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
      which had a race fixed by my previous patch titled "Btrfs: pin log earlier
      when renaming", and so it suffers from the same problem.
      We pin the logs of the affected roots after we insert the new inode
      references, leaving a time window where concurrent tasks logging the
      inodes can end up logging both the new and old references, resulting
      in log trees that when replayed can turn the metadata into inconsistent
      states. This behaviour was added to btrfs_rename() in 2009 without any
      explanation of why the logs were not pinned earlier, just leaving a
      comment about the possibility of the race. As of today it's perfectly
      safe and sane to pin the logs before we start doing any of the steps
      involved in the rename operation.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin logs if rename exchange operation fails · 86e8aa0e
      Filipe Manana authored
      If rename exchange operations fail at some point after we pinned any of
      the logs, we end up aborting the current transaction but never unpin the
      logs, which leaves concurrent tasks that are trying to sync the logs (as
      part of an fsync request from user space) blocked forever and prevents
      the filesystem from being unmounted.
      Fix this by safely unpinning the log.
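      The pattern of pinning before any step runs and releasing the pin on
      every exit path can be sketched as follows. This is a toy model with
      invented names and an injected failure; it is not the rename code.

```c
#include <assert.h>
#include <errno.h>

/* Toy log with a pin counter; tasks syncing the log wait while it is > 0. */
struct toy_log {
    int pin_count;
};

static void toy_pin_log(struct toy_log *log)
{
    log->pin_count++;
}

static void toy_unpin_log(struct toy_log *log)
{
    assert(log->pin_count > 0);
    log->pin_count--;
}

/*
 * Run a three-step operation with the log pinned.
 * fail_at_step: 0 means no failure, 1..3 injects -EIO at that step.
 * The pin is taken before the first step and dropped on success and on
 * failure alike, so no waiter is left blocked.
 */
static int toy_rename(struct toy_log *log, int fail_at_step)
{
    int ret = 0;

    toy_pin_log(log);
    for (int step = 1; step <= 3; step++) {
        if (step == fail_at_step) {
            ret = -EIO;
            break;
        }
    }
    toy_unpin_log(log);
    return ret;
}
```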
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix inode leak on failure to setup whiteout inode in rename · c9901618
      Filipe Manana authored
      If we failed to fully set up the whiteout inode during a rename operation
      with the whiteout flag, we ended up leaking the inode, not decrementing
      its link count nor removing all its items from the fs/subvol tree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT · cdd1fedf
      Dan Fuhry authored
      Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
      behavior in the renameat2() syscall. This behavior is primarily used by
      overlayfs. This patch adds support for these flags to btrfs, enabling it to
      be used as a fully functional upper layer for overlayfs.
      RENAME_EXCHANGE support was written by Davide Italiano and originally
      submitted on 2 April 2015.
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Dan Fuhry <dfuhry@datto.com>
      [ remove unlikely ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: pin log earlier when renaming · c4aba954
      Filipe Manana authored
      We were pinning the log right after the first step in the rename operation
      (inserting inode ref for the new name in the destination directory)
      instead of doing it before. This behaviour was introduced in 2009 for some
      reason that was mentioned neither in the changelog nor in any comment,
      with the drawback of a small time window where concurrent log writers can
      end up logging the new inode reference for the inode we are renaming while
      the rename operation is in progress (so that we can end up with a log
      containing both the new and old references). As of today there's no reason
      to not pin the log before that first step anymore, so just fix this.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin log if rename operation fails · 3dc9e8f7
      Filipe Manana authored
      If rename operations fail at some point after we pinned the log, we end
      up aborting the current transaction but never unpin the log, which leaves
      concurrent tasks that are trying to sync the log (as part of an fsync
      request from user space) blocked forever and prevents the filesystem
      from being unmounted.
      Fix this by safely unpinning the log.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Filipe Manana authored
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      range).
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
              CPU 1                                       CPU 2
       btrfs_relocate_block_group(bg X)
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                         locks range in the io tree
                                                           --> extent allocated
                                                               from bg X
         btrfs_inc_block_group_ro(bg X)
             --> does nothing, no delalloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
           --> does not find the inode's root in the
               list fs_info->ordered_roots
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
         relocate_block_group(rc->stage ==
           iterates the extent tree, using its
           commit root and moves extents into new
                                                            --> now an ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
         relocate_block_group(rc->stage ==
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
           sees the extent in the extent tree
      When this happens the relocation produces the following warning when it
      finishes:
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  14. 09 May, 2016 1 commit
  15. 01 May, 2016 1 commit
  16. 29 Apr, 2016 2 commits
  17. 28 Apr, 2016 1 commit
  18. 04 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      This promise never materialized, and it is unlikely that it ever will.
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      Let's stop pretending that pages in page cache are special.  They are
      not.
      The changes are pretty straight-forward:
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - page_cache_get() -> get_page();
       - page_cache_release() -> put_page();
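      Why the shift expressions collapse to plain <foo> can be seen in a small
      sketch. The TOY_ names and the value 12 are stand-ins chosen for this
      example; the only assumption taken from the message is that the page
      cache constants equal the page constants.

```c
#include <assert.h>

/* Illustrative values; the real page shift is architecture dependent. */
#define TOY_PAGE_SHIFT        12
#define TOY_PAGE_SIZE         (1UL << TOY_PAGE_SHIFT)
/* Assumption from the commit message: the page cache names equal the page ones. */
#define TOY_PAGE_CACHE_SHIFT  TOY_PAGE_SHIFT
#define TOY_PAGE_CACHE_SIZE   TOY_PAGE_SIZE

/* The shift distance in the old expressions is always zero. */
static_assert(TOY_PAGE_CACHE_SHIFT - TOY_PAGE_SHIFT == 0,
              "page cache shift equals page shift");

/* Old style: scale an index by the (zero) difference of the two shifts. */
static unsigned long toy_old_index(unsigned long idx)
{
    return idx << (TOY_PAGE_CACHE_SHIFT - TOY_PAGE_SHIFT);
}

/* New style: the index is used as it is. */
static unsigned long toy_new_index(unsigned long idx)
{
    return idx;
}
```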
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      virtual patch
      expression E;
      + E
      expression E;
      + E
      + PAGE_SHIFT
      + PAGE_SIZE
      + PAGE_MASK
      expression E;
      + PAGE_ALIGN(E)
      expression E;
      - page_cache_get(E)
      + get_page(E)
      expression E;
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>