1. 22 Jan, 2014 4 commits
  2. 05 Nov, 2013 4 commits
  3. 25 Oct, 2013 2 commits
  4. 01 Oct, 2013 13 commits
    • cuse: add fix minor number to /dev/cuse · cb2ffb26
      Tom Gundersen authored
      This allows udev (or more recently systemd-tmpfiles) to create /dev/cuse on
      boot, in the same way as /dev/fuse is currently created, and the corresponding
      module to be loaded on first access.
      
      The corresponding functionality was introduced for fuse in commit 578454ff.
      Signed-off-by: Tom Gundersen <teg@jklm.no>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: writepage: skip already in flight · ff17be08
      Miklos Szeredi authored
      
      
      If ->writepage() tries to write back a page whose copy is still in flight,
      then just skip by calling redirty_page_for_writepage().
      
      This is OK, since now ->writepage() should never be called for data
      integrity sync.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: writepages: handle same page rewrites · 8b284dc4
      Miklos Szeredi authored
      
      
      As Maxim Patlasov pointed out, it's possible to get a dirty page while its
      copy is still under writeback, despite fuse_page_mkwrite() doing its thing
      (direct IO).
      
      This could result in two concurrent write requests for the same offset, with
      data corruption if they get mixed up.
      
      To prevent this, fuse needs to check and delay such writes.  This
      implementation does this by:
      
       1. check if page is still under writeout, if so create a new, single page
          secondary request for it
      
       2. chain this secondary request onto the in-flight request
      
       2/a. if a secondary request for the same offset was already chained to the
          in-flight request, then just copy the contents of the page and discard
          the new secondary request.  This makes sure that each page will
          have at most two requests associated with it
      
       3. when the in-flight request finishes, send off all secondary requests
          chained onto it
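The chaining scheme in steps 1 to 3 can be sketched as a small userspace C model. This is an illustration under stated assumptions, not the kernel code: `struct model_req`, `model_req_new()`, `model_queue_rewrite()` and `model_complete()` are hypothetical names, the in-flight request is assumed to cover a single page, and an already-chained secondary is refreshed in place rather than allocating a new request and discarding it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16 /* tiny "page" so the model stays readable */

/* Hypothetical stand-in for a single-page write request. */
struct model_req {
    unsigned long index;          /* page offset this request covers */
    char data[MODEL_PAGE_SIZE];   /* private copy of the page contents */
    struct model_req *secondary;  /* chained request for the same index */
};

/* Allocate a request holding a copy of the page. Returns NULL on failure. */
static struct model_req *model_req_new(unsigned long index, const char *page)
{
    struct model_req *req = calloc(1, sizeof(*req));

    if (req) {
        req->index = index;
        memcpy(req->data, page, MODEL_PAGE_SIZE);
    }
    return req;
}

/*
 * Steps 1, 2 and 2/a: the page was dirtied again while `inflight` still
 * carries its previous copy.
 * Returns 1 if a new secondary request was chained, 0 if an existing
 * secondary was refreshed with the newer contents, -1 on allocation failure.
 */
static int model_queue_rewrite(struct model_req *inflight, const char *page)
{
    if (inflight->secondary) {
        /* 2/a: never more than two requests per page, just update the copy */
        memcpy(inflight->secondary->data, page, MODEL_PAGE_SIZE);
        return 0;
    }
    inflight->secondary = model_req_new(inflight->index, page);
    return inflight->secondary ? 1 : -1;
}

/*
 * Step 3: the in-flight request finished. Free it and hand back the
 * secondary request that must be sent next (NULL if none was chained).
 */
static struct model_req *model_complete(struct model_req *inflight)
{
    struct model_req *next = inflight->secondary;

    free(inflight);
    return next;
}
```

In this model a caller would keep one in-flight request per page index, call `model_queue_rewrite()` for every rewrite that arrives before completion, and send whatever `model_complete()` returns.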
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: writepages: fix aggregation · 1e112a48
      Miklos Szeredi authored
      
      
      Checking against tmp-page indexes is not very useful, and results in one
      (or rarely two) page requests.  Which is not much of an improvement...
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: fix race in fuse_writepages() · 2d033eaa
      Maxim Patlasov authored
      
      
      The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
      
      1) A user makes a page dirty via mmap-ed write.
      2) The user performs shrinking truncate(2) intended to purge the page.
      3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
         writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request,
         then releases the original page by end_page_writeback and unlocks it.
      4) fuse_do_setattr completes and successfully returns. From now on, i_mutex
         is free.
      5) Ordinary write(2) extends i_size back to cover the page. Note that
         fuse_send_write_pages does wait for fuse writeback, but for another
         page->index.
      6) fuse_writepages_fill attaches more pages to the request (if any), then
         fuse_writepages_send is eventually called. It is supposed to crop
         inarg->size of the request, but it doesn't because i_size has already been
         extended back.
      
      Moving end_page_writeback behind fuse_writepages_send guarantees that
      __fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size
      of the request before write(2) gets the chance to extend i_size.
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: Implement writepages callback · 26d614df
      Pavel Emelyanov authored
      
      
      The .writepages one is required to make each writeback request carry more than
      one page on it. The patch enables optimized behaviour unconditionally,
      i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.
      
      [SzM: simplify, add comments]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: don't BUG on no write file · 72523425
      Miklos Szeredi authored
      
      
      Don't bug if no writable file is found for page writeback.  If ever
      this is triggered, a WARN_ON helps debugging it much better than a BUG_ON.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: lock page in mkwrite · cca24370
      Miklos Szeredi authored
      
      
      Lock the page in fuse_page_mkwrite() to protect against a race with
      fuse_writepage() where the page is redirtied before the actual writeback
      begins.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: Prepare to handle multiple pages in writeback · 385b1268
      Pavel Emelyanov authored
      
      
      The .writepages callback will issue writeback requests with more than one
      page aboard. Make existing end/check code be aware of this.
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: Getting file for writeback helper · adcadfa8
      Pavel Emelyanov authored
      
      
      There will be a .writepageS callback implementation which will need to
      get a fuse_file out of a fuse_inode, thus make a helper for this.
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: no RCU mode in fuse_access() · 698fa1d1
      Miklos Szeredi authored
      
      
      fuse_access() is never called in RCU walk, only on the final component of
      access(2) and chdir(2)...
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • fuse: readdirplus: fix RCU walk · 6314efee
      Miklos Szeredi authored
      
      
      Doing dput(parent) is not valid in RCU walk mode.  In RCU mode it would
      probably be okay to update the parent flags, but it's actually not
      necessary most of the time...
      
      So only set the FUSE_I_ADVISE_RDPLUS flag on the parent when the entry was
      recently initialized by READDIRPLUS.
      
      This is achieved by setting FUSE_I_INIT_RDPLUS on entries added by
      READDIRPLUS and only dropping out of RCU mode if this flag is set.
      FUSE_I_INIT_RDPLUS is cleared once the FUSE_I_ADVISE_RDPLUS flag is set in
      the parent.
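The flag handling described above can be sketched as a userspace C model. Everything here is an assumption made for illustration: `struct model_rdplus_inode`, the `MODEL_I_*` bit values and `model_revalidate()` are hypothetical, and returning `-ECHILD` stands in for "drop out of RCU mode and retry in ref-walk mode".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits modeling FUSE_I_ADVISE_RDPLUS / FUSE_I_INIT_RDPLUS. */
#define MODEL_I_ADVISE_RDPLUS (1u << 0)
#define MODEL_I_INIT_RDPLUS   (1u << 1)

/* Hypothetical stand-in for the per-inode fuse state. */
struct model_rdplus_inode {
    unsigned int flags;
};

/*
 * Model of the revalidation path:
 *  - entry not flagged: nothing to do, safe to stay in RCU walk
 *  - entry flagged, RCU walk: ask the caller to retry outside RCU mode
 *  - entry flagged, ref walk: advise the parent, then clear the entry flag
 * Returns 1 when the entry is considered valid, -ECHILD to leave RCU mode.
 */
static int model_revalidate(struct model_rdplus_inode *entry,
                            struct model_rdplus_inode *parent,
                            bool rcu_walk)
{
    if (!(entry->flags & MODEL_I_INIT_RDPLUS))
        return 1;

    if (rcu_walk)
        return -ECHILD;

    parent->flags |= MODEL_I_ADVISE_RDPLUS;
    entry->flags &= ~MODEL_I_INIT_RDPLUS;
    return 1;
}
```

The point of the model is that the parent is only touched outside RCU mode, and only once per entry created by READDIRPLUS.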
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
    • fuse: don't check_submounts_and_drop() in RCU walk · 3c70b8ee
      Miklos Szeredi authored
      
      
      If revalidate finds an invalid dentry in RCU walk mode, let the VFS deal
      with it instead of calling check_submounts_and_drop() which is not prepared
      for being called from RCU walk.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
  5. 18 Sep, 2013 2 commits
    • fuse: fix fallocate vs. ftruncate race · 0ab08f57
      Maxim Patlasov authored
      
      
      A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed
      description of races between ftruncate and anyone who can extend i_size:
      
      > 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size
      > changed (due to our own fuse_do_setattr()) and is going to call
      > truncate_pagecache() for some  'new_size' it believes valid right now. But by
      > the time that particular truncate_pagecache() is called ...
      > 2. fuse_do_setattr() returns (either having called truncate_pagecache() or
      > not -- it doesn't matter).
      > 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
      > 4. mmap-ed write makes a page in the extended region dirty.
      
      This patch adds necessary bits to fuse_file_fallocate() to protect from that
      race.
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
    • fuse: wait for writeback in fuse_file_fallocate() · bde52788
      Maxim Patlasov authored
      
      
      The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
      
      1) A user makes a page dirty via mmap-ed write.
      2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
         and <offset, size> covering the page.
      3) Before truncate_pagecache_range call from fuse_file_fallocate,
         the page goes to write-back. The page is fully processed by fuse_writepage
         (including end_page_writeback on the page), but fuse_flush_writepages did
         nothing because fi->writectr < 0.
      4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
         by calling fuse_release_nowrite. The latter triggers processing queued
         write-back request which will write stale data to the hole soon.
      
      Changed in v2 (thanks to Brian for suggestion):
       - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
         we can end up in returning -ENOTSUPP while user data is already punched
         from page cache. Use filemap_write_and_wait_range() instead.
      Changed in v3 (thanks to Miklos for suggestion):
       - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite()
         instead. Since we need only a dirty-page barrier, fuse_sync_writes()
         should be enough.
       - rebased to for-linus branch of fuse.git
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
  6. 12 Sep, 2013 1 commit
  7. 11 Sep, 2013 1 commit
    • mm/page-writeback.c: add strictlimit feature · 5a537485
      Maxim Patlasov authored
      
      
      The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
      unprivileged users) from growing a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always checks bdi
      counters against bdi limits.  I.e.  even if global "nr_dirty" is under
      "freerun", it's not allowed to skip bdi checks.  The only use case for now
      is fuse: it sets bdi max_ratio to 1% by default and system administrators
      are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
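A heavily simplified userspace sketch of the decision described above follows. It is not the logic of balance_dirty_pages itself: `struct model_bdi`, `MODEL_BDI_CAP_STRICTLIMIT` (including its bit value) and `model_should_throttle()` are hypothetical, and all rate-limiting detail is left out. It only shows that the global freerun shortcut is unavailable once the flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value; the real flag is BDI_CAP_STRICTLIMIT. */
#define MODEL_BDI_CAP_STRICTLIMIT 0x1u

/* Hypothetical per-device state: capability bits plus dirty accounting. */
struct model_bdi {
    unsigned int capabilities;
    unsigned long nr_dirty;     /* dirty pages charged to this device */
    unsigned long dirty_limit;  /* per-device limit, in pages */
};

/*
 * Returns true when a writer to this device should be throttled.
 * Without strictlimit, being under the global freerun threshold is enough
 * to skip every other check. With strictlimit, the per-device counters
 * are always compared against the per-device limit.
 */
static bool model_should_throttle(const struct model_bdi *bdi,
                                  unsigned long global_dirty,
                                  unsigned long global_freerun)
{
    bool strict = (bdi->capabilities & MODEL_BDI_CAP_STRICTLIMIT) != 0;

    /* Classic shortcut: plenty of global headroom, nobody is throttled. */
    if (!strict && global_dirty <= global_freerun)
        return false;

    /* Strictlimit, or no global headroom: the per-device limit decides. */
    return bdi->nr_dirty > bdi->dirty_limit;
}
```

With a device at 200 dirty pages against a limit of 100, the model throttles only when the strictlimit bit is set, even though the global count is far below freerun.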
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of original page.  Hence, if userspace
      fuse daemon doesn't finalize write requests in a timely manner, an
      aggressive mmap writer can pollute virtually all memory by those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "a possibility of uncontrolled growth of the
      amount of RAM consumed by temporary pages allocated by kernel fuse to
      process writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
      added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then created a huge file on the mount point and ran a simple program which
      mmap-ed the file to a memory region, then wrote data to the region.  An
      hour later I observed almost all RAM consumed by fuse writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
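The test program itself is not included in the message. A minimal sketch of what such an mmap writer might look like is shown below; `model_mmap_fill()` is a hypothetical helper, and the path, length and fill byte are left to the caller. It sizes a file, maps it shared and dirties every page through the mapping.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) `path`, size it to `len` bytes, map it MAP_SHARED and
 * fill the whole mapping with `byte`, which dirties every page of the file
 * through the mmap path rather than through write(2).
 * Returns 0 on success, -1 on any failure.
 */
static int model_mmap_fill(const char *path, size_t len, unsigned char byte)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    void *map;
    int rc;

    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(map, byte, len);   /* the mmap-ed write */

    rc = munmap(map, len);
    close(fd);
    return rc;
}
```

Pointed at a large file on a slow fuse mount, a loop around a helper like this would be one way to generate the sustained mmap-ed writes the message describes.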
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is the write-through
      page cache policy FUSE currently uses.  I.e.  when handling write(2), kernel
      fuse populates page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes the strictlimit feature is the way to go.
      
      And finally, putting FUSE topics aside, there is one more use-case for
      strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with a huge amount of RAM installed is a well-known pain.  Let's do a simple
      computation.  Assuming 64GB of RAM installed, the existing implementation of
      balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
      dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
      /media/my-usb-storage/" may return in a few seconds, but subsequent
      "umount /media/my-usb-storage/" will take more than two hours if effective
      throughput of the storage is, say, 1MB/sec.
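The arithmetic behind these figures can be checked with two small helpers. The function names are hypothetical, and the conversion assumes 1GB = 1024MB, which the message does not state.

```c
#include <assert.h>

/* Dirty memory allowed before throttling starts, as a share of total RAM. */
static double model_freerun_gb(double ram_gb, double freerun_percent)
{
    return ram_gb * freerun_percent / 100.0;
}

/* Hours needed to flush `dirty_gb` at `mb_per_sec` (assumes 1GB = 1024MB). */
static double model_flush_hours(double dirty_gb, double mb_per_sec)
{
    double seconds = dirty_gb * 1024.0 / mb_per_sec;

    return seconds / 3600.0;
}
```

`model_freerun_gb(64, 15)` gives about 9.6, so a 9GB copy fits entirely under the threshold, and `model_flush_hours(9, 1)` gives roughly 2.5 hours (9216 seconds), consistent with "more than two hours".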
      
      After inclusion of strictlimit feature, it will be trivial to add a knob
      (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand,
      manually or via udev rule.  Maybe I'm wrong, but it seems to be quite a
      natural desire to limit the amount of dirty memory for some devices we do
      not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 05 Sep, 2013 3 commits
  9. 04 Sep, 2013 1 commit
  10. 03 Sep, 2013 4 commits
    • fuse: readdir: check for slash in names · efeb9e60
      Miklos Szeredi authored
      
      
      Userspace can add names containing a slash character to the directory
      listing.  Don't allow this as it could cause all sorts of trouble.
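A check of this kind can be sketched in userspace C as follows. `model_check_dirent_name()` is a hypothetical helper, and both the choice of `-EIO` as the error code and the rejection of empty names are assumptions, not taken from the message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Validate a directory entry name supplied by the userspace filesystem.
 * The name is length-delimited, not NUL-terminated, so memchr() is used
 * instead of strchr().
 * Returns 0 if the name is acceptable, -EIO (assumed error code) if it is
 * empty or contains a slash.
 */
static int model_check_dirent_name(const char *name, size_t namelen)
{
    if (namelen == 0)
        return -EIO;

    if (memchr(name, '/', namelen) != NULL)
        return -EIO;

    return 0;
}
```

A name such as "a/b" is rejected, while "ab" passes; entries that fail would simply not be added to the listing.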
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
    • fuse: hotfix truncate_pagecache() issue · 06a7c3c2
      Maxim Patlasov authored
      
      
      The way fuse calls truncate_pagecache() from fuse_change_attributes()
      is completely wrong, because w/o i_mutex held we are never sure whether
      'oldsize' and 'attr->size' are valid by the time of execution of
      truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we
      released fc->lock in the middle of fuse_change_attributes(), we completely
      lose control of actions which may happen with given inode until we reach
      truncate_pagecache. The list of potentially dangerous actions includes
      mmap-ed reads and writes, ftruncate(2) and write(2) extending file size.
      
      The typical outcome of doing truncate_pagecache() with outdated arguments
      is data corruption from the user's point of view. This is (in some sense)
      acceptable in cases when the issue is triggered by a change of the file on
      the server (i.e. externally wrt fuse operation), but it is absolutely
      intolerable in scenarios when a single fuse client modifies a file without
      any external intervention. A real life case I discovered by fsx-linux
      looked like this:
      
      1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends
      FUSE_SETATTR to the server synchronously, but before getting fc->lock ...
      2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP
      to the server synchronously, then calls fuse_change_attributes(). The
      latter updates i_size, releases fc->lock, but before comparing oldsize vs
      attr->size..
      3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and
      updating attributes and i_size, but now oldsize is equal to
      outarg.attr.size because i_size has just been updated (step 2). Hence,
      fuse_do_setattr() returns w/o calling truncate_pagecache().
      4. As soon as ftruncate(2) completes, the user extends file size by
      write(2) making a hole in the middle of file, then reads data from the hole
      either by read(2) or mmap-ed read. The user expects to get zero data from
      the hole, but gets stale data because truncate_pagecache() is not executed
      yet.
      
      The scenario above illustrates one side of the problem: not truncating the
      page cache even though we should. Another side corresponds to truncating
      page cache too late, when the state of inode changed significantly.
      Theoretically, the following is possible:
      
      1. As in the previous scenario fuse_dentry_revalidate() discovered that
      i_size changed (due to our own fuse_do_setattr()) and is going to call
      truncate_pagecache() for some 'new_size' it believes valid right now. But
      by the time that particular truncate_pagecache() is called ...
      2. fuse_do_setattr() returns (either having called truncate_pagecache() or
      not -- it doesn't matter).
      3. The file is extended either by write(2) or ftruncate(2) or fallocate(2).
      4. mmap-ed write makes a page in the extended region dirty.
      
      The result will be the loss of the data the user wrote in the fourth step.
      
      The patch is a hotfix resolving the issue in a simplistic way: let's skip
      dangerous i_size update and truncate_pagecache if an operation changing
      file size is in progress. This simplistic approach looks correct for the
      cases w/o external changes. And to handle them properly, more sophisticated
      and intrusive techniques (e.g. NFS-like one) would be required. I'd like to
      postpone it until the issue is well discussed on the mailing list(s).
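The idea of the hotfix can be illustrated with a small userspace model. All names are hypothetical: `struct model_size_inode` stands in for the inode, `size_unstable` for an "operation changing file size is in progress" marker, and the `truncations` counter for calls to truncate_pagecache().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inode state for the model. */
struct model_size_inode {
    uint64_t i_size;        /* cached file size */
    bool size_unstable;     /* a size-changing operation is in progress */
    unsigned int truncations; /* counts simulated truncate_pagecache() calls */
};

/*
 * Apply a size reported by the server. While a size-changing operation is
 * in flight, both the i_size update and the page cache truncation are
 * skipped, since the reported size may already be outdated.
 */
static void model_change_attributes(struct model_size_inode *inode,
                                    uint64_t new_size)
{
    uint64_t old_size;

    if (inode->size_unstable)
        return; /* skip the dangerous update entirely */

    old_size = inode->i_size;
    inode->i_size = new_size;

    if (new_size < old_size)
        inode->truncations++; /* stand-in for truncate_pagecache() */
}
```

In this model the operation that changes the size would set `size_unstable` before talking to the server and clear it only after it has updated `i_size` itself.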
      
      Changed in v2:
       - improved patch description to cover both sides of the issue.
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
    • fuse: invalidate inode attributes on xattr modification · d331a415
      Anand Avati authored
      
      
      Calls like setxattr and removexattr result in ctime being updated.
      Therefore invalidate inode attributes to force a refresh.
      Signed-off-by: Anand Avati <avati@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
    • fuse: postpone end_page_writeback() in fuse_writepage_locked() · 4a4ac4eb
      Maxim Patlasov authored
      
      
      The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
      
      1) A user makes a page dirty via mmap-ed write.
      2) The user performs shrinking truncate(2) intended to purge the page.
      3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
         writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
         the original page by end_page_writeback.
      4) fuse_do_setattr() completes and successfully returns. From now on, i_mutex
         is free.
      5) Ordinary write(2) extends i_size back to cover the page. Note that
         fuse_send_write_pages does wait for fuse writeback, but for another
         page->index.
      6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
         fuse_send_writepage is supposed to crop inarg->size of the request,
         but it doesn't because i_size has already been extended back.
      
      Moving end_page_writeback to the end of fuse_writepage_locked fixes the
      race because now the fact that truncate_pagecache has returned successfully
      implies that fuse_writepage_locked has already called end_page_writeback.
      And this, in turn, implies that fuse_flush_writepages has already called
      fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
      could not extend it because of i_mutex held by ftruncate(2).
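The cropping step that this race defeats can be modeled in a few lines of userspace C. `model_crop_write()` is a hypothetical function that only mirrors the idea of trimming a queued write to the current file size; it is not copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide how many bytes of a queued page write may still be sent, given
 * the file size at the moment the request is actually dispatched.
 * Returns the cropped size; 0 means the write was truncated away entirely
 * and the request should be dropped.
 */
static uint32_t model_crop_write(uint64_t offset, uint32_t data_size,
                                 uint64_t i_size)
{
    if (offset + data_size <= i_size)
        return data_size;                   /* fully inside the file */

    if (offset < i_size)
        return (uint32_t)(i_size - offset); /* straddles EOF: trim the tail */

    return 0;                               /* entirely beyond EOF */
}
```

For a page at offset 4096, cropping against the shrunk size (say 1000) drops the write, but cropping after write(2) has extended the file back to 8192 lets all 4096 stale bytes through. That is why the order of end_page_writeback and the size check matters.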
      Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org
  11. 27 Jul, 2013 1 commit
  12. 17 Jul, 2013 4 commits