1. 09 Nov, 2013 3 commits
    • b61625d2
    • RCU'd vfsmounts · 48a066e7
      Al Viro authored

      * RCU-delayed freeing of vfsmounts
      * vfsmount_lock replaced with a seqlock (mount_lock)
      * sequence number from mount_lock is stored in nameidata->m_seq and
      used when we exit RCU mode
      * new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
      caller knows that vfsmount will have no surviving references.
      * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
      and doing pending mntput().
      * new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
      number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
      again to close the race and either returns success or drops the reference it
      has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
      simply decrement the refcount and sod off - aforementioned synchronize_rcu()
      makes sure that final mntput() won't come until we leave RCU mode.  We need
      that, since we don't want to end up with some lazy pathwalk racing with
      umount() and stealing the final mntput() from it - caller of umount() may
      expect it to return only once the fs is shut down and we don't want to break
      that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
      full-blown mntput() in case of mount_lock sequence number mismatch happening
      just as we'd grabbed the reference, but in those cases we won't be stealing
      the final mntput() from anything that would care.
      * mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
      SMP and UP cases are handled the same way - no ifdefs there.
      * normal pathname resolution does *not* do any writes to mount_lock.  It does,
      of course, bump the refcounts of vfsmount and dentry in the very end, but that's
      it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
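      A minimal sketch of the legitimize_mnt() logic described above, close to
      the helper this commit adds to fs/namespace.c (details trimmed):

          static bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
          {
                  struct mount *mnt;

                  if (read_seqretry(&mount_lock, seq))
                          return false;           /* lost the race before we started */
                  mnt = real_mount(bastard);
                  mnt_add_count(mnt, 1);          /* speculative reference */
                  if (likely(!read_seqretry(&mount_lock, seq)))
                          return true;            /* nothing changed: the ref is good */
                  if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
                          mnt_add_count(mnt, -1); /* synchronize_rcu() keeps mnt alive */
                          return false;
                  }
                  rcu_read_unlock();              /* a full mntput() may sleep... */
                  mntput(bastard);                /* ...but can't be the final one here */
                  rcu_read_lock();
                  return false;
          }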
    • switch shrink_dcache_for_umount() to use of d_walk() · 42c32608
      Al Viro authored

      We have too many iterators in fs/dcache.c...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 25 Oct, 2013 2 commits
  3. 15 Sep, 2013 1 commit
  4. 14 Sep, 2013 1 commit
    • vfs: fix dentry LRU list handling and nr_dentry_unused accounting · 89dc77bc
      Linus Torvalds authored

      The LRU list changes interacted badly with our nr_dentry_unused
      accounting, and even worse with the new DCACHE_LRU_LIST bit logic.
      
      This introduces helper functions to make sure everything follows the
      proper dcache d_lru list rules: the dentry cache is complicated by the
      fact that some of the hotpaths don't even want to look at the LRU list
      at all, and the fact that we use the same list entry in the dentry for
      both the LRU list and for our temporary shrinking lists when removing
      things from the LRU.
      
      The helper functions temporarily have some extra sanity checking for the
      flag bits that have to match the current LRU state of the dentry.  We'll
      remove that before the final 3.12 release, but considering how easy it
      is to get wrong, this first cleanup version has some very particular
      sanity checking.
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
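      A sketch of two of the helpers in question (names per the commit; the
      temporary sanity checks are the WARN_ON_ONCE flag tests mentioned above):

          #define D_FLAG_VERIFY(dentry, x) \
                  WARN_ON_ONCE(((dentry)->d_flags & \
                          (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))

          static void d_lru_add(struct dentry *dentry)
          {
                  D_FLAG_VERIFY(dentry, 0);               /* must be on no list */
                  dentry->d_flags |= DCACHE_LRU_LIST;
                  this_cpu_inc(nr_dentry_unused);
                  WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru,
                                             &dentry->d_lru));
          }

          static void d_lru_del(struct dentry *dentry)
          {
                  D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST); /* must be on the LRU */
                  dentry->d_flags &= ~DCACHE_LRU_LIST;
                  this_cpu_dec(nr_dentry_unused);
                  WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru,
                                             &dentry->d_lru));
          }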
  5. 12 Sep, 2013 6 commits
    • vfs: make d_path() get the root path under RCU · 68f0d9d9
      Linus Torvalds authored
      This avoids the spinlocks and refcounts in the d_path() sequence too
      (used by /proc and various other entities).  See commit 8b19e341 for
      the equivalent getcwd() system call path.
      
      And unlike getcwd(), d_path() doesn't copy the result to user space, so
      I don't need to fear _that_ particular bug happening again.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: use __getname/__putname for getcwd() system call · 3272c544
      Linus Torvalds authored

      It's a pathname.  It should use the pathname allocators and
      deallocators, and PATH_MAX instead of PAGE_SIZE.  Never mind that the
      two are commonly the same.
      
      With this, the allocations scale up nicely too, and I can do getcwd()
      system calls at a rate of about 300M/s, with no lock contention
      anywhere.
      
      Of course, nobody sane does that, especially since getcwd() is
      traditionally a very slow operation in Unix.  But this was also the
      simplest way to benchmark the prepend_path() improvements by Waiman, and
      once I saw the profiles I couldn't leave it well enough alone.
      
      But apart from being a performance improvement (from using per-cpu slab
      allocators instead of the raw page allocator), it's actually a valid and
      real cleanup.
      Signed-off-by: Linus "OCD" Torvalds <torvalds@linux-foundation.org>
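      The allocation pattern after this change, as a trimmed sketch of the
      getcwd() path (error handling elided):

          char *page = __getname();   /* PATH_MAX bytes from the names_cache slab */

          if (!page)
                  return -ENOMEM;
          /* ... build the cwd string into the PATH_MAX-sized buffer ... */
          __putname(page);            /* instead of __get_free_page()/free_page() */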
    • vfs: don't copy things to user space holding the rcu readlock · ff812d72
      Linus Torvalds authored
      Oops.  That wasn't very smart.  We don't actually need the RCU lock any
      more by the time we copy the cwd string to user space, but I had
      stupidly surrounded the whole thing with it.
      
      Introduced by commit 8b19e341 ("vfs: make getcwd() get the root and
      pwd path under rcu")
      
      Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>
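      The fixed shape, sketched schematically (variable names are illustrative,
      not the exact sys_getcwd() code): do the lockless walk under the RCU read
      lock, but drop it before touching user memory, since copy_to_user() may
      fault and sleep:

          rcu_read_lock();
          error = prepend_path(&pwd, &root, &cwd, &buflen);  /* no sleeping here */
          rcu_read_unlock();

          if (error >= 0 && copy_to_user(buf, cwd, len))     /* may fault/sleep */
                  error = -EFAULT;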
    • vfs: make getcwd() get the root and pwd path under rcu · 8b19e341
      Linus Torvalds authored

      This allows us to skip all the crazy spinlocks and reference count
      updates, and instead use the fs sequence read-lock to get an atomic
      snapshot of the root and cwd information.
      
      We might want to make the rule that "prepend_path()" is always called
      with the RCU lock held, but the RCU lock nests fine and this is the
      minimal fix.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
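      A sketch of the resulting snapshot helper: fs_struct carries a seqcount
      (fs->seq), so readers can loop instead of taking fs->lock or bumping
      path refcounts:

          static void get_fs_root_and_pwd_rcu(struct fs_struct *fs,
                                              struct path *root, struct path *pwd)
          {
                  unsigned seq;

                  do {
                          seq = read_seqcount_begin(&fs->seq);
                          *root = fs->root;   /* no path_get(): caller stays in RCU */
                          *pwd = fs->pwd;
                  } while (read_seqcount_retry(&fs->seq, seq));
          }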
    • vfs: move get_fs_root_and_pwd() to single caller · 5762482f
      Linus Torvalds authored

      Let's not pollute the include files with inline functions that are only
      used in a single place.  Especially not if we decide we might want to
      change the semantics of said function to make it more efficient...
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dcache: get/release read lock in read_seqbegin_or_lock() & friend · 18129977
      Waiman Long authored

      This patch modifies read_seqbegin_or_lock() and need_seqretry() to use
      newly introduced read_seqlock_excl() and read_sequnlock_excl()
      primitives so that they won't change the sequence number even if they
      fall back to take the lock.  This is OK as no change to the protected
      data structure is being made.
      
      It will prevent one fallback to lock taking from cascading into a
      series of lock takings, which would reduce performance because of the
      sequence number change.  It will also allow other sequence readers to
      go forward while an exclusive reader lock is taken.
      
      This patch also updates some of the inaccurate comments in the code.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      To: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
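      The modified helpers, sketched close to include/linux/seqlock.h: an even
      seq means "optimistic reader", an odd one means "the exclusive read lock
      is held", and the fallback no longer bumps the sequence number:

          static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
          {
                  if (!(*seq & 1))            /* even: lockless pass */
                          *seq = read_seqbegin(lock);
                  else                        /* odd: lock, sequence unchanged */
                          read_seqlock_excl(lock);
          }

          static inline int need_seqretry(seqlock_t *lock, int seq)
          {
                  return !(seq & 1) && read_seqretry(lock, seq);
          }

          static inline void done_seqretry(seqlock_t *lock, int seq)
          {
                  if (seq & 1)
                          read_sequnlock_excl(lock);
          }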
  6. 10 Sep, 2013 8 commits
    • fs: convert inode and dentry shrinking to be node aware · 9b17c623
      Dave Chinner authored

      Now that the shrinker is passing a node in the scan control structure, we
      can pass this to the generic LRU list code to isolate reclaim to the
      lists on matching nodes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
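      A trimmed sketch of what a node-aware callout looks like: the
      shrink_control carries a node id (sc->nid) that is handed to the
      per-node LRU lists (the real super_cache_count() also counts fs-private
      objects and scales the result):

          static unsigned long super_cache_count(struct shrinker *shrink,
                                                 struct shrink_control *sc)
          {
                  struct super_block *sb =
                          container_of(shrink, struct super_block, s_shrink);
                  unsigned long total = 0;

                  total += list_lru_count_node(&sb->s_dentry_lru, sc->nid);
                  total += list_lru_count_node(&sb->s_inode_lru, sc->nid);
                  return total;
          }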
    • list_lru: remove special case function list_lru_dispose_all. · 4e717f5c
      Glauber Costa authored

      The list_lru implementation has one function, list_lru_dispose_all, with
      only one user (the dentry code).  At first, such function appears to make
      sense because we are really not interested in the result of isolating each
      dentry separately - all of them are going away anyway.  However, its
      implementation is buggy in the following way:
      
      When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
      marking them with DCACHE_SHRINK_LIST.  However, this is done without the
      nlru->lock taken.  The immediate result of that is that someone else may
      add or remove the dentry from the LRU at the same time.  When list_lru_del
      happens in that scenario we will see an element that is not yet marked
      with DCACHE_SHRINK_LIST (even though it will be in the future) and
      obviously remove it from an lru where the element no longer is.  Since
      list_lru_dispose_all will in effect count down nlru's nr_items and
      list_lru_del will do the same, this will lead to an imbalance.
      
      The solution for this would not be so simple: we can obviously just keep
      the lru_lock taken, but then we have no guarantees that we will be able to
      acquire the dentry lock (dentry->d_lock).  To properly solve this, we need
      a communication mechanism between the lru and dentry code, so they can
      coordinate this with each other.
      
      Such mechanism already exists in the form of the list_lru_walk_cb
      callback.  So it is possible to construct a dcache-side prune function
      that does the right thing only by calling list_lru_walk in a loop until no
      more dentries are available.
      
      With only one user, plus the fact that a sane solution for the problem
      would involve bouncing between dcache and list_lru anyway, I see little
      justification to keep the special case list_lru_dispose_all in tree.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
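      The dcache-side replacement described above, close to the resulting
      shrink_dcache_sb(): just call list_lru_walk() in a loop until the LRU
      runs dry:

          void shrink_dcache_sb(struct super_block *sb)
          {
                  long freed;

                  do {
                          LIST_HEAD(dispose);

                          freed = list_lru_walk(&sb->s_dentry_lru,
                                                dentry_lru_isolate_shrink,
                                                &dispose, UINT_MAX);
                          this_cpu_sub(nr_dentry_unused, freed);
                          shrink_dentry_list(&dispose);
                  } while (freed > 0);
          }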
    • dcache: convert to use new lru list infrastructure · f6041567
      Dave Chinner authored

      [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shrinker: convert superblock shrinkers to new API · 0a234c6d
      Dave Chinner authored

      Convert superblock shrinker to use the new count/scan API, and propagate
      the API changes through to the filesystem callouts.  The filesystem
      callouts already use a count/scan API, so it's just changing counters to
      longs to match the VM API.
      
      This requires the dentry and inode shrinker callouts to be converted to
      the count/scan API.  This is mainly a mechanical change.
      
      [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
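      The shape of the new API, sketched from include/linux/shrinker.h: the
      old ->shrink() callback is split into a count/scan pair, both returning
      unsigned long:

          struct shrinker {
                  unsigned long (*count_objects)(struct shrinker *,
                                                 struct shrink_control *sc);
                  unsigned long (*scan_objects)(struct shrinker *,
                                                struct shrink_control *sc);
                  int seeks;      /* seeks to recreate an object */
                  long batch;     /* reclaim batch size, 0 = default */
                  /* ... internal state used by register_shrinker() ... */
          };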
    • dcache: remove dentries from LRU before putting on dispose list · dd1f6b2e
      Dave Chinner authored

      One of the big problems with modifying the way the dcache shrinker and LRU
      implementation works is that the LRU is abused in several ways.  One of
      these is shrink_dentry_list().
      
      Basically, we can move a dentry off the LRU onto a different list without
      doing any accounting changes, and then use dentry_lru_prune() to remove it
      from what-ever list it is now on to do the LRU accounting at that point.
      
      This makes it -really hard- to change the LRU implementation.  The use of
      the per-sb LRU lock serialises movement of the dentries between the
      different lists and the removal of them, and this is the only reason that
      it works.  If we want to break up the dentry LRU lock and lists into, say,
      per-node lists, we remove the only serialisation that allows this lru
      list/dispose list abuse to work.
      
      To make this work effectively, the dispose list has to be isolated from
      the LRU list - dentries have to be removed from the LRU *before* being
      placed on the dispose list.  This means that the LRU accounting and
      isolation is completed before disposal is started, and that means we can
      change the LRU implementation freely in future.
      
      This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
      they are placed on the dispose list so that we don't think that parent
      dentries found in try_prune_one_dentry() are on the LRU when they are
      actually on the dispose list.  This would result in accounting the dentry
      to the LRU a second time.  Hence dentry_lru_del() has to handle the
      DCACHE_SHRINK_LIST case.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
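      A sketch of the dentry_lru_del() case called out above: a dentry marked
      DCACHE_SHRINK_LIST sits on a private dispose list whose LRU accounting
      was already done at isolation time, so only the list linkage is undone:

          static void dentry_lru_del(struct dentry *dentry)
          {
                  if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                          /* on a dispose list, not the LRU: unlink only */
                          list_del_init(&dentry->d_lru);
                          dentry->d_flags &= ~DCACHE_SHRINK_LIST;
                          return;
                  }
                  if (!list_empty(&dentry->d_lru))
                          d_lru_del(dentry);      /* LRU proper: full accounting */
          }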
    • dentry: move to per-sb LRU locks · 19156840
      Dave Chinner authored

      With the dentry LRUs being per-sb structures, there is no real need for
      a global dentry_lru_lock. The locking can be made more fine-grained by
      moving to a per-sb LRU lock, isolating the LRU operations of different
      filesystems completely from each other. The need for this is independent
      of any performance consideration that may arise: in the interest of
      abstracting the lru operations away, it is mandatory that each lru works
      around its own lock instead of a global lock for all of them.
      
      [glommer@openvz.org: updated changelog]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: convert dentry_stat.nr_unused to per-cpu counters · 62d36c77
      Dave Chinner authored

      Before we split up the dcache_lru_lock, the unused dentry counter needs to
      be made independent of the global dcache_lru_lock.  Convert it to per-cpu
      counters to do this.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
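      The conversion in a nutshell (close to the resulting fs/dcache.c code):
      updates use cheap this_cpu_inc()/this_cpu_dec() on the local CPU, and
      the counter is only summed when /proc/sys/fs/dentry-state is read:

          static DEFINE_PER_CPU(long, nr_dentry_unused);

          static long get_nr_dentry_unused(void)
          {
                  int i;
                  long sum = 0;

                  for_each_possible_cpu(i)
                          sum += per_cpu(nr_dentry_unused, i);
                  return sum < 0 ? 0 : sum;   /* races can drive the sum negative */
          }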
    • fs: bump inode and dentry counters to long · 3942c07c
      Glauber Costa authored
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
      
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
      
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      change.
      
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      
      
      
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      
      This patch:
      
      There are situations in very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when umounting a filesystem, where every live object will eventually be
      discarded.
      
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch just bumps the counters from int to long.  Machines where it
      matters should have a 64-bit long anyway.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 09 Sep, 2013 2 commits
    • split read_seqretry_or_unlock(), convert d_walk() to resulting primitives · 48f5ec21
      Al Viro authored

      Separate "check if we need to retry" from "unlock if we are done and
      had seq_writelock"; that allows to use these guys in d_walk(), where
      we need to recheck every time we ascend back to parent, but do *not*
      want to unlock until the very end.  Lift rcu_read_lock/rcu_read_unlock
      out into callers.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: Translating dentry into pathname without taking rename_lock · 232d2d60
      Waiman Long authored

      When running the AIM7's short workload, Linus' lockref patch eliminated
      most of the spinlock contention. However, there were still some left:
      
           8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                       |--42.21%-- d_path
                       |          proc_pid_readlink
                       |          SyS_readlinkat
                       |          SyS_readlink
                       |          system_call
                       |          __GI___readlink
                       |
                       |--40.97%-- sys_getcwd
                       |          system_call
                       |          __getcwd
      
      The big one here is the rename_lock (seqlock) contention in d_path()
      and the getcwd system call. This patch will eliminate the need to take
      the rename_lock while translating dentries into the full pathnames.
      
      The rename_lock is taken to make sure that no rename operation can be
      ongoing while the translation is in progress.  However, only one thread
      can take the rename_lock at a time, blocking all the other threads that
      need it, even though the translation process won't make any change to
      the dentries.
      
      This patch will replace the writer's write_seqlock/write_sequnlock
      sequence of the rename_lock of the callers of the prepend_path() and
      __dentry_path() functions with the reader's read_seqbegin/read_seqretry
      sequence within these 2 functions. As a result, the code will have to
      retry if one or more rename operations had been performed. In addition,
      RCU read lock will be taken during the translation process to make sure
      that no dentries will go away. To prevent live-lock from happening,
      the code will switch back to take the rename_lock if read_seqretry()
      fails three times.
      
      To further reduce spinlock contention, this patch does not take the
      dentry's d_lock when copying the filename from the dentries.  Instead,
      it treats the name pointer and length as unreliable and just copies
      the string byte by byte until it hits a null byte or the end of the
      string as specified by the length.  This avoids stepping into an
      invalid memory address.  The error cases are left to be handled by
      the sequence number check.
      
      The following code re-factorings are also made:
      1. Move prepend('/') into prepend_name() to remove one conditional
         check.
      2. Move the global root check in prepend_path() back to the top of
         the while loop.
      
      With this patch, the _raw_spin_lock will now account for only 1.2%
      of the total CPU cycles for the short workload. This patch also has
      the effect of reducing the effect of running perf on its profile
      since the perf command itself can be a heavy user of the d_path()
      function depending on the complexity of the workload.
      
      When taking the perf profile of the high-systime workload, the amount
      of spinlock contention contributed by running perf without this patch
      was about 16%. With this patch, the spinlock contention caused by
      the running of perf will go away and we will have a more accurate
      perf profile.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
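      The byte-by-byte copy described above, close to the prepend_name() this
      commit introduces (ACCESS_ONCE() keeps the compiler from re-reading the
      unreliable pointer and length):

          static int prepend_name(char **buffer, int *buflen, struct qstr *name)
          {
                  const char *dname = ACCESS_ONCE(name->name);
                  u32 dlen = ACCESS_ONCE(name->len);
                  char *p;

                  if (*buflen < dlen + 1)
                          return -ENAMETOOLONG;
                  *buflen -= dlen + 1;
                  p = *buffer -= dlen + 1;
                  *p++ = '/';
                  while (dlen--) {
                          char c = *dname++;
                          if (!c)                 /* unreliable data: stop at NUL */
                                  break;
                          *p++ = c;
                  }
                  return 0;
          }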
  8. 08 Sep, 2013 2 commits
    • vfs: use lockref "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Linus Torvalds authored
      This simplifies the RCU to refcounting code in particular.
      
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      case.
      
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      15570086: "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: reorganize dput() memory accesses · 8aab6a27
      Linus Torvalds authored

      This is me being a bit OCD after all the dentry optimization work this
      merge window: profiles end up showing 'dput()' as a rather expensive
      operation, and there were two unrelated bad reasons for that.
      
      The first reason was reading d_lockref.count for debugging purposes,
      which touches the lockref cacheline (for reads) before we really need to.
      More importantly, the debugging test in question is _wrong_, and has
      hidden bugs.  It's true that we can only sleep when the count goes down
      to zero, but the test as-is hides the much more subtle bug that happens
      if we race with somebody else deleting the file.
      
      Anyway we _will_ touch that cacheline, but let's do it for a write and
      in the right routine (ie in "lockref_put_or_lock()") which annotates the
      costs better.  So remove the misleading debug code.
      
      The other was an unnecessary access to the cacheline that contains the
      d_lru list, just to check whether we already were on the LRU list or
      not.  This is exactly what we have d_flags for, so that we can avoid
      touching extra cache lines for the common case.  So just add another bit
      for "is this dentry on the LRU".
      
      Finally, mark the tests properly likely/unlikely, so that the common
      fast-paths are dense in the instruction stream.
      
      This makes the profiles look much saner.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 05 Sep, 2013 4 commits
    • vfs: check unlinked ancestors before mount · eed81007
      Miklos Szeredi authored

      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.  Nor do we
      prevent mounts to be added to the disconnected subtree using relative paths
      after the d_drop().
      
      This patch fixes these issues by checking for unlinked (unhashed, non-root)
      ancestors before proceeding with the mount.  This is done with rename
      seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
      including our dentry itself.  This ensures that only one of
      check_submounts_and_drop() and has_unlinked_ancestor() can succeed.
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: check submounts and drop atomically · 848ac114
      Miklos Szeredi authored

      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.
      
       Process A: have_submounts() -> returns false
       Process B: mount() -> success
       Process A: d_drop()
      
      This patch prepares the ground for the fix by doing the following
      operations all under the same rename lock:
      
        have_submounts()
        shrink_dcache_parent()
        d_drop()
      
      This is actually an optimization since have_submounts() and
      shrink_dcache_parent() both traverse the same dentry tree separately.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      CC: David Howells <dhowells@redhat.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: add d_walk() · db14fc3a
      Miklos Szeredi authored

      This one replaces three instances of open-coded tree walking
      (have_submounts, select_parent, d_genocide) with a common helper.
      
      In addition to slightly reducing the kernel size, this simplifies the
      callers and makes them less bug prone.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
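      The shape of the new helper (per the commit): callers provide an "enter"
      callback, an optional "finish" callback, and steer the walk through a
      small verdict enum:

          enum d_walk_ret {
                  D_WALK_CONTINUE,    /* keep walking */
                  D_WALK_QUIT,        /* abort the walk */
                  D_WALK_NORETRY,     /* don't retry on rename_lock mismatch */
                  D_WALK_SKIP,        /* don't descend into this subtree */
          };

          static void d_walk(struct dentry *parent, void *data,
                             enum d_walk_ret (*enter)(void *, struct dentry *),
                             void (*finish)(void *));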
    • vfs: restructure d_genocide() · 01ddc4ed
      Miklos Szeredi authored

      It shouldn't matter when we decrement the refcount during the walk as long
      as we do it exactly once.
      
      Restructure d_genocide() to do the killing on entering the dentry instead
      of when leaving it.  This helps in creating a common helper for tree walking.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 04 Sep, 2013 1 commit
    • vfs: call d_op->d_prune() before unhashing dentry · 590fb51f
      Yan, Zheng authored

      The d_prune dentry operation is used to notify the filesystem when the
      VFS is about to prune a hashed dentry from the dcache.  There are three
      code paths that prune dentries: shrink_dcache_for_umount_subtree(),
      prune_dcache_sb() and d_prune_aliases().  In the d_prune_aliases()
      case, the VFS unhashes the dentry first, then calls the d_prune dentry
      operation.  This confuses ceph_d_prune() (ceph uses the d_prune
      dentry operation to maintain a flag indicating whether the complete
      contents of a directory are in the dcache; pruning an unhashed dentry
      does not affect the dir's completeness).

      This patch fixes the issue by calling the d_prune dentry operation
      in d_prune_aliases() before unhashing the dentry.  It also makes the
      VFS call the d_prune dentry operation only for hashed dentries, to
      avoid calling it twice when a dentry is pruned by d_prune_aliases().
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
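      The resulting call-site rule, sketched: only notify for dentries that
      are still hashed, and do it before __d_drop() unhashes them:

          if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
                  dentry->d_op->d_prune(dentry);
          __d_drop(dentry);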
  11. 02 Sep, 2013 2 commits
    • vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Linus Torvalds authored

      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      possible.
      Acked-by: Waiman Long <waiman.long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
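      A sketch close to the helper this commit moves into fs/namei.c:
      lockref_get_or_lock() either bumps the count locklessly (returning
      nonzero) or falls back to taking d_lock, and the sampled sequence
      count is checked either way:

          static int d_rcu_to_refcount(struct dentry *dentry,
                                       seqcount_t *validate, unsigned seq)
          {
                  int gotref = lockref_get_or_lock(&dentry->d_lockref);

                  /* does the sequence number still match? */
                  if (read_seqcount_retry(validate, seq)) {
                          if (gotref)             /* got a ref: give it back */
                                  dput(dentry);
                          else                    /* got d_lock instead */
                                  spin_unlock(&dentry->d_lock);
                          return -ECHILD;
                  }

                  /* get the ref now, if we couldn't get it originally */
                  if (!gotref) {
                          dentry->d_lockref.count++;
                          spin_unlock(&dentry->d_lock);
                  }
                  return 0;
          }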
    • vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() · df3d0bbc
      Waiman Long authored

      A valid parent pointer is always going to have a non-zero reference
      count, but if we look up the parent optimistically without locking, we
      have to protect against the (very unlikely) race against renaming
      changing the parent from under us.
      
      We do that by using lockref_get_not_zero(), and then re-checking the
      parent pointer after getting a valid reference.
      
      [ This is a re-implementation of a chunk from the original patch by
        Waiman Long: "dcache: Enable lockless update of dentry's refcount".
        I've completely rewritten the patch-series and split it up, but I'm
        attributing this part to Waiman as it's close enough to his earlier
        patch  - Linus ]
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
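      The optimistic fast path, sketched close to the resulting dget_parent()
      (the code below it falls back to the locked retry loop):

          rcu_read_lock();
          ret = ACCESS_ONCE(dentry->d_parent);
          gotref = lockref_get_not_zero(&ret->d_lockref);
          rcu_read_unlock();
          if (likely(gotref)) {
                  if (likely(ret == ACCESS_ONCE(dentry->d_parent)))
                          return ret;     /* still the parent: the ref is valid */
                  dput(ret);              /* raced with rename: drop and retry */
          }
          /* ... locked slow path ... */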
  12. 29 Aug, 2013 1 commit
    • vfs: make the dentry cache use the lockref infrastructure · 98474236
      Waiman Long authored

      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
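      The structure behind the conversion, simplified from <linux/lockref.h>
      (upstream packs these into one 64-bit word so later patches can update
      them with a single cmpxchg), plus the accessor this patch adds:

          struct lockref {
                  spinlock_t lock;
                  unsigned int count;
          };

          static inline unsigned d_count(const struct dentry *dentry)
          {
                  return dentry->d_lockref.count;
          }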
  13. 24 Aug, 2013 1 commit
  14. 29 Jun, 2013 3 commits
  15. 14 Jun, 2013 1 commit
    • GFS2: Add atomic_open support · 6d4ade98
      Steven Whitehouse authored

      I've restricted atomic_open to only operate on regular files, although
      I still don't understand why atomic_open should not be possible also for
      directories on GFS2. That can always be added in later though, if it
      makes sense.
      
      The ->atomic_open function can be passed negative dentries, which
      in most cases means either ENOENT (->lookup) or a call to d_instantiate
      (->create). In the GFS2 case though, we need to actually perform the
      look up, since we do not know whether there has been a new inode created
      on another node.  The lookup calls d_splice_alias, which then tries to
      rehash the dentry - so the solution here is to simply check for that
      in d_splice_alias.  The same issue is likely to affect any other cluster
      filesystem implementing ->atomic_open.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "J. Bruce Fields" <bfields fieldses org>
      Cc: Jeff Layton <jlayton@redhat.com>
  16. 04 May, 2013 2 commits