1. 09 Aug, 2018 2 commits
    • Al Viro's avatar
      fix __legitimize_mnt()/mntput() race · 119e1ef8
      Al Viro authored
      __legitimize_mnt() has two problems - one is that in case of success
      the check of mount_lock is not ordered wrt preceding increment of
      refcount, making it possible to have successful __legitimize_mnt()
      on one CPU just before the otherwise final mntpu() on another,
      with __legitimize_mnt() not seeing mntput() taking the lock and
      mntput() not seeing the increment done by __legitimize_mnt().
      Solved by a pair of barriers.
      
      Another is that failure of __legitimize_mnt() on the second
      read_seqretry() leaves us with reference that'll need to be
      dropped by caller; however, if that races with final mntput()
      we can end up with caller dropping rcu_read_lock() and doing
      mntput() to release that reference - with the first mntput()
      having freed the damn thing just as rcu_read_lock() had been
      dropped.  Solution: in "do mntput() yourself" failure case
      grab mount_lock, check if MNT_DOOMED has been set by racing
      final mntput() that has missed our increment and if it has -
      undo the increment and treat that as "failure, caller doesn't
      need to drop anything" case.
      
      It's not easy to hit - the final mntput() has to come right
      after the first read_seqretry() in __legitimize_mnt() *and*
      manage to miss the increment done by __legitimize_mnt() before
      the second read_seqretry() in there.  The things that are almost
      impossible to hit on bare hardware are not impossible on SMP
      KVM, though...
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      119e1ef8
    • Al Viro's avatar
      fix mntput/mntput race · 9ea0a46c
      Al Viro authored
      mntput_no_expire() does the calculation of total refcount under mount_lock;
      unfortunately, the decrement (as well as all increments) are done outside
      of it, leading to false positives in the "are we dropping the last reference"
      test.  Consider the following situation:
      	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
      of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
      mntput(mnt) (called from __fput()) drops one reference, decrementing component
      	* After it has looked at component #0, the process on CPU 0 does
      mntget(), incrementing component #0, gets preempted and gets to run again -
      on CPU 69.  There it does mntput(), which drops the reference (component #69)
      and proceeds to spin on mount_lock.
      	* On CPU 42 our first mntput() finishes counting.  It observes the
      decrement of component #69, but not the increment of component #0.  As the
      result, the total it gets is not 1 as it should've been - it's 0.  At which
      point we decide that vfsmount needs to be killed and proceed to free it and
      shut the filesystem down.  However, there's still another opened file
      on that filesystem, with reference to (now freed) vfsmount, etc. and we are
      screwed.
      
      It's not a wide race, but it can be reproduced with artificial slowdown of
      the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
      
      Fix consists of moving the refcount decrement under mount_lock; the tricky
      part is that we want (and can) keep the fast case (i.e. mount that still
      has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
      mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
      before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
      a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
      be dropped.
      Reported-by: default avatarJann Horn <jannh@google.com>
      Tested-by: default avatarJann Horn <jannh@google.com>
      Fixes: 48a066e7 ("RCU'd vsfmounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      9ea0a46c
  2. 24 May, 2018 1 commit
  3. 20 Apr, 2018 2 commits
  4. 02 Apr, 2018 2 commits
  5. 10 Dec, 2017 1 commit
  6. 08 Nov, 2017 1 commit
  7. 25 Oct, 2017 1 commit
    • Mark Rutland's avatar
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland authored
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6aa7de05
  8. 17 Oct, 2017 1 commit
  9. 05 Oct, 2017 1 commit
  10. 05 Sep, 2017 1 commit
    • Miklos Szeredi's avatar
      ovl: don't allow writing ioctl on lower layer · 7c6893e3
      Miklos Szeredi authored
      Problem with ioctl() is that it's a file operation, yet often used as an
      inode operation (i.e. modify the inode despite the file being opened for
      read-only).
      
      mnt_want_write_file() is used by filesystems in such cases to get write
      access on an arbitrary open file.
      
      Since overlayfs lets filesystems do all file operations, including ioctl,
      this can lead to mnt_want_write_file() returning OK for a lower file and
      modification of that lower file.
      
      This patch prevents modification by checking if the file is from an
      overlayfs lower layer and returning EPERM in that case.
      
      Need to introduce a mnt_want_write_file_path() variant that still does the
      old thing for inode operations that can do the copy up + modification
      correctly in such cases (fchown, fsetxattr, fremovexattr).
      
      This does not address the correctness of such ioctls on overlayfs (the
      correct way would be to copy up and attempt to perform ioctl on upper
      file).
      
      In theory this could be a regression.  We very much hope that nobody is
      relying on such a hack in any sane setup.
      
      While this patch meddles in VFS code, it has no effect on non-overlayfs
      filesystems.
      Reported-by: default avatar"zhangyi (F)" <yi.zhang@huawei.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      7c6893e3
  11. 28 Aug, 2017 1 commit
  12. 17 Jul, 2017 2 commits
    • David Howells's avatar
      VFS: Differentiate mount flags (MS_*) from internal superblock flags · e462ec50
      David Howells authored
      Differentiate the MS_* flags passed to mount(2) from the internal flags set
      in the super_block's s_flags.  s_flags are now called SB_*, with the names
      and the values for the moment mirroring the MS_* flags that they're
      equivalent to.
      
      In this patch, just the headers are altered and some kernel code where
      blind automated conversion isn't necessarily correct.
      
      Note that this shows up some interesting issues:
      
       (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
           MNT_NODEV) without passing this on to the filesystem, but some
           filesystems set such flags anyway.
      
       (2) The ->remount_fs() methods of some filesystems adjust the *flags
           argument by setting MS_* flags in it, such as MS_NOATIME - but these
           flags are then scrubbed by do_remount_sb() (only the occupants of
           MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
           MS_I_VERSION and MS_LAZYTIME)
      
      I'm not sure what's the best way to solve all these cases.
      Suggested-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      e462ec50
    • David Howells's avatar
      VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      David Howells authored
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left over excess bracketage and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      bc98a42c
  13. 11 Jul, 2017 1 commit
  14. 06 Jul, 2017 2 commits
  15. 15 Jun, 2017 1 commit
  16. 23 May, 2017 2 commits
    • Eric W. Biederman's avatar
      mnt: In propgate_umount handle visiting mounts in any order · 99b19d16
      Eric W. Biederman authored
      While investigating some poor umount performance I realized that in
      the case of overlapping mount trees where some of the mounts are locked
      the code has been failing to unmount all of the mounts it should
      have been unmounting.
      
      This failure to unmount all of the necessary
      mounts can be reproduced with:
      
      $ cat locked_mounts_test.sh
      
      mount -t tmpfs test-base /mnt
      mount --make-shared /mnt
      mkdir -p /mnt/b
      
      mount -t tmpfs test1 /mnt/b
      mount --make-shared /mnt/b
      mkdir -p /mnt/b/10
      
      mount -t tmpfs test2 /mnt/b/10
      mount --make-shared /mnt/b/10
      mkdir -p /mnt/b/10/20
      
      mount --rbind /mnt/b /mnt/b/10/20
      
      unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
      sleep 1
      umount -l /mnt/b
      wait %%
      
      $ unshare -Urm ./locked_mounts_test.sh
      
      This failure is corrected by removing the prepass that marks mounts
      that may be umounted.
      
      A first pass is added that umounts mounts if possible and if not sets
      mount mark if they could be unmounted if they weren't locked and adds
      them to a list to umount possibilities.  This first pass reconsiders
      the mounts parent if it is on the list of umount possibilities, ensuring
      that information of umoutability will pass from child to mount parent.
      
      A second pass then walks through all mounts that are umounted and processes
      their children unmounting them or marking them for reparenting.
      
      A last pass cleans up the state on the mounts that could not be umounted
      and if applicable reparents them to their first parent that remained
      mounted.
      
      While a bit longer than the old code this code is much more robust
      as it allows information to flow up from the leaves and down
      from the trunk making the order in which mounts are encountered
      in the umount propgation tree irrelevant.
      
      Cc: stable@vger.kernel.org
      Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
      Reviewed-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      99b19d16
    • Eric W. Biederman's avatar
      mnt: In umount propagation reparent in a separate pass · 570487d3
      Eric W. Biederman authored
      It was observed that in some pathlogical cases that the current code
      does not unmount everything it should.  After investigation it
      was determined that the issue is that mnt_change_mntpoint can
      can change which mounts are available to be unmounted during mount
      propagation which is wrong.
      
      The trivial reproducer is:
      $ cat ./pathological.sh
      
      mount -t tmpfs test-base /mnt
      cd /mnt
      mkdir 1 2 1/1
      mount --bind 1 1
      mount --make-shared 1
      mount --bind 1 2
      mount --bind 1/1 1/1
      mount --bind 1/1 1/1
      echo
      grep test-base /proc/self/mountinfo
      umount 1/1
      echo
      grep test-base /proc/self/mountinfo
      
      $ unshare -Urm ./pathological.sh
      
      The expected output looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      The output without the fix looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      That last mount in the output was in the propgation tree to be unmounted but
      was missed because the mnt_change_mountpoint changed it's parent before the walk
      through the mount propagation tree observed it.
      
      Cc: stable@vger.kernel.org
      Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
      Acked-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Reviewed-by: default avatarRam Pai <linuxram@us.ibm.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      570487d3
  17. 21 Apr, 2017 1 commit
  18. 10 Apr, 2017 2 commits
    • Jan Kara's avatar
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara authored
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      08991e83
    • Jan Kara's avatar
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara authored
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      9dd813c1
  19. 02 Mar, 2017 2 commits
  20. 03 Feb, 2017 1 commit
    • Eric W. Biederman's avatar
      mnt: Tuck mounts under others instead of creating shadow/side mounts. · 1064f874
      Eric W. Biederman authored
      Ever since mount propagation was introduced in cases where a mount in
      propagated to parent mount mountpoint pair that is already in use the
      code has placed the new mount behind the old mount in the mount hash
      table.
      
      This implementation detail is problematic as it allows creating
      arbitrary length mount hash chains.
      
      Furthermore it invalidates the constraint maintained elsewhere in the
      mount code that a parent mount and a mountpoint pair will have exactly
      one mount upon them.  Making it hard to deal with and to talk about
      this special case in the mount code.
      
      Modify mount propagation to notice when there is already a mount at
      the parent mount and mountpoint where a new mount is propagating to
      and place that preexisting mount on top of the new mount.
      
      Modify unmount propagation to notice when a mount that is being
      unmounted has another mount on top of it (and no other children), and
      to replace the unmounted mount with the mount on top of it.
      
      Move the MNT_UMUONT test from __lookup_mnt_last into
      __propagate_umount as that is the only call of __lookup_mnt_last where
      MNT_UMOUNT may be set on any mount visible in the mount hash table.
      
      These modifications allow:
       - __lookup_mnt_last to be removed.
       - attach_shadows to be renamed __attach_mnt and its shadow
         handling to be removed.
       - commit_tree to be simplified
       - copy_tree to be simplified
      
      The result is an easier to understand tree of mounts that does not
      allow creation of arbitrary length hash chains in the mount hash table.
      
      The result is also a very slight userspace visible difference in semantics.
      The following two cases now behave identically, where before order
      mattered:
      
      case 1: (explicit user action)
      	B is a slave of A
      	mount something on A/a , it will propagate to B/a
      	and than mount something on B/a
      
      case 2: (tucked mount)
      	B is a slave of A
      	mount something on B/a
      	and than mount something on A/a
      
      Histroically umount A/a would fail in case 1 and succeed in case 2.
      Now umount A/a succeeds in both configurations.
      
      This very small change in semantics appears if anything to be a bug
      fix to me and my survey of userspace leads me to believe that no programs
      will notice or care of this subtle semantic change.
      
      v2: Updated to mnt_change_mountpoint to not call dput or mntput
      and instead to decrement the counts directly.  It is guaranteed
      that there will be other references when mnt_change_mountpoint is
      called so this is safe.
      
      v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
          As the locking in fs/namespace.c changed between v2 and v3.
      
      v4: Reworked the logic in propagate_mount_busy and __propagate_umount
          that detects when a mount completely covers another mount.
      
      v5: Removed unnecessary tests whose result is alwasy true in
          find_topper and attach_recursive_mnt.
      
      v6: Document the user space visible semantic difference.
      
      Cc: stable@vger.kernel.org
      Fixes: b90fa9ae ("[PATCH] shared mount handling: bind and rbind")
      Tested-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      1064f874
  21. 01 Feb, 2017 1 commit
    • Eric W. Biederman's avatar
      fs: Better permission checking for submounts · 93faccbb
      Eric W. Biederman authored
      To support unprivileged users mounting filesystems two permission
      checks have to be performed: a test to see if the user allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems, to any user who
      happens to stumble across the their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately using override_creds messes up the filesystems
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take appropriate
      action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problemantic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never by able to specify it.
      
      autofs4 is modified to stop using current_real_cred that was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: default avatarSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      93faccbb
  22. 10 Jan, 2017 1 commit
    • Eric W. Biederman's avatar
      mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire.  At which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen on the same time.
      
      Kristen Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisioned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and have replaced
      new_mountpoint with get_mountpoint a function that does the hash table
      lookup and addition under the mount_lock.   The introduction of get_mounptoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Suggested-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      3895dbf8
  23. 16 Dec, 2016 3 commits
  24. 06 Dec, 2016 1 commit
  25. 05 Dec, 2016 1 commit
  26. 04 Dec, 2016 1 commit
  27. 10 Oct, 2016 1 commit
    • Emese Revfy's avatar
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy authored
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: default avatarEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      0766f788
  28. 30 Sep, 2016 1 commit
    • Eric W. Biederman's avatar
      mnt: Add a per mount namespace limit on the number of mounts · d2921684
      Eric W. Biederman authored
      CAI Qian <caiqian@redhat.com> pointed out that the semantics
      of shared subtrees make it possible to create an exponentially
      increasing number of mounts in a mount namespace.
      
          mkdir /tmp/1 /tmp/2
          mount --make-rshared /
          for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done
      
      Will create create 2^20 or 1048576 mounts, which is a practical problem
      as some people have managed to hit this by accident.
      
      As such CVE-2016-6213 was assigned.
      
      Ian Kent <raven@themaw.net> described the situation for autofs users
      as follows:
      
      > The number of mounts for direct mount maps is usually not very large because of
      > the way they are implemented, large direct mount maps can have performance
      > problems. There can be anywhere from a few (likely case a few hundred) to less
      > than 10000, plus mounts that have been triggered and not yet expired.
      >
      > Indirect mounts have one autofs mount at the root plus the number of mounts that
      > have been triggered and not yet expired.
      >
      > The number of autofs indirect map entries can range from a few to the common
      > case of several thousand and in rare cases up to between 30000 and 50000. I've
      > not heard of people with maps larger than 50000 entries.
      >
      > The larger the number of map entries the greater the possibility for a large
      > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
      > more active mounts.
      
      So I am setting the default number of mounts allowed per mount
      namespace at 100,000.  This is more than enough for any use case I
      know of, but small enough to quickly stop an exponential increase
      in mounts.  Which should be perfect to catch misconfigurations and
      malfunctioning programs.
      
      For anyone who needs a higher limit this can be changed by writing
      to the new /proc/sys/fs/mount-max sysctl.
      Tested-by: default avatarCAI Qian <caiqian@redhat.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      d2921684
  29. 23 Sep, 2016 1 commit
  30. 22 Sep, 2016 1 commit