1. 22 Sep, 2016 1 commit
  2. 05 Jul, 2016 1 commit
    • vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman authored
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      Which leads to a rule that is very simple to implement and understand:
      inodes whose i_uid or i_gid is not valid may not be written.
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directory's sgid bit is set but the
      directory's gid is not mapped.
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includes
      the inode->i_uid or inode->i_gid, not knowing the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
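The rule described in this commit can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: every `sketch_` / `SKETCH_` name is hypothetical, and `SKETCH_INVALID_ID` merely stands in for INVALID_UID and INVALID_GID. It shows that a plain timestamp update is refused while an id is unmapped, and that a chown to valid values makes the inode ordinary again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's INVALID_UID / INVALID_GID. */
#define SKETCH_INVALID_ID UINT32_MAX

/* Hypothetical attribute flags, loosely modeled on ATTR_UID and friends. */
#define SKETCH_ATTR_UID   (1u << 0)
#define SKETCH_ATTR_GID   (1u << 1)
#define SKETCH_ATTR_MTIME (1u << 2)

struct sketch_inode {
    uint32_t uid;
    uint32_t gid;
};

struct sketch_iattr {
    unsigned int valid;   /* which of the fields below are being changed */
    uint32_t uid;
    uint32_t gid;
};

static bool sketch_id_valid(uint32_t id)
{
    return id != SKETCH_INVALID_ID;
}

/* With an unmapped id, access times are left alone, as if S_NOATIME were set. */
static bool sketch_may_update_atime(const struct sketch_inode *inode)
{
    return sketch_id_valid(inode->uid) && sketch_id_valid(inode->gid);
}

/*
 * A change is allowed only if both ids would be valid once it is applied.
 * A timestamp-only update on an unmapped inode therefore fails, while a
 * chown that supplies the missing valid id succeeds.
 */
static bool sketch_may_setattr(const struct sketch_inode *inode,
                               const struct sketch_iattr *attr)
{
    uint32_t uid = (attr->valid & SKETCH_ATTR_UID) ? attr->uid : inode->uid;
    uint32_t gid = (attr->valid & SKETCH_ATTR_GID) ? attr->gid : inode->gid;

    return sketch_id_valid(uid) && sketch_id_valid(gid);
}

/* Apply an ownership change that has already passed the check above. */
static void sketch_apply_chown(struct sketch_inode *inode,
                               const struct sketch_iattr *attr)
{
    assert(sketch_may_setattr(inode, attr));
    if (attr->valid & SKETCH_ATTR_UID)
        inode->uid = attr->uid;
    if (attr->valid & SKETCH_ATTR_GID)
        inode->gid = attr->gid;
}
```

The check looks at the ownership the inode would have after the change rather than before it, which is what lets the single permitted write (the repairing chown) through.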
  3. 28 Jun, 2016 1 commit
  4. 22 Jan, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 10 Jun, 2014 1 commit
    • fs,userns: Change inode_capable to capable_wrt_inode_uidgid · 23adbe12
      Andy Lutomirski authored
      The kernel has no concept of capabilities with respect to inodes; inodes
      exist independently of namespaces.  For example, inode_capable(inode,
      CAP_LINUX_IMMUTABLE) would be nonsense.
      This patch changes inode_capable to check for uid and gid mappings and
      renames it to capable_wrt_inode_uidgid, which should make it more
      obvious what it does.
      Fixes CVE-2014-4014.
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 05 Dec, 2013 1 commit
  7. 09 Nov, 2013 1 commit
  8. 20 Nov, 2012 1 commit
  9. 07 Sep, 2012 1 commit
  10. 14 Jul, 2012 1 commit
  11. 31 May, 2012 1 commit
  12. 03 May, 2012 1 commit
  13. 29 Feb, 2012 1 commit
  14. 04 Jan, 2012 1 commit
  15. 21 Jul, 2011 2 commits
    • fs: move inode_dio_wait calls into ->setattr · 562c72aa
      Christoph Hellwig authored
      Let filesystems handle waiting for direct I/O requests themselves instead
      of doing it beforehand.  This means filesystem-specific locks to prevent
      new dio references from appearing can be held.  This is important to allow
      generalizing i_dio_count to non-DIO_LOCKING filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig authored
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      Replace it with a hand-grown construct:
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
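The counting scheme listed above can be illustrated with a small single-threaded model. This is a sketch only: the `dio_` names are hypothetical, the real counter is manipulated under i_mutex (or an equivalent lock) and uses wake_up_bit rather than a plain flag, and no actual blocking happens here. It shows that a truncate cannot proceed while requests are pending and that the last completing request is the one that wakes it.

```c
#include <assert.h>
#include <stdbool.h>

struct dio_inode {
    int dio_count;          /* pending direct I/O requests */
    bool truncate_waiting;  /* a truncate is parked until the count drops to zero */
    int wakeups;            /* how many times a parked truncate was woken */
};

/* Start a direct I/O request; the caller is assumed to hold the inode lock. */
static void dio_begin(struct dio_inode *inode)
{
    inode->dio_count++;
}

/* Finish a request; the request that brings the count to zero wakes truncate. */
static void dio_done(struct dio_inode *inode)
{
    assert(inode->dio_count > 0);
    if (--inode->dio_count == 0 && inode->truncate_waiting) {
        inode->truncate_waiting = false;
        inode->wakeups++;
    }
}

/*
 * Returns true if truncate may proceed right now.  Otherwise it records
 * that a truncate is waiting and returns false.
 */
static bool dio_truncate_try(struct dio_inode *inode)
{
    if (inode->dio_count == 0)
        return true;
    inode->truncate_waiting = true;
    return false;
}
```

Because new requests need the same lock the truncating task already holds, the count can only fall while truncate waits, which is why a single wakeup at zero is enough.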
  16. 18 Jul, 2011 1 commit
  17. 28 May, 2011 1 commit
    • Cache xattr security drop check for write v2 · 69b45732
      Andi Kleen authored
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      Why an xattr lookup on every write, I hear you ask?
      write wants to drop suid and security related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non executables) to decide
      whether to drop it or not.
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision. All xattr changes clear the flag.
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in follow-on patches.
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
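The caching idea in this commit can be sketched as follows. The names are hypothetical (`nosec_` prefix) and the model is deliberately tiny: one boolean stands in for the presence of a security xattr, and a counter records how often the expensive lookup runs. It is meant to show the shape of the optimization, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct nosec_inode {
    bool has_security_xattr;  /* what the expensive lookup would find */
    bool is_suid;             /* cheap check on the mode bits */
    bool nosec;               /* cached decision: nothing needs dropping */
    unsigned int lookups;     /* number of expensive lookups performed */
};

/* Stand-in for the costly xattr lookup (a tree walk on btrfs, per the commit). */
static bool nosec_lookup_xattr(struct nosec_inode *inode)
{
    inode->lookups++;
    return inode->has_security_xattr;
}

/* Called on every write: decide whether privileges must be dropped. */
static bool nosec_write_needs_drop(struct nosec_inode *inode)
{
    assert(inode != NULL);
    if (inode->nosec)
        return false;              /* fast path: decision already cached */

    bool needs = inode->is_suid || nosec_lookup_xattr(inode);
    if (!needs)
        inode->nosec = true;       /* remember that there is nothing to drop */
    return needs;
}

/* Any xattr change clears the cached decision. */
static void nosec_set_xattr(struct nosec_inode *inode, bool present)
{
    inode->has_security_xattr = present;
    inode->nosec = false;
}
```

Only the negative result is cached, so a file that does carry the xattr still takes the slow path until the xattr is dropped.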
  18. 31 Mar, 2011 1 commit
  19. 24 Mar, 2011 1 commit
  20. 09 Aug, 2010 4 commits
    • check ATTR_SIZE constraints in inode_change_ok · 2c27c65e
      Christoph Hellwig authored
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • remove inode_setattr · 1025774c
      Christoph Hellwig authored
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed trimming down the opencoded variant.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • default to simple_setattr · eef2380c
      Christoph Hellwig authored
      With the new truncate sequence every filesystem that wants to support file
      size changes on disk needs to implement its own ->setattr.  So instead
      of calling inode_setattr which supports size changes call into a simple
      method that doesn't support this.  simple_setattr is almost what we
      want except that it does not mark the inode dirty after changes.  Given
      that marking the inode dirty is a no-op for the simple in-memory filesystems
      that currently use simple_setattr, just add the mark_inode_dirty call.
      Also add a WARN_ON for the presence of a truncate method to simple_setattr
      to catch new instances of it during the transition period.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • rename generic_setattr · 6a1a90ad
      Christoph Hellwig authored
      Despite its name it's not a generic implementation of ->setattr, but
      rather a helper to copy attributes from a struct iattr to the inode.
      Rename it to setattr_copy to reflect this fact.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 28 May, 2010 1 commit
    • fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de authored
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go away.
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
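The ordering problem this commit describes, and the fix, can be sketched in userspace C. All `trunc_` names are hypothetical and the size limit and error codes are illustrative assumptions. The point is the order of operations: validate the new size, let the filesystem do its own work while the old size is still visible, and only then change the size.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct trunc_inode {
    long long size;
    long long max_size;   /* hypothetical per-filesystem size limit */
    /* Filesystem-specific step (e.g. freeing blocks); may fail. */
    int (*free_blocks)(struct trunc_inode *inode,
                       long long old_size, long long new_size);
};

/* Validate the requested size before anything is modified. */
static int trunc_newsize_ok(const struct trunc_inode *inode, long long new_size)
{
    if (new_size < 0)
        return -EINVAL;
    if (new_size > inode->max_size)
        return -EFBIG;
    return 0;
}

/* New-style sequence: the size changes only after every step has succeeded. */
static int trunc_setattr_size(struct trunc_inode *inode, long long new_size)
{
    int err = trunc_newsize_ok(inode, new_size);
    if (err)
        return err;

    if (inode->free_blocks) {
        err = inode->free_blocks(inode, inode->size, new_size);
        if (err)
            return err;            /* size is still the old value */
    }

    inode->size = new_size;
    return 0;
}

/* Example hook that always fails, to demonstrate clean error propagation. */
static int trunc_failing_hook(struct trunc_inode *inode,
                              long long old_size, long long new_size)
{
    (void)new_size;
    assert(inode->size == old_size);  /* the hook sees the unchanged size */
    return -EIO;
}
```

Under the old setattr > vmtruncate > truncate order the filesystem hook would have run after the size had already changed, so it could neither fail the call nor learn the previous size.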
  22. 06 Mar, 2010 1 commit
  23. 04 Mar, 2010 1 commit
  24. 24 Sep, 2009 1 commit
  25. 26 Mar, 2009 1 commit
  26. 13 Nov, 2008 1 commit
  27. 23 Oct, 2008 1 commit
  28. 27 Jul, 2008 2 commits
    • [patch 4/4] vfs: immutable inode checking cleanup · beb29e05
      Miklos Szeredi authored
      Move the immutable and append-only checks from chmod, chown and utimes
      into notify_change().  Checks for immutable and append-only files are
      always performed by the VFS and not by the filesystem (see
      permission() and may_...() in namei.c), so these belong in
      notify_change(), and not in inode_change_ok().
      This should be completely equivalent.
      CC: Ulrich Drepper <drepper@redhat.com>
      CC: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • [patch 1/4] vfs: utimes: move owner check into inode_change_ok() · 9767d749
      Miklos Szeredi authored
      Add a new ia_valid flag: ATTR_TIMES_SET, to handle the
      cases where neither ATTR_MTIME_SET nor ATTR_ATIME_SET is in the flags, yet
      the POSIX draft specifies that permission checking is performed the
      same way as if one or both of the times was explicitly set to a timestamp.
      See the patch "vfs: utimensat(): fix error checking for
      {UTIME_NOW,UTIME_OMIT} case" by Michael Kerrisk for the patch
      introducing this behavior.
      This is a cleanup, as well as allowing filesystems (NFS/fuse/...) to
      perform their own permission checking instead of the default.
      CC: Ulrich Drepper <drepper@redhat.com>
      CC: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  29. 18 Oct, 2007 1 commit
    • VFS: make notify_change pass ATTR_KILL_S*ID to setattr operations · 6de0ec00
      Jeff Layton authored
      When an unprivileged process attempts to modify a file that has the setuid or
      setgid bits set, the VFS will attempt to clear these bits.  The VFS will set
      the ATTR_KILL_SUID or ATTR_KILL_SGID bits in the ia_valid mask, and then call
      notify_change to clear these bits and set the mode accordingly.
      With a networked filesystem (NFS and CIFS in particular but likely others),
      the client machine or process may not have credentials that allow for setting
      the mode.  In some situations, this can lead to file corruption, an operation
      failing outright because the setattr fails, or to races that lead to a mode
      change being reverted.
      In this situation, we'd like to just leave the handling of this to the server
      and ignore these bits.  The problem is that by the time the setattr op is
      called, the VFS has already reinterpreted the ATTR_KILL_* bits into a mode
      change.  The setattr operation has no way to know its intent.
      The following patch fixes this by making notify_change no longer clear the
      ATTR_KILL_SUID and ATTR_KILL_SGID bits in the ia_valid before handing it off
      to the setattr inode op.  setattr can then check for the presence of these
      bits, and if they're set it can assume that the mode change was only for the
      purposes of clearing these bits.
      This means that we now have an implicit assumption that notify_change is never
      called with ATTR_MODE and either ATTR_KILL_S*ID bit set.  Nothing currently
      enforces that, so this patch also adds a BUG() if that occurs.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Cc: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Steven French <sfrench@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
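The behaviour described here can be sketched as a userspace model. The `kill_` / `KILL_` names are hypothetical, and the condition that setgid is only cleared when group-execute is also set is an assumption carried over from how the kernel treats setgid elsewhere; the commit message itself does not spell it out. The sketch shows the kill bits being translated into a mode change while remaining set, so that a setattr implementation can tell an implicit mode change from an explicit chmod.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical flags modeled on ATTR_MODE, ATTR_KILL_SUID, ATTR_KILL_SGID. */
#define KILL_ATTR_MODE      (1u << 0)
#define KILL_ATTR_KILL_SUID (1u << 1)
#define KILL_ATTR_KILL_SGID (1u << 2)
#define KILL_ATTR_KILL_BITS (KILL_ATTR_KILL_SUID | KILL_ATTR_KILL_SGID)

struct kill_iattr {
    unsigned int valid;  /* which changes are requested */
    mode_t mode;         /* new mode, meaningful when KILL_ATTR_MODE is set */
};

/* Translate kill requests into a mode change, leaving the kill bits set. */
static void kill_prepare(mode_t inode_mode, struct kill_iattr *attr)
{
    const mode_t sgid_exec = (mode_t)(S_ISGID | S_IXGRP);

    /* Callers must not combine an explicit mode with a kill request.
     * The commit adds a BUG() for this; the sketch uses assert(). */
    assert(!((attr->valid & KILL_ATTR_MODE) &&
             (attr->valid & KILL_ATTR_KILL_BITS)));

    if ((attr->valid & KILL_ATTR_KILL_SUID) && (inode_mode & S_ISUID)) {
        attr->valid |= KILL_ATTR_MODE;
        attr->mode = inode_mode & ~(mode_t)S_ISUID;
    }
    if ((attr->valid & KILL_ATTR_KILL_SGID) &&
        (inode_mode & sgid_exec) == sgid_exec) {
        if (!(attr->valid & KILL_ATTR_MODE)) {
            attr->valid |= KILL_ATTR_MODE;
            attr->mode = inode_mode;
        }
        attr->mode &= ~(mode_t)S_ISGID;
    }
}

/* What a setattr implementation (e.g. a network filesystem) can now test. */
static bool kill_mode_change_is_implicit(const struct kill_iattr *attr)
{
    return (attr->valid & KILL_ATTR_MODE) &&
           (attr->valid & KILL_ATTR_KILL_BITS);
}
```

A network filesystem could use the second helper to skip sending the mode change and leave the clearing of the bits to the server, which is the case the commit is written for.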
  30. 17 Oct, 2007 1 commit
    • Implement file posix capabilities · b5376771
      Serge E. Hallyn authored
      Implement file posix capabilities.  This allows programs to be given a
      subset of root's powers regardless of who runs them, without having to use
      setuid and giving the binary all of root's powers.
      This version works with Kaigai Kohei's userspace tools, found at
      http://www.kaigai.gr.jp/index.php.  For more information on how to use this
      patch, Chris Friedhoff has posted a nice page at
      	Nov 27:
      	Incorporate fixes from Andrew Morton
      	(security-introduce-file-caps-tweaks and
      	Fix Kconfig dependency.
      	Fix change signaling behavior when file caps are not compiled in.
      	Nov 13:
      	Integrate comments from Alexey: Remove CONFIG_ ifdef from
      	capability.h, and use %zd for printing a size_t.
      	Nov 13:
      	Fix endianness warnings by sparse as suggested by Alexey
      	Nov 09:
      	Address warnings of unused variables at cap_bprm_set_security
      	when file capabilities are disabled, and simultaneously clean
      	up the code a little, by pulling the new code into a helper
      	Nov 08:
      	For pointers to required userspace tools and how to use
      	them, see http://www.friedhoff.org/fscaps.html
      	Nov 07:
      	Fix the calculation of the highest bit checked in
      	Nov 07:
      	Allow file caps to be enabled without CONFIG_SECURITY, since
      	capabilities are the default.
      	Hook cap_task_setscheduler when !CONFIG_SECURITY.
      	Move capable(TASK_KILL) to end of cap_task_kill to reduce
      	audit messages.
      	Nov 05:
      	Add secondary calls in selinux/hooks.c to task_setioprio and
      	task_setscheduler so that selinux and capabilities with file
      	cap support can be stacked.
      	Sep 05:
      	As Seth Arnold points out, uid checks are out of place
      	for capability code.
      	Sep 01:
      	Define task_setscheduler, task_setioprio, cap_task_kill, and
      	task_setnice to make sure a user cannot affect a process in which
      	they called a program with some fscaps.
      	One remaining question is the note under task_setscheduler: are we
      	ok with CAP_SYS_NICE being sufficient to confine a process to a
      	It is a semantic change, as without fscaps, attach_task doesn't
      	allow CAP_SYS_NICE to override the uid equivalence check.  But since
      	it uses security_task_setscheduler, which elsewhere is used where
      	CAP_SYS_NICE can be used to override the uid equivalence check,
      	fixing it might be tough.
      		 note: this also controls cpuset:attach_task.  Are we ok with
      		     CAP_SYS_NICE being used to confine to a cpuset?
      		 sys_setpriority uses this (through set_one_prio) for another
      		 process.  Need same checks as setrlimit
      	Aug 21:
      	Updated secureexec implementation to reflect the fact that
      	euid and uid might be the same and nonzero, but the process
      	might still have elevated caps.
      	Aug 15:
      	Handle endianness of xattrs.
      	Enforce capability version match between kernel and disk.
      	Enforce that no bits beyond the known max capability are
      	set, else return -EPERM.
      	With this extra processing, it may be worth reconsidering
      	doing all the work at bprm_set_security rather than
      	Aug 10:
      	Always call getxattr at bprm_set_security, rather than
      	caching it at d_instantiate.
      [morgan@kernel.org: file-caps clean up for linux/capability.h]
      [bunk@kernel.org: unexport cap_inode_killpriv]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. 17 Jul, 2007 1 commit
    • Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid check · 3bd858ab
      Satyam Sharma authored
      Introduce is_owner_or_cap() macro in fs.h, and convert over relevant
      users to it. This is done because we want to avoid bugs in the future
      where we check for only effective fsuid of the current task against a
      file's owning uid, without simultaneously checking for CAP_FOWNER as
      well, thus violating its semantics.
      [ XFS uses special macros and structures, and in general looked ...
      untouchable, so we leave it alone -- but it has been looked over. ]
      The (current->fsuid != inode->i_uid) check in generic_permission() and
      exec_permission_lite() is left alone, because those operations are
      covered by CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH. Similarly operations
      falling under the purview of CAP_CHOWN and CAP_LEASE are also left alone.
      Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Acked-by: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
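The check that is_owner_or_cap() bundles together can be written out as a small userspace model. The structs and the `owner_` names are hypothetical, and a boolean stands in for a real CAP_FOWNER capability test. The sketch shows why the two conditions belong in one helper: testing the fsuid alone would wrongly refuse a task that holds the capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, minimal views of a task and an inode. */
struct owner_task {
    uid_t fsuid;       /* filesystem uid used for access decisions */
    bool cap_fowner;   /* stands in for a CAP_FOWNER capability check */
};

struct owner_inode {
    uid_t uid;         /* owning uid of the file */
};

/* True if the task owns the file or is allowed to act as its owner. */
static bool owner_is_owner_or_cap(const struct owner_task *task,
                                  const struct owner_inode *inode)
{
    assert(task != NULL && inode != NULL);
    return task->fsuid == inode->uid || task->cap_fowner;
}
```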
  32. 08 May, 2007 1 commit
  33. 12 Jan, 2006 1 commit
  34. 11 Jan, 2006 1 commit
  35. 09 Jan, 2006 1 commit
    • [PATCH] Fix some problems with truncate and mtime semantics. · 4a30131e
      NeilBrown authored
      SUS requires that when truncating a file to the size that it currently is:
        truncate and ftruncate should NOT modify ctime or mtime
        O_TRUNC SHOULD modify ctime and mtime.
      Currently mtime and ctime are always modified on most local
      filesystems (side effect of ->truncate) or never modified (on NFS).
      With this patch:
        ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when
          an update of these times is required whether size changes or not
          (via a new argument to do_truncate).  This allows NFS to do
          the right thing for O_TRUNC.
        inode_setattr no longer forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE
          sets the size to its current value.  This allows local filesystems
          to do the right thing for f?truncate.
      Also, the logic in inode_setattr is changed a bit so there are two return
      points.  One returns the error from vmtruncate if it failed, the other
      returns 0 (there can be no other failure).
      Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change
      requested, we now fall-through and mark_inode_dirty.  If a filesystem did
      not have a ->truncate function, then vmtruncate will have changed i_size,
      without marking the inode as 'dirty', and I think this is wrong.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>