1. 08 Apr, 2013 4 commits
    • GFS2: Remove vestigial parameter ip from function rs_deltree · 20095218
      Bob Peterson authored
      
      
      The functions that delete block reservations from the rgrp block
      reservations rbtree no longer use the ip parameter. This patch
      eliminates the parameter.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Use gfs2_dinode_out() in the inode create path · 79ba7480
      Steven Whitehouse authored
      
      
      Over the previous two patches relating to inode creation, the
      content of init_dinode() has been looking more and more like
      gfs2_dinode_out(). This is not an accident! This patch replaces
      the parts of init_dinode() which are duplicated in gfs2_dinode_out()
      with a call to that function.
      
      Mostly that is straightforward, but there is one issue which needed
      to be resolved relating to the link count. The link count has to be
      set to zero in a certain error handling code path, which ends up
      calling iput(). This is now done specifically in that code path,
      allowing the link count to be set earlier and written into the
      on-disk inode by gfs2_dinode_out() in the normal way.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Remove gfs2_refresh_inode from inode creation path · 28fb3027
      Steven Whitehouse authored
      
      
      The original method for creating inodes used in GFS2 was to fill
      out a buffer with all the information, and then to read that
      buffer into the in-core inode using gfs2_refresh_inode().
      
      The problem with this approach is that all the inode's fields
      need to be calculated ahead of time and stored in various
      variables, making the code rather complicated.
      
      The new approach is simply to allocate the in-core inode earlier
      and fill in as many fields as possible ahead of time. These can
      then be used to initialise the on-disk representation. The
      code has been working towards the point where it is possible
      to remove gfs2_refresh_inode() because all the fields are
      correctly initialised ahead of time. We've now reached that
      milestone, and have reversed the order of setting up the in-core
      and on-disk inodes.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Clean up inode creation path · fd4b4e04
      Steven Whitehouse authored
      
      
      This patch cleans up the inode creation code path in GFS2. After the
      Orlov allocator was merged, a number of potential improvements are
      now possible, and this is a first set of these.
      
      The quota handling is now updated so that it matches the point in
      the code where the allocation takes place. This means that the one
      exception in gfs2_alloc_blocks relating to quota is now no longer
      required, and we can use the generic code everywhere.
      
      In addition the call to figure out whether we need to allocate any
      extra blocks in order to add a directory entry is moved higher up
      gfs2_create_inode. This means that if it returns an error, we
      can deal with that at a stage where it is easier to handle that case.
      The returned status cannot change during the function since we hold
      an exclusive lock on the directory.
      
      Two calls to gfs2_rindex_update have been changed to one, again at
      the top of gfs2_create_inode to simplify error handling.
      
      The time stamps are also now initialised earlier in the creation
      process. This is gradually moving towards being able to remove the
      call to gfs2_refresh_inode in gfs2_inode_create once we have all the
      fields covered.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  2. 05 Apr, 2013 1 commit
    • GFS2: Issue discards in 512b sectors · b2c87cae
      Bob Peterson authored
      
      
      This patch changes GFS2's discard issuing code so that it calls
      function sb_issue_discard rather than blkdev_issue_discard. The
      code was calling blkdev_issue_discard and specifying the correct
      sector offset and sector size, but blkdev_issue_discard expects
      these values to be in terms of 512 byte sectors, even if the native
      sector size for the device is different. Calling sb_issue_discard
      with the BLOCK size instead ensures the correct block-to-512b-sector
      translation. I verified that "minlen" is specified in blocks, so
      comparing it to a number of blocks is correct.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
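The block-to-512b-sector translation described in the commit above can be sketched in plain userspace C. This is an illustration only, not the kernel code: `blocks_to_sectors` and `struct sector_range` are hypothetical names, and `blkbits` is assumed to be log2 of the filesystem block size (12 for 4096-byte blocks).

```c
#include <stdint.h>

/* Hypothetical result type: a range expressed in 512-byte sectors. */
struct sector_range {
    uint64_t start;   /* first 512-byte sector of the range */
    uint64_t count;   /* number of 512-byte sectors in the range */
};

/*
 * Hypothetical helper: convert a range given in filesystem blocks into
 * 512-byte sectors. blkbits is log2(block size) and is assumed to be >= 9.
 */
static struct sector_range blocks_to_sectors(uint64_t block,
                                             uint64_t nr_blocks,
                                             unsigned int blkbits)
{
    struct sector_range r;
    unsigned int shift = blkbits - 9;   /* 512 == 1 << 9 */

    r.start = block << shift;
    r.count = nr_blocks << shift;
    return r;
}
```

With 4096-byte blocks, block 10 starts at sector 80 and a 4-block range covers 32 sectors. Passing block numbers where 512-byte sectors are expected would discard the wrong region, which is the class of mistake the commit addresses.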
  3. 04 Apr, 2013 4 commits
  4. 03 Apr, 2013 1 commit
    • ext4: fix big-endian bugs which could cause fs corruptions · 8cde7ad1
      Zheng Liu authored
      
      
      When an extent was zeroed out, we forgot to convert from cpu to le16.
      That could make us hit a BUG_ON when we try to write dirty pages out.  So
      fix it.
      
      [ Also fix a bug found by Dmitry Monakhov where we were missing
        le32_to_cpu() calls in the new indirect punch hole code.
      
        There are a number of other big endian warnings found by static code
        analyzers, but we'll wait for the next merge window to fix them all
        up.  These fixes are designed to be Obviously Correct by code
        inspection, and easy to demonstrate that it won't make any
        difference (and hence, won't introduce any bugs) on little endian
        architectures such as x86.  --tytso ]
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Reported-by: Christian Kujau <lists@nerdbynature.de>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
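To illustrate why a missing cpu-to-le16 conversion only bites on big-endian machines, here is a small userspace sketch. The helpers `my_cpu_to_le16` and `my_le16_to_cpu` are hypothetical stand-ins for the kernel's `cpu_to_le16()` and `le16_to_cpu()`, and `struct disk_extent` is an invented record, not ext4's extent layout.

```c
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for cpu_to_le16(): return a value whose in-memory bytes are
 * little-endian on any host. On a little-endian host this is a no-op,
 * which is why forgetting it goes unnoticed on x86.
 */
static uint16_t my_cpu_to_le16(uint16_t v)
{
    uint8_t bytes[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
    uint16_t out;

    memcpy(&out, bytes, sizeof(out));
    return out;
}

/* Stand-in for le16_to_cpu(): decode little-endian bytes to a host value. */
static uint16_t my_le16_to_cpu(uint16_t v)
{
    uint8_t bytes[2];

    memcpy(bytes, &v, sizeof(bytes));
    return (uint16_t)(bytes[0] | (bytes[1] << 8));
}

/* Hypothetical on-disk record with a little-endian length field. */
struct disk_extent {
    uint16_t len_le;
};

static void set_len(struct disk_extent *ex, uint16_t len)
{
    ex->len_le = my_cpu_to_le16(len);   /* always convert before storing */
}

static uint16_t get_len(const struct disk_extent *ex)
{
    return my_le16_to_cpu(ex->len_le);  /* always convert after loading */
}
```

Storing `len` directly into `len_le` would produce the same bytes on a little-endian host but swapped bytes on a big-endian one, so the on-disk value would be misread later.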
  5. 01 Apr, 2013 1 commit
    • loop: prevent bdev freeing while device in use · c1681bf8
      Anatol Pomozov authored
      
      
      The lifecycle of struct block_device is defined by its inode (see
      fs/block_dev.c): the block_device is allocated the first time we access
      /dev/loopXX and deallocated in bdev_destroy_inode. When we create the
      device with "losetup /dev/loopXX afile", we want that block_device to
      stay alive until we destroy the loop device with "losetup -d".
      
      But because we do not hold a reference to the /dev/loopXX inode, its
      counter goes to 0 and the inode/bdev can be destroyed at any moment.
      Usually this happens under memory pressure or when the user drops the
      inode cache (as in the test below). When loop_clr_fd() later tries to use
      the bdev, we get a use-after-free error with the following stack:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
        bd_set_size+0x10/0xa0
        loop_clr_fd+0x1f8/0x420 [loop]
        lo_ioctl+0x200/0x7e0 [loop]
        lo_compat_ioctl+0x47/0xe0 [loop]
        compat_blkdev_ioctl+0x341/0x1290
        do_filp_open+0x42/0xa0
        compat_sys_ioctl+0xc1/0xf20
        do_sys_open+0x16e/0x1d0
        sysenter_dispatch+0x7/0x1a
      
      To prevent use-after-free we need to grab the device in loop_set_fd()
      and put it later in loop_clr_fd().
      
      The issue is reproducible on current Linus head and v3.3. Here is the test:
      
        dd if=/dev/zero of=loop.file bs=1M count=1
        while [ true ]; do
          losetup /dev/loop0 loop.file
          echo 2 > /proc/sys/vm/drop_caches
          losetup -d /dev/loop0
        done
      
      [ Doing bdgrab/bdput in loop_set_fd/loop_clr_fd is safe, because every
        time we call loop_set_fd() we check that loop_device->lo_state is
        Lo_unbound and set it to Lo_bound.  If somebody tries to set_fd again,
        they will get EBUSY.  And if we try to loop_clr_fd() on an unbound loop
        device we'll get ENXIO.
      
        loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
        loop_device->lo_ctl_mutex. ]
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
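The grab-on-bind, put-on-unbind pattern and the state checks described in the commit above can be modelled in a few lines of userspace C. This is a simplified sketch under stated assumptions: `fake_bdev`, `fake_loop`, `fake_set_fd` and `fake_clr_fd` are hypothetical names, the reference count is a plain integer, and the mutex that serialises the real ioctls is left out.

```c
#include <errno.h>
#include <stddef.h>

enum lo_state { Lo_unbound, Lo_bound };

/* Hypothetical stand-in for a refcounted block device. */
struct fake_bdev {
    int refcount;
};

/* Hypothetical stand-in for the loop device state. */
struct fake_loop {
    enum lo_state state;
    struct fake_bdev *bdev;
};

/* Bind: refuse if already bound, otherwise take a reference and keep it. */
static int fake_set_fd(struct fake_loop *lo, struct fake_bdev *bdev)
{
    if (lo->state != Lo_unbound)
        return -EBUSY;
    bdev->refcount++;           /* models grabbing the device */
    lo->bdev = bdev;
    lo->state = Lo_bound;
    return 0;
}

/* Unbind: refuse if not bound, otherwise drop the reference taken at bind. */
static int fake_clr_fd(struct fake_loop *lo)
{
    if (lo->state != Lo_bound)
        return -ENXIO;
    lo->bdev->refcount--;       /* models putting the device */
    lo->bdev = NULL;
    lo->state = Lo_unbound;
    return 0;
}
```

Because the state check rejects a second bind and an unbind of an unbound device, every increment is matched by exactly one decrement, which is the balance argument the bracketed note makes.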
  6. 29 Mar, 2013 2 commits
  7. 28 Mar, 2013 8 commits
  8. 27 Mar, 2013 5 commits
    • vfs/splice: Fix missed checks in new __kernel_write() helper · 3e84f48e
      Al Viro authored
      Commit 06ae43f3 ("Don't bother with redoing rw_verify_area() from
      default_file_splice_from()") lost the checks to test existence of the
      write/aio_write methods.  My apologies ;-/
      
      Eventually, we want that in fs/splice.c side of things (no point
      repeating it for every buffer, after all), but for now this is the
      obvious minimal fix.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Eric W. Biederman authored
      
      
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: either proc and sysfs
      are mounted at their usual places, or proc and sysfs are not mounted at
      all (some form of mount namespace jail).
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Carefully propagate mounts across user namespaces · 132c94e3
      Eric W. Biederman authored
      
      
      As a matter of policy MNT_READONLY should not be changeable if the
      original mounter had more privileges than the creator of the mount
      namespace.
      
      Add the flag CL_UNPRIVILEGED to note when we are copying a mount from
      a mount namespace that requires more privileges to a mount namespace
      that requires fewer privileges.
      
      When the CL_UNPRIVILEGED flag is set, cause clone_mnt to set MNT_NO_REMOUNT
      if any of the mnt flags that should never be changed are set.
      
      This protects both mount propagation and the initial creation of a less
      privileged mount namespace.
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Add a mount flag to lock read only bind mounts · 90563b19
      Eric W. Biederman authored
      
      
      When a read-only bind mount is copied from a mount namespace in a more
      privileged user namespace to a mount namespace in a less privileged
      user namespace, it should not be possible to remove the read-only
      restriction.
      
      Add a MNT_LOCK_READONLY mount flag to indicate that a mount must
      remain read-only.
      
      CC: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
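The idea of a lock flag that pins the read-only state when a mount is copied into a less privileged namespace can be sketched as follows. This is a toy model, not the kernel implementation: the `FAKE_` flag values and the functions `fake_clone_flags` and `fake_remount` are hypothetical, and only the read-only flag is modelled.

```c
#include <errno.h>

/* Hypothetical flag values chosen for this sketch only. */
#define FAKE_MNT_READONLY       0x1u
#define FAKE_MNT_LOCK_READONLY  0x2u
#define FAKE_CL_UNPRIVILEGED    0x1u

/*
 * Copy mount flags. When copying towards a less privileged namespace,
 * a read-only mount also gets the lock flag.
 */
static unsigned int fake_clone_flags(unsigned int mnt_flags,
                                     unsigned int clone_flags)
{
    if ((clone_flags & FAKE_CL_UNPRIVILEGED) &&
        (mnt_flags & FAKE_MNT_READONLY))
        mnt_flags |= FAKE_MNT_LOCK_READONLY;
    return mnt_flags;
}

/*
 * Remount read-only or read-write. Clearing read-only is refused when
 * the lock flag is set; returns 0 on success or -EPERM.
 */
static int fake_remount(unsigned int *mnt_flags, int want_readonly)
{
    if (!want_readonly && (*mnt_flags & FAKE_MNT_LOCK_READONLY))
        return -EPERM;
    if (want_readonly)
        *mnt_flags |= FAKE_MNT_READONLY;
    else
        *mnt_flags &= ~FAKE_MNT_READONLY;
    return 0;
}
```

A mount cloned without the unprivileged flag can still be remounted read-write in this model, so the restriction only applies to copies that crossed a privilege boundary.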
    • userns: Don't allow creation if the user is chrooted · 3151527e
      Eric W. Biederman authored
      
      
      Guarantee that the policy of which files may be accessed that is
      established by setting the root directory will not be violated
      by user namespaces by verifying that the root directory points
      to the root of the mount namespace at the time of user namespace
      creation.
      
      Changing the root is a privileged operation, and as a matter of policy
      it serves to limit unprivileged processes to files below the current
      root directory.
      
      For reasons of simplicity and comprehensibility the privilege to
      change the root directory is gated solely on the CAP_SYS_CHROOT
      capability in the user namespace.  Therefore when creating a user
      namespace we must ensure that the policy of which files may be accessed
      cannot be violated by changing the root directory.
      
      Anyone who runs a process in a chroot and would like to use user
      namespaces can set up the same view of filesystems with a mount
      namespace instead.  As a result, this is not a practical
      limitation for using user namespaces.
      
      Cc: stable@vger.kernel.org
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  9. 26 Mar, 2013 3 commits
    • Nest rename_lock inside vfsmount_lock · 7ea600b5
      Al Viro authored
      
      
      ... lest we get livelocks between path_is_under() and d_path() and friends.
      
      The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
      it is possible to have thread B spin on attempt to take lock shared while thread
      A is already holding it shared, if B is on lower-numbered CPU than A and there's
      a thread C spinning on attempt to take the same lock exclusive.
      
      As a result, we need consistent ordering between vfsmount_lock (lglock) and
      rename_lock (seqlock), even though everything that takes both is going to take
      vfsmount_lock only shared.
      Spotted-by: Brad Spengler <spender@grsecurity.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • nfsd4: reject "negative" acl lengths · 64a817cf
      J. Bruce Fields authored
      
      
      Since we only enforce an upper bound, not a lower bound, a "negative"
      length can get through here.
      
      The symptom seen was a warning when we attempt a kmalloc with an
      excessive size.
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Cc: stable@kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
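The general hazard behind the commit above, a signed length checked only against an upper bound, can be shown with a generic sketch. It does not reproduce the nfsd code: `MAX_ENTRIES`, `check_signed`, `check_unsigned` and `alloc_size` are hypothetical, and the 8-byte entry size is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 170   /* hypothetical upper bound for this sketch */

/* Flawed check: a negative count is below the upper bound, so it passes. */
static int check_signed(int32_t count)
{
    return count <= MAX_ENTRIES;
}

/* Safer check: as an unsigned value, a "negative" count becomes huge. */
static int check_unsigned(uint32_t count)
{
    return count <= MAX_ENTRIES;
}

/* Size an allocator would be asked for if the count were used unchecked. */
static size_t alloc_size(int32_t count)
{
    return (size_t)count * sizeof(uint64_t);
}
```

Here `check_signed(-1)` accepts the value while `check_unsigned((uint32_t)-1)` rejects it, and `alloc_size(-1)` wraps to an enormous request, matching the excessive-size allocation symptom mentioned in the commit message.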
    • Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason authored
      
      
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fall back to
      uncompressed IO as well.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Alexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
  10. 22 Mar, 2013 2 commits
    • nfsd: fix bad offset use · e49dbbf3
      Kent Overstreet authored
      vfs_writev() updates the offset argument, but the code then passes the
      offset to vfs_fsync_range(). Since offset now points to the offset after
      what was just written, this is probably not what was intended.
      
      Introduced by face1502 ("nfsd: use vfs_fsync_range(), not O_SYNC, for
      stable writes").
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Zach Brown <zab@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
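The offset mistake described in the commit above is easy to reproduce in miniature. The sketch below is a userspace model, not the nfsd code: `fake_write` is a hypothetical stand-in for a write call that advances its position argument, and the two `write_then_sync_*` functions only compute the byte range that would be handed to a range sync.

```c
#include <stddef.h>

/* Inclusive byte range that would be passed to a range sync. */
struct sync_range {
    long long start;
    long long end;
};

/* Hypothetical write call that advances *pos past the written bytes. */
static long fake_write(long long *pos, size_t len)
{
    *pos += (long long)len;
    return (long)len;
}

/* Flawed: reuses the offset after the write has already advanced it. */
static struct sync_range write_then_sync_buggy(long long offset, size_t len)
{
    struct sync_range r;

    fake_write(&offset, len);
    r.start = offset;                        /* now points past the data */
    r.end = offset + (long long)len - 1;
    return r;
}

/* Corrected: lets the write advance a scratch copy, keeps the original. */
static struct sync_range write_then_sync_fixed(long long offset, size_t len)
{
    struct sync_range r;
    long long pos = offset;                  /* scratch position */

    fake_write(&pos, len);
    r.start = offset;                        /* start of what was written */
    r.end = offset + (long long)len - 1;
    return r;
}
```

For a 50-byte write at offset 100, the flawed version would sync bytes 150 to 199 and miss the data entirely, while the corrected version syncs bytes 100 to 149.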
    • vfs,proc: guarantee unique inodes in /proc · 51f0885e
      Linus Torvalds authored
      
      
      Dave Jones found another /proc issue with his Trinity tool: thanks to
      the namespace model, we can have multiple /proc dentries that point to
      the same inode, aliasing directories in /proc/<pid>/net/ for example.
      
      This ends up being a total disaster, because it acts like hardlinked
      directories, and causes locking problems.  We rely on the topological
      sort of the inodes pointed to by dentries, and if we have aliased
      directories, that ordering becomes unreliable.
      
      In short: don't do this.  Multiple dentries with the same (directory)
      inode is just a bad idea, and the namespace code should never have
      exposed things this way.  But we're kind of stuck with it.
      
      This solves things by just always allocating a new inode during /proc
      dentry lookup, instead of using "iget_locked()" to look up existing
      inodes by superblock and number.  That actually simplifies the code a bit,
      at the cost of potentially doing more inode [de]allocations.
      
      That said, the inode lookup wasn't free either (and did a lot of locking
      of inodes), so it is probably not that noticeable.  We could easily keep
      the old lookup model for non-directory entries, but rather than try to
      be excessively clever this just implements the minimal and simplest
      workaround for the problem.
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Analyzed-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 21 Mar, 2013 9 commits