1. 21 Nov, 2018 1 commit
    • Andreas Gruenbacher's avatar
      gfs2: Put bitmap buffers in put_super · 6c75d139
      Andreas Gruenbacher authored
      commit 10283ea5 upstream.
      gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
      attached to the resource group glocks.  That function should release the
      buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
      gfs2_rgrp_brelse for doing that is missing.
      When gfs2_releasepage later runs across these buffers which are still
      referenced, it refuses to free them.  This causes the pages the buffers
      are attached to to remain referenced as well.  With enough mount/unmount
      cycles, the system will eventually run out of memory.
      Fix this by adding the missing call to gfs2_rgrp_brelse in
      (Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)
      Fixes: 39b0f1e9
       ("GFS2: Don't brelse rgrp buffer_heads every allocation")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 26 Sep, 2018 1 commit
  3. 30 Aug, 2017 1 commit
  4. 09 Aug, 2017 1 commit
  5. 05 Jul, 2017 1 commit
  6. 19 Apr, 2017 1 commit
    • Bob Peterson's avatar
      GFS2: Non-recursive delete · d552a2b9
      Bob Peterson authored
      Implement truncate/delete as a non-recursive algorithm. The older
      algorithm was implemented with recursion to strip off each layer
      at a time (going by height, starting with the maximum height.
      This version tries to do the same thing but without recursion,
      and without needing to allocate new structures or lists in memory.
      For example, say you want to truncate a very large file to 1 byte,
      and its end-of-file metapath is: 0.505.463.428. The starting
      metapath would be Since it's a truncate to non-zero, it
      needs to preserve that byte, and all metadata pointing to it.
      So it would start at, look up all its metadata buffers,
      then free all data blocks pointed to at the highest level.
      After that buffer is "swept", it moves on to, then, etc., reading in buffers and sweeping them clean.
      When it gets to the end of the 0.0.0 metadata buffer (for 4K
      blocks the last valid one is, it backs up to the
      previous height and starts working on, then,
      and so forth. After it reaches the end and sweeps,
      it continues with, and so on. When that height is
      exhausted, and it reaches 0.0.508.508 it backs up another level,
      to, then, through So it has to keep
      marching backwards and forwards through the metadata until it's
      all swept clean. Once it has all the data blocks freed, it
      lowers the strip height, and begins the process all over again,
      but with one less height. This time it sweeps 0.0.0 through
      0.505.463. When that's clean, it lowers the strip height again
      and works to free 0.505. Eventually it strips the lowest height, 0.
      For a delete or truncate to 0, all metadata for all heights of would be freed. For a truncate to 1 byte, would
      be preserved.
      This isn't much different from normal integer incrementing,
      where an integer gets incremented from 0000 ( to 3021
      ( So 0000 gets increments to 0001, 0002, up to 0009,
      then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
      just that each "digit" goes from 0 to 508 (for a total of 509
      pointers) rather than from 0 to 9.
      Note that the dinode will only have 483 pointers due to the
      dinode structure itself.
      Also note: this is just an example. These numbers (509 and 483)
      are based on a standard 4K block size. Smaller block sizes will
      yield smaller numbers of indirect pointers accordingly.
      The truncation process is accomplished with the help of two
      major functions and a few helper functions.
      Functions do_strip and recursive_scan are obsolete, so removed.
      New function sweep_bh_for_rgrps cleans a buffer_head pointed to
      by the given metapath and height. By cleaning, I mean it frees
      all blocks starting at the offset passed in metapath. It starts
      at the first block in the buffer pointed to by the metapath and
      identifies its resource group (rgrp). From there it frees all
      subsequent block pointers that lie within that rgrp. If it's
      already inside a transaction, it stays within it as long as it
      can. In other words, it doesn't close a transaction until it knows
      it's freed what it can from the resource group. In this way,
      multiple buffers may be cleaned in a single transaction, as long
      as those blocks in the buffer all lie within the same rgrp.
      If it's not in a transaction, it starts one. If the buffer_head
      has references to blocks within multiple rgrps, it frees all the
      blocks inside the first rgrp it finds, then closes the
      transaction. Then it repeats the cycle: identifies the next
      unfreed block, uses it to find its rgrp, then starts a new
      transaction for that set. It repeats this process repeatedly
      until the buffer_head contains no more references to any blocks
      past the given metapath.
      Function trunc_dealloc has been reworked into a finite state
      automaton. It has basically 3 active states:
      The DEALLOC_MP_FULL state implies the metapath has a full set
      of buffers out to the "shrink height", and therefore, it can
      call function sweep_bh_for_rgrps to free the blocks within the
      highest height of the metapath. If it's just swept the lowest
      level (or an error has occurred) the state machine is ended.
      Otherwise it proceeds to the DEALLOC_MP_LOWER state.
      The DEALLOC_MP_LOWER state implies we are finished with a given
      buffer_head, which may now be released, and therefore we are
      then missing some buffer information from the metapath. So we
      need to find more buffers to read in. In most cases, this is
      just a matter of releasing the buffer_head and moving to the
      next pointer from the previous height, so it may be read in and
      swept as well. If it can't find another non-null pointer to
      process, it checks whether it's reached the end of a height
      and needs to lower the strip height, or whether it still needs
      move forward through the previous height's metadata. In this
      state, all zero-pointers are skipped. From this state, it can
      only loop around (once more backing up another height) or,
      once a valid metapath is found (one that has non-zero
      pointers), proceed to state DEALLOC_FILL_MP.
      The DEALLOC_FILL_MP state implies that we have a metapath
      but not all its buffers are read in. So we must proceed to read
      in buffer_heads until the metapath has a valid buffer for every
      height. If the previous state backed us up 3 heights, we may
      need to read in a buffer, increment the height, then repeat the
      process until buffers have been read in for all required heights.
      If it's successful reading a buffer, and it's at the highest
      height we need, it proceeds back to the DEALLOC_MP_FULL state.
      If it's unable to fill in a buffer, (encounters a hole, etc.)
      it tries to find another non-zero block pointer. If they're all
      zero, it lowers the height and returns to the DEALLOC_MP_LOWER
      state. If it finds a good non-null pointer, it loops around and
      reads it in, while keeping the metapath in lock-step with the
      pointers it examines.
      The state machine runs until the truncation request is
      satisfied. Then any transactions are ended, the quota and
      statfs data are updated, and the function is complete.
      Helper function metaptr1 was introduced to be an easy way to
      determine the start of a buffer_head's indirect pointers.
      Helper function lookup_mp_height was introduced to find a
      metapath index and read in the buffer that corresponds to it.
      In this way, function lookup_metapath becomes a simple loop to
      call it for every height.
      Helper function fillup_metapath is similar to lookup_metapath
      except it can do partial lookups. If the state machine
      backed up multiple levels (like 2999 wrapping to 3000) it
      needs to find out the next starting point and start issuing
      metadata reads at that point.
      Helper function hptrs is a shortcut to determine how many
      pointers should be expected in a buffer. Height 0 is the dinode
      which has fewer pointers than the others.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  7. 12 Jul, 2016 1 commit
    • Bob Peterson's avatar
      GFS2: Check rs_free with rd_rsspin protection · 44f52122
      Bob Peterson authored
      For the last process to close a file opened for write, function
      gfs2_rsqa_delete was deleting the file's inode's block reservation
      out of the rgrp reservations tree. Then it was checking to make sure
      rs_free was 0, but it was performing the check outside the protection
      of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
      to prevent a race between the process freeing the reservation and
      another who is allocating a new set of blocks inside the same rgrp
      for the same inode, thus changing its value.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  8. 27 Jun, 2016 1 commit
    • Andreas Gruenbacher's avatar
      gfs2: Lock holder cleanup · 6df9f9a2
      Andreas Gruenbacher authored
      Make the code more readable by cleaning up the different ways of
      initializing lock holders and checking for initialized lock holders:
      mark lock holders as uninitialized by setting the holder's glock to NULL
      (gfs2_holder_mark_uninitialized) instead of zeroing out the entire
      object or using a separate flag.  Recognize initialized holders by their
      non-NULL glock (gfs2_holder_initialized).  Don't zero out holder objects
      which are immeditiately initialized via gfs2_holder_init or
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  9. 10 Jun, 2016 1 commit
    • Bob Peterson's avatar
      GFS2: don't set rgrp gl_object until it's inserted into rgrp tree · 36e4ad03
      Bob Peterson authored
      Before this patch, function read_rindex_entry would set a rgrp
      glock's gl_object pointer to itself before inserting the rgrp into
      the rgrp rbtree. The problem is: if another process was also reading
      the rgrp in, and had already inserted its newly created rgrp, then
      the second call to read_rindex_entry would overwrite that value,
      then return a bad return code to the caller. Later, other functions
      would reference the now-freed rgrp memory by way of gl_object.
      In some cases, that could result in gfs2_rgrp_brelse being called
      twice for the same rgrp: once for the failed attempt and once for
      the "real" rgrp release. Eventually the kernel would panic.
      There are also a number of other things that could go wrong when
      a kernel module is accessing freed storage. For example, this could
      result in rgrp corruption because the fake rgrp would point to a
      fake bitmap in memory too, causing gfs2_inplace_reserve to search
      some random memory for free blocks, and find some, since we were
      never setting rgd->rd_bits to NULL before freeing it.
      This patch fixes the problem by not setting gl_object until we
      have successfully inserted the rgrp into the rbtree. Also, it sets
      rd_bits to NULL as it frees them, which will ensure any accidental
      access to the wrong rgrp will result in a kernel panic rather than
      file system corruption, which is preferred.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  10. 02 May, 2016 1 commit
  11. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      This promise never materialized.  And unlikely will.
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      Let's stop pretending that pages in page cache are special.  They are
      The changes are pretty straight-forward:
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
       - page_cache_get() -> get_page();
       - page_cache_release() -> put_page();
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      virtual patch
      expression E;
      + E
      expression E;
      + E
      + PAGE_SHIFT
      + PAGE_SIZE
      + PAGE_MASK
      expression E;
      + PAGE_ALIGN(E)
      expression E;
      - page_cache_get(E)
      + get_page(E)
      expression E;
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 18 Dec, 2015 1 commit
    • Bob Peterson's avatar
      GFS2: Always use iopen glock for gl_deletes · 5ea31bc0
      Bob Peterson authored
      Before this patch, when function try_rgrp_unlink queued a glock for
      delete_work to reclaim the space, it used the inode glock to do so.
      That's different from the iopen callback which uses the iopen glock
      for the same purpose. We should be consistent and always use the
      iopen glock. This may also save us reference counting problems with
      the inode glock, since clear_glock does an extra glock_put() for the
      inode glock.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  13. 14 Dec, 2015 1 commit
    • Bob Peterson's avatar
      GFS2: Make rgrp reservations part of the gfs2_inode structure · a097dc7e
      Bob Peterson authored
      Before this patch, multi-block reservation structures were allocated
      from a special slab. This patch folds the structure into the gfs2_inode
      structure. The disadvantage is that the gfs2_inode needs more memory,
      even when a file is opened read-only. The advantages are: (a) we don't
      need the special slab and the extra time it takes to allocate and
      deallocate from it. (b) we no longer need to worry that the structure
      exists for things like quota management. (c) This also allows us to
      remove the calls to get_write_access and put_write_access since we
      know the structure will exist.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  14. 24 Nov, 2015 1 commit
    • Bob Peterson's avatar
      GFS2: Extract quota data from reservations structure (revert 5407e242) · b54e9a0b
      Bob Peterson authored
      This patch basically reverts the majority of patch 5407e242
      That patch eliminated the gfs2_qadata structure in favor of just
      using the reservations structure. The problem with doing that is that
      it increases the size of the reservations structure. That is not an
      issue until it comes time to fold the reservations structure into the
      inode in memory so we know it's always there. By separating out the
      quota structure again, we aren't punishing the non-quota users by
      making all the inodes bigger, requiring more slab space. This patch
      creates a new slab area to allocate the quota stuff so it's managed
      a little more sanely.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  15. 16 Nov, 2015 1 commit
  16. 09 Nov, 2015 1 commit
  17. 29 Oct, 2015 1 commit
  18. 03 Sep, 2015 2 commits
  19. 19 Jun, 2015 1 commit
  20. 18 May, 2015 1 commit
  21. 05 May, 2015 1 commit
    • Abhi Das's avatar
      gfs2: handle NULL rgd in set_rgrp_preferences · 959b6717
      Abhi Das authored
      The function set_rgrp_preferences() does not handle the (rarely
      returned) NULL value from gfs2_rgrpd_get_next() and this patch
      fixes that.
      The fs image in question is only 150MB in size which allows for
      only 1 rgrp to be created. The in-memory rb tree has only 1 node
      and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
      returns NULL. (Default behavior is to wrap around the rb tree and
      return the first node to give the illusion of a circular linked
      list. In the case of only 1 rgrp, we can't have
      gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
      point to the same rgrp)... that would cause unintended consequences
      and infinite loops.)
      Signed-off-by: default avatarAbhi Das <adas@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
  22. 24 Apr, 2015 2 commits
  23. 18 Mar, 2015 1 commit
    • Abhi Das's avatar
      gfs2: allow quota_check and inplace_reserve to return available blocks · 25435e5e
      Abhi Das authored
      struct gfs2_alloc_parms is passed to gfs2_quota_check() and
      gfs2_inplace_reserve() with ap->target containing the number of
      blocks being requested for allocation in the current operation.
      We add a new field to struct gfs2_alloc_parms called 'allowed'.
      gfs2_quota_check() and gfs2_inplace_reserve() return the max
      blocks allowed by quota and the max blocks allowed by the chosen
      rgrp respectively in 'allowed'.
      A new field 'min_target', when non-zero, tells gfs2_quota_check()
      and gfs2_inplace_reserve() to not return -EDQUOT/-ENOSPC when
      there are atleast 'min_target' blocks allowable/available. The
      assumption is that the caller is ok with just 'min_target' blocks
      and will likely proceed with allocating them.
      Signed-off-by: default avatarAbhi Das <adas@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  24. 03 Nov, 2014 2 commits
  25. 03 Oct, 2014 1 commit
  26. 19 Sep, 2014 1 commit
    • Abhi Das's avatar
      GFS2: fix bad inode i_goal values during block allocation · 00a158be
      Abhi Das authored
      This patch checks if i_goal is either zero or if doesn't exist
      within any rgrp (i.e gfs2_blk2rgrpd() returns NULL). If so, it
      assigns the ip->i_no_addr block as the i_goal.
      There are two scenarios where a bad i_goal can result in a
      -EBADSLT error.
      1. Attempting to allocate to an existing inode:
      Control reaches gfs2_inplace_reserve() and ip->i_goal is bad.
      We need to fix i_goal here.
      2. A new inode is created in a directory whose i_goal is hosed:
      In this case, the parent dir's i_goal is copied onto the new
      inode. Since the new inode is not yet created, the ip->i_no_addr
      field is invalid and so, the fix in gfs2_inplace_reserve() as per
      1) won't work in this scenario. We need to catch and fix it sooner
      in the parent dir itself (gfs2_create_inode()), before it is
      copied to the new inode.
      Signed-off-by: default avatarAbhi Das <adas@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  27. 18 Jul, 2014 1 commit
  28. 14 May, 2014 1 commit
    • Benjamin Marzinski's avatar
      GFS2: remove transaction glock · 24972557
      Benjamin Marzinski authored
      GFS2 has a transaction glock, which must be grabbed for every
      transaction, whose purpose is to deal with freezing the filesystem.
      Aside from this involving a large amount of locking, it is very easy to
      make the current fsfreeze code hang on unfreezing.
      This patch rewrites how gfs2 handles freezing the filesystem. The
      transaction glock is removed. In it's place is a freeze glock, which is
      cached (but not held) in a shared state by every node in the cluster
      when the filesystem is mounted. This lock only needs to be grabbed on
      freezing, and actions which need to be safe from freezing, like
      When a node wants to freeze the filesystem, it grabs this glock
      exclusively.  When the freeze glock state changes on the nodes (either
      from shared to unlocked, or shared to exclusive), the filesystem does a
      special log flush.  gfs2_log_flush() does all the work for flushing out
      the and shutting down the incore log, and then it tries to grab the
      freeze glock in a shared state again.  Since the filesystem is stuck in
      gfs2_log_flush, no new transaction can start, and nothing can be written
      to disk. Unfreezing the filesytem simply involes dropping the freeze
      glock, allowing gfs2_log_flush() to grab and then release the shared
      lock, so it is cached for next time.
      However, in order for the unfreezing ioctl to occur, gfs2 needs to get a
      shared lock on the filesystem root directory inode to check permissions.
      If that glock has already been grabbed exclusively, fsfreeze will be
      unable to get the shared lock and unfreeze the filesystem.
      In order to allow the unfreeze, this patch makes gfs2 grab a shared lock
      on the filesystem root directory during the freeze, and hold it until it
      unfreezes the filesystem.  The functions which need to grab a shared
      lock in order to allow the unfreeze ioctl to be issued now use the lock
      grabbed by the freeze code instead.
      The freeze and unfreeze code take care to make sure that this shared
      lock will not be dropped while another process is using it.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  29. 07 Mar, 2014 1 commit
  30. 06 Mar, 2014 1 commit
  31. 10 Feb, 2014 1 commit
  32. 04 Feb, 2014 1 commit
    • Steven Whitehouse's avatar
      GFS2: Allocate block for xattr at inode alloc time, if required · b2c8b3ea
      Steven Whitehouse authored
      This is another step towards improving the allocation of xattr
      blocks at inode allocation time. Here we take advantage of
      Christoph's recent work on ACLs to allocate a block for the
      xattrs early if we know that we will be adding ACLs to the
      inode later on. The advantage of that is that it is much
      more likely that we'll get a contiguous run of two blocks
      where the first is the inode and the second is the xattr block.
      We still have to fall back to the original system in case we
      don't get the requested two contiguous blocks, or in case the
      ACLs are too large to fit into the block.
      Future patches will move more of the ACL setting code further
      up the gfs2_inode_create() function. Also, I'd like to be
      able to do the same thing with the xattrs from LSMs in
      due course, too. That way we should be able to slowly reduce
      the number of independent transactions, at least in the
      most common cases.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  33. 16 Jan, 2014 2 commits
  34. 03 Jan, 2014 3 commits
    • Steven Whitehouse's avatar
      GFS2: Use range based functions for rgrp sync/invalidation · 7005c3e4
      Steven Whitehouse authored
      Each rgrp header is represented as a single extent on disk, so we
      can calculate the position within the address space, since we are
      using address spaces mapped 1:1 to the disk. This means that it
      is possible to use the range based versions of filemap_fdatawrite/wait
      and for invalidating the page cache.
      Our eventual intent is to then be able to merge the address spaces
      used for rgrps into a single address space, rather than to have
      one for each glock, saving memory and reducing complexity.
      Since during umount, the rgrp structures are disposed of before
      the glocks, we need to store the extent information in the glock
      so that is is available for a final invalidation. This patch uses
      a field which is otherwise unused in rgrp glocks to do that, so
      that we do not have to expand the size of a glock.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Remove test which is always true · 7de41d36
      Steven Whitehouse authored
      Since gfs2_inplace_reserve() is always called with a valid
      alloc parms structure, there is no need to test for this
      within the function itself - and in any case, after we've
      all ready dereferenced it anyway.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Implement a "rgrp has no extents longer than X" scheme · 5ea5050c
      Bob Peterson authored
      With the preceding patch, we started accepting block reservations
      smaller than the ideal size, which requires a lot more parsing of the
      bitmaps. To reduce the amount of bitmap searching, this patch
      implements a scheme whereby each rgrp keeps track of the point
      at this multi-block reservations will fail.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>