1. 07 Nov, 2012 6 commits
    • Steven Whitehouse's avatar
      GFS2: Add Orlov allocator · 9dbe9610
      Steven Whitehouse authored
      Just like ext3, this works on the root directory and any directory
      with the +T flag set. Also, just like ext3, any subdirectory created
      in one of the just mentioned cases will be allocated to a random
      resource group (GFS2 equivalent of a block group).
      If you are creating a set of directories, each of which will contain a
      job running on a different node, then by setting +T on the parent
      directory before creating the subdirectories, each will land up in a
      different resource group, and thus resource group contention between
      nodes will be kept to a minimum.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Add test for resource group congestion status · bcd97c06
      Steven Whitehouse authored
      This patch uses information gathered by the recent glock statistics
      patch in order to derrive a boolean verdict on the congestion
      status of a resource group. This is then used when making decisions
      on which resource group to choose during block allocation.
      The aim is to avoid resource groups which are heavily contended
      by other nodes, while still ensuring locality of access wherever
      Once a reservation has been made in a particular resource group
      we continue to use that resource group until a new reservation is
      required. This should help to ensure that we do not change resource
      groups too often.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Speed up gfs2_rbm_from_block · a68a0a35
      Bob Peterson authored
      This patch is a rewrite of function gfs2_rbm_from_block. Rather than
      looping to find the right bitmap, the code now does a few simple
      math calculations.
      I compared the performance of both algorithms side by side and the new
      algorithm is noticeably faster. Sample instrumentation output from a
      "fast" machine:
      5 million calls: millisec spent: Orig: 166 New: 113
      5 million calls: millisec spent: Orig: 189 New: 114
      In addition, I ran postmark (on a somewhat slowr CPU) before the after
      the new algorithm was put in place and postmark showed a decent
      Before the new algorithm:
      	645 seconds total
      	584 seconds of transactions (171 per second)
      	150087 created (232 per second)
      		Creation alone: 100000 files (2083 per second)
      		Mixed with transactions: 50087 files (85 per second)
      	49995 read (85 per second)
      	49991 appended (85 per second)
      	150087 deleted (232 per second)
      		Deletion alone: 100174 files (7705 per second)
      		Mixed with transactions: 49913 files (85 per second)
      	273.42 megabytes read (434.08 kilobytes per second)
      	852.13 megabytes written (1.32 megabytes per second)
      With the new algorithm:
      	599 seconds total
      	530 seconds of transactions (188 per second)
      	150087 created (250 per second)
      		Creation alone: 100000 files (1886 per second)
      		Mixed with transactions: 50087 files (94 per second)
      	49995 read (94 per second)
      	49991 appended (94 per second)
      	150087 deleted (250 per second)
      		Deletion alone: 100174 files (6260 per second)
      		Mixed with transactions: 49913 files (94 per second)
      	273.42 megabytes read (467.42 kilobytes per second)
      	852.13 megabytes written (1.42 megabytes per second)
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Lukas Czerner's avatar
      GFS2: Fix FITRIM argument handling · 076f0faa
      Lukas Czerner authored
      Currently implementation in gfs2 uses FITRIM arguments as it were in
      file system blocks units which is wrong. The FITRIM arguments
      (fstrim_range.start, fstrim_range.len and fstrim_range.minlen) are
      actually in bytes.
      Moreover, check for start argument beyond the end of file system, len
      argument being smaller than file system block and minlen argument being
      bigger than biggest resource group were missing.
      This commit converts the code to convert FITRIM argument to file system
      blocks and also adds appropriate checks mentioned above.
      All the problems were recognised by xfstests 251 and 260.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Lukas Czerner's avatar
      GFS2: Require user to provide argument for FITRIM · 3a238ade
      Lukas Czerner authored
      When the fstrim_range argument is not provided by user in FITRIM ioctl
      we should just return EFAULT and not promoting bad behaviour by filling
      the structure in kernel. Let the user deal with it.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Andrew Price's avatar
      GFS2: Fix possible null pointer deref in gfs2_rs_alloc · cd0ed19f
      Andrew Price authored
      Despite the return value from kmem_cache_zalloc() being checked, the
      error wasn't being returned until after a possible null pointer
      dereference. This patch returns the error immediately, allowing the
      removal of the error variable.
      Signed-off-by: default avatarAndrew Price <anprice@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  2. 24 Sep, 2012 17 commits
    • Bob Peterson's avatar
      GFS2: Fix infinite loop in rbm_find · 3701530a
      Bob Peterson authored
      This patch fixes an infinite loop in gfs2_rbm_find that was introduced
      by the previous patch. The problem occurred when the length was less
      than 3 but the rbm block was byte-aligned, causing it to improperly
      return a extent length of zero, which caused it to spin.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Tested-by: default avatarBob Peterson <rpeterso@redhat.com>
      Tested-by: default avatarBarry Marson <bmarson@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Consolidate free block searching functions · ff7f4cb4
      Steven Whitehouse authored
      With the recently added block reservation code, an additional function
      was added to search for free blocks. This had a restriction of only being
      able to search for aligned extents of free blocks. As a result the
      allocation patterns when reserving blocks were suboptimal when the
      existing allocation of blocks for an inode was not aligned to the same
      This patch resolves that problem by adding the ability for gfs2_rbm_find
      to search for extents of a particular minimum size. We can then use
      gfs2_rbm_find for both looking for reservations, and also looking for
      free blocks on an individual basis when we actually come to do the
      allocation later on. As a result we only need a single set of code
      to deal with both situations.
      The function gfs2_rbm_from_block() is moved up rgrp.c so that it
      occurs before all of its callers.
      Many thanks are due to Bob for helping track down the final issue in
      this patch. That fix to the rb_tree traversal and to not share
      block reservations from a dirctory to its children is included here.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
    • Bob Peterson's avatar
      GFS2: Stop block extents at the end of bitmaps · 0688a5ec
      Bob Peterson authored
      This patch stops multiple block allocations if a nonzero
      return code is received from gfs2_rbm_from_block. Without
      this patch, if enough pressure is put on the file system,
      you get a kernel warning quickly followed by:
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffffa04f47e8>] gfs2_alloc_blocks+0x2c8/0x880 [gfs2]
      With this patch, things run normally.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Fix unclaimed_blocks() wrapping bug and clean up · c743ffd0
      Steven Whitehouse authored
      When rgd->rd_free_clone is less than rgd->rd_reserved, the
      unclaimed_blocks() calculation would wrap and produce
      incorrect results. This patch checks for this condition
      when this function is called from gfs2_mblk_search()
      In addition, the use of this particular function in other
      places in the code has been dropped by means of a general
      clean up of gfs2_inplace_reserve(). This function is now
      much easier to follow.
      Also the setting of the rgd->rd_last_alloc field is corrected.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Improve block reservation tracing · 9e733d39
      Steven Whitehouse authored
      This patch improves the tracing of block reservations by
      removing some corner cases and also providing more useful
      detail in the traces.
      A new field is added to the reservation structure to contain
      the inode number. This is used since in certain contexts it is
      not possible to access the inode itself to obtain this information.
      As a result we can then display the inode number for all tracepoints
      and also in case we dump the resource group.
      The "del" tracepoint operation has been removed. This could be called
      with the reservation rgrp set to NULL. That resulted in not printing
      the device number, and thus making the information largely useless
      anyway. Also, the conditional on the rgrp being NULL can then be
      removed from the tracepoint. After this change, all the block
      reservation tracepoint calls will be called with the rgrp information.
      The existing ins,clm and tdel calls to the block reservation tracepoint
      are sufficient to track the entire life of the block reservation.
      In gfs2_block_alloc() the error detection is updated to print out
      the inode number of the problematic inode. This can then be compared
      against the information in the glock dump,tracepoints, etc.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Fall back to ignoring reservations, if there are no other blocks left · 137834a6
      Steven Whitehouse authored
      When we get to the stage of allocating blocks, we know that the
      resource group in question must contain enough free blocks, otherwise
      gfs2_inplace_reserve() would have failed. So if we are left with only
      free blocks which are reserved, then we must use those. This can happen
      if another node has sneeked in and use some blocks reserved on this
      node, for example. Generally this will happen very rarely and only
      when the resouce group is nearly full.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Use rbm for gfs2_setbit() · 3e6339dd
      Steven Whitehouse authored
      Use the rbm structure for gfs2_setbit() in order to simplify the
      arguments to the function. We have to add a bool to control whether
      the clone bitmap should be updated (if it exists) but otherwise it
      is a more or less direct substitution.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Use rbm for gfs2_testbit() · c04a2ef3
      Steven Whitehouse authored
      Change the arguments to gfs2_testbit() so that it now just takes an
      rbm specifying the position of the two bit entry to return.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Eliminate unnecessary check for state > 3 in bitfit · 29c05b20
      Bob Peterson authored
      Function gfs2_bitfit was checking for state > 3, but that's
      impossible since it is only called from rgblk_search, which receives
      only GFS2_BLKST_ constants.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: rbm code cleanup · 8d8b752a
      Bob Peterson authored
      This patch fixes a few small rbm related things. First, it fixes
      a corner case where the rbm needs to switch bitmaps and wasn't
      adjusting its buffer pointer. Second, there's a white space issue
      fixed. Third, the logic in function gfs2_rbm_from_block was optimized
      a bit. Lastly, a check for goal block overflows was added to function
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Fix case where reservation finished at end of rgrp · 5d50d532
      Steven Whitehouse authored
      One corner case which the original patch failed to take into
      account was when there is a reservation which ended such that
      the following block was one beyond the end of the rgrp in
      question. This extra test fixes that case.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Reported-by: default avatarBob Peterson <rpeterso@redhat.com>
      Tested-by: default avatarBob Peterson <rpeterso@redhat.com>
    • Michel Lespinasse's avatar
      GFS2: Use RB_CLEAR_NODE() rather than rb_init_node() · 24d634e8
      Michel Lespinasse authored
      gfs2 calls RB_EMPTY_NODE() to check if nodes are not on an rbtree.
      The corresponding initialization function is RB_CLEAR_NODE().
      rb_init_node() was never clearly defined and is going away.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Update rgblk_free() to use rbm · 3b1d0b9d
      Steven Whitehouse authored
      Replace open coded version with a call to gfs2_rbm_from_block()
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Update gfs2_get_block_type() to use rbm · 3983903a
      Steven Whitehouse authored
      Use the new gfs2_rbm_from_block() function to replace an open
      coded version of the same code.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Replace rgblk_search with gfs2_rbm_find · 5b924ae2
      Steven Whitehouse authored
      This is part of a series of patches which are introducing the
      gfs2_rbm structure throughout the block allocation code. The
      main aim of this part is to create a search function which can
      deal directly with struct gfs2_rbm. In this case it specifies
      the initial position at which to start the search and also the
      point at which the search terminates.
      The net result of this is to clean up the search code and make
      it rather more readable, and the various possible exceptions which
      may occur during the search are partitioned into their own functions.
      There are some bug fixes too. We should not be checking the reservations
      while allocating extents - the time for that is when we are searching
      for where to put the extent, not when we've already made that decision.
      Also, rgblk_search had two uses, and in only one of those cases did
      it make sense to check for reservations. This is fixed in the new
      gfs2_rbm_find function, which has a cleaner interface.
      The reservation checking has been improved by always checking for
      contiguous reservations, and returning the first free block after
      all contiguous reservations. This is done under the spin lock to
      ensure consistancy of the tree.
      The allocation of extents is now in all cases done by the existing
      allocation code, and if there is an active reservation, that is updated
      after the fact. Again this is done under the spin lock, since it entails
      changing the lookup key for the reservation in question.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Add structure to contain rgrp, bitmap, offset tuple · 4a993fb1
      Steven Whitehouse authored
      This patch introduces a new structure, gfs2_rbm, which is a
      tuple of a resource group, a bitmap within the resource group
      and an offset within that bitmap. This is designed to make
      manipulating these sets of variables easier. There is also a
      new helper function which converts this representation back
      to a disk block address.
      In addition, the rbtree nodes which are used for the reservations
      were not being correctly initialised, which is now fixed. Also,
      the tracing was not passing through the inode where it should
      have been. That is mostly fixed aside from one corner case. This
      needs to be revisited since there can also be a NULL rgrp in
      some cases which results in the device being incorrect in the
      This is intended to be the first step towards cleaning up some
      of the allocation code, and some further bug fixes.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Remove rs_requested field from reservations · 71f890f7
      Steven Whitehouse authored
      The rs_requested field is left over from the original allocation
      code, however this should have been a parameter passed to the
      various functions from gfs2_inplace_reserve() and not a member of the
      reservation structure as the value is not required after the
      initial allocation.
      This also helps simplify the code since we no longer need to set
      the rs_requested to zero. Also the gfs2_inplace_release()
      function can also be simplified since the reservation structure
      will always be defined when it is called, and the only remaining
      task is to unlock the rgrp if required. It can also now be
      called unconditionally too, resulting in a further simplification.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  3. 13 Sep, 2012 1 commit
    • Steven Whitehouse's avatar
      GFS2: Take account of blockages when using reserved blocks · 62e252ee
      Steven Whitehouse authored
      The claim_reserved_blks() function was not taking account of
      the possibility of "blockages" while performing allocation.
      This can be caused by another node allocating something in
      the same extent which has been reserved locally.
      This patch tests for this condition and then skips the remainder
      of the reservation in this case. This is a relatively rare event,
      so that it should not affect the general performance improvement
      which the block reservations provide.
      The claim_reserved_blks() function also appears not to be able
      to deal with reservations which cross bitmap boundaries, but
      that can be dealt with in a future patch since we don't generate
      boundary crossing reservations currently.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Reported-by: default avatarDavid Teigland <teigland@redhat.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
  4. 19 Jul, 2012 1 commit
    • Bob Peterson's avatar
      GFS2: Reduce file fragmentation · 8e2e0047
      Bob Peterson authored
      This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
      resulting improved on disk layout greatly speeds up operations in cases
      which would have resulted in interlaced allocation of blocks previously.
      A typical example of this is 10 parallel dd processes, each writing to a
      file in a common dirctory.
      The implementation uses an rbtree of reservations attached to each
      resource group (and each inode).
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  5. 18 Jul, 2012 1 commit
  6. 14 Jun, 2012 1 commit
  7. 08 Jun, 2012 1 commit
    • Benjamin Marzinski's avatar
      GFS2: Use lvbs for storing rgrp information with mount option · 90306c41
      Benjamin Marzinski authored
      Instead of reading in the resource groups when gfs2 is checking
      for free space to allocate from, gfs2 can store the necessary infromation
      in the resource group's lvb.  Also, instead of searching for unlinked
      inodes in every resource group that's checked for free space, gfs2 can
      store the number of unlinked but inodes in the lvb, and only check for
      unlinked inodes if it will find some.
      The first time a resource group is locked, the lvb must initialized.
      Since this involves counting the unlinked inodes in the resource group,
      this takes a little extra time.  But after that, if the resource group
      is locked with GL_SKIP, the buffer head won't be read in unless it's
      actually needed.
      Enabling the resource groups lvbs is done via the rgrplvb mount option.  If
      this option isn't set, the lvbs will still be set and updated, but they won't
      be verfied or used by the filesystem.  To safely turn on this option, all of
      the nodes mounting the filesystem must be running code with this patch, and
      the filesystem must have been completely unmounted since they were updated.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  8. 06 Jun, 2012 2 commits
  9. 11 May, 2012 1 commit
    • Bob Peterson's avatar
      GFS2: Add rgrp information to block_alloc trace point · 41db1ab9
      Bob Peterson authored
      This is a second attempt at a patch that adds rgrp information to the
      block allocation trace point for GFS2. As suggested, the patch was
      modified to list the rgrp information _after_ the fields that exist today.
      Again, the reason for this patch is to allow us to trace and debug
      problems with the block reservations patch, which is still in the works.
      We can debug problems with reservations if we can see what block allocations
      result from the block reservations. It may also be handy in figuring out
      if there are problems in rgrp free space accounting. In other words,
      we can use it to track the rgrp and its free space along side the allocations
      that are taking place.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  10. 27 Apr, 2012 1 commit
  11. 24 Apr, 2012 5 commits
  12. 05 Apr, 2012 1 commit
  13. 26 Mar, 2012 1 commit
  14. 05 Mar, 2012 1 commit