1. 14 Nov, 2012 1 commit
    • David Teigland's avatar
      GFS2: skip dlm_unlock calls in unmount · fb6791d1
      David Teigland authored
      
      
      When unmounting, gfs2 does a full dlm_unlock operation on every
      cached lock.  This can create a very large amount of work and can
      take a long time to complete.  However, the vast majority of these
      dlm unlock operations are unnecessary because after all the unlocks
      are done, gfs2 leaves the dlm lockspace, which automatically clears
      the locks of the leaving node, without unlocking each one individually.
      So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
      remove the locks implicitly.  The one exception is when the lock's lvb is
      being used.  In this case, dlm_unlock is called because it may update the
      lvb of the resource.
      Signed-off-by: default avatarDavid Teigland <teigland@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      fb6791d1
  2. 07 Nov, 2012 2 commits
    • Bob Peterson's avatar
      GFS2: Rename glops go_xmote_th to go_sync · 06dfc306
      Bob Peterson authored
      
      
      [Editorial: This is a nit, but has been a minor irritation for a long time:]
      
      This patch renames glops structure item for go_xmote_th to go_sync.
      The functionality is unchanged; it's just for readability.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      06dfc306
    • Bob Peterson's avatar
      GFS2: Speed up gfs2_rbm_from_block · a68a0a35
      Bob Peterson authored
      
      
      This patch is a rewrite of function gfs2_rbm_from_block. Rather than
      looping to find the right bitmap, the code now does a few simple
      math calculations.
      
      I compared the performance of both algorithms side by side and the new
      algorithm is noticeably faster. Sample instrumentation output from a
      "fast" machine:
      
      5 million calls: millisec spent: Orig: 166 New: 113
      5 million calls: millisec spent: Orig: 189 New: 114
      
      In addition, I ran postmark (on a somewhat slowr CPU) before the after
      the new algorithm was put in place and postmark showed a decent
      improvement:
      
      Before the new algorithm:
      -------------------------
      Time:
      	645 seconds total
      	584 seconds of transactions (171 per second)
      
      Files:
      	150087 created (232 per second)
      		Creation alone: 100000 files (2083 per second)
      		Mixed with transactions: 50087 files (85 per second)
      	49995 read (85 per second)
      	49991 appended (85 per second)
      	150087 deleted (232 per second)
      		Deletion alone: 100174 files (7705 per second)
      		Mixed with transactions: 49913 files (85 per second)
      
      Data:
      	273.42 megabytes read (434.08 kilobytes per second)
      	852.13 megabytes written (1.32 megabytes per second)
      
      With the new algorithm:
      -----------------------
      Time:
      	599 seconds total
      	530 seconds of transactions (188 per second)
      
      Files:
      	150087 created (250 per second)
      		Creation alone: 100000 files (1886 per second)
      		Mixed with transactions: 50087 files (94 per second)
      	49995 read (94 per second)
      	49991 appended (94 per second)
      	150087 deleted (250 per second)
      		Deletion alone: 100174 files (6260 per second)
      		Mixed with transactions: 49913 files (94 per second)
      
      Data:
      	273.42 megabytes read (467.42 kilobytes per second)
      	852.13 megabytes written (1.42 megabytes per second)
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      a68a0a35
  3. 24 Sep, 2012 5 commits
    • Steven Whitehouse's avatar
      GFS2: Consolidate free block searching functions · ff7f4cb4
      Steven Whitehouse authored
      
      
      With the recently added block reservation code, an additional function
      was added to search for free blocks. This had a restriction of only being
      able to search for aligned extents of free blocks. As a result the
      allocation patterns when reserving blocks were suboptimal when the
      existing allocation of blocks for an inode was not aligned to the same
      boundary.
      
      This patch resolves that problem by adding the ability for gfs2_rbm_find
      to search for extents of a particular minimum size. We can then use
      gfs2_rbm_find for both looking for reservations, and also looking for
      free blocks on an individual basis when we actually come to do the
      allocation later on. As a result we only need a single set of code
      to deal with both situations.
      
      The function gfs2_rbm_from_block() is moved up rgrp.c so that it
      occurs before all of its callers.
      
      Many thanks are due to Bob for helping track down the final issue in
      this patch. That fix to the rb_tree traversal and to not share
      block reservations from a dirctory to its children is included here.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      ff7f4cb4
    • Steven Whitehouse's avatar
      GFS2: Improve block reservation tracing · 9e733d39
      Steven Whitehouse authored
      
      
      This patch improves the tracing of block reservations by
      removing some corner cases and also providing more useful
      detail in the traces.
      
      A new field is added to the reservation structure to contain
      the inode number. This is used since in certain contexts it is
      not possible to access the inode itself to obtain this information.
      As a result we can then display the inode number for all tracepoints
      and also in case we dump the resource group.
      
      The "del" tracepoint operation has been removed. This could be called
      with the reservation rgrp set to NULL. That resulted in not printing
      the device number, and thus making the information largely useless
      anyway. Also, the conditional on the rgrp being NULL can then be
      removed from the tracepoint. After this change, all the block
      reservation tracepoint calls will be called with the rgrp information.
      
      The existing ins,clm and tdel calls to the block reservation tracepoint
      are sufficient to track the entire life of the block reservation.
      
      In gfs2_block_alloc() the error detection is updated to print out
      the inode number of the problematic inode. This can then be compared
      against the information in the glock dump,tracepoints, etc.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      9e733d39
    • Steven Whitehouse's avatar
      GFS2: Replace rgblk_search with gfs2_rbm_find · 5b924ae2
      Steven Whitehouse authored
      
      
      This is part of a series of patches which are introducing the
      gfs2_rbm structure throughout the block allocation code. The
      main aim of this part is to create a search function which can
      deal directly with struct gfs2_rbm. In this case it specifies
      the initial position at which to start the search and also the
      point at which the search terminates.
      
      The net result of this is to clean up the search code and make
      it rather more readable, and the various possible exceptions which
      may occur during the search are partitioned into their own functions.
      
      There are some bug fixes too. We should not be checking the reservations
      while allocating extents - the time for that is when we are searching
      for where to put the extent, not when we've already made that decision.
      
      Also, rgblk_search had two uses, and in only one of those cases did
      it make sense to check for reservations. This is fixed in the new
      gfs2_rbm_find function, which has a cleaner interface.
      
      The reservation checking has been improved by always checking for
      contiguous reservations, and returning the first free block after
      all contiguous reservations. This is done under the spin lock to
      ensure consistancy of the tree.
      
      The allocation of extents is now in all cases done by the existing
      allocation code, and if there is an active reservation, that is updated
      after the fact. Again this is done under the spin lock, since it entails
      changing the lookup key for the reservation in question.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      5b924ae2
    • Steven Whitehouse's avatar
      GFS2: Add structure to contain rgrp, bitmap, offset tuple · 4a993fb1
      Steven Whitehouse authored
      
      
      This patch introduces a new structure, gfs2_rbm, which is a
      tuple of a resource group, a bitmap within the resource group
      and an offset within that bitmap. This is designed to make
      manipulating these sets of variables easier. There is also a
      new helper function which converts this representation back
      to a disk block address.
      
      In addition, the rbtree nodes which are used for the reservations
      were not being correctly initialised, which is now fixed. Also,
      the tracing was not passing through the inode where it should
      have been. That is mostly fixed aside from one corner case. This
      needs to be revisited since there can also be a NULL rgrp in
      some cases which results in the device being incorrect in the
      trace.
      
      This is intended to be the first step towards cleaning up some
      of the allocation code, and some further bug fixes.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      4a993fb1
    • Steven Whitehouse's avatar
      GFS2: Remove rs_requested field from reservations · 71f890f7
      Steven Whitehouse authored
      
      
      The rs_requested field is left over from the original allocation
      code, however this should have been a parameter passed to the
      various functions from gfs2_inplace_reserve() and not a member of the
      reservation structure as the value is not required after the
      initial allocation.
      
      This also helps simplify the code since we no longer need to set
      the rs_requested to zero. Also the gfs2_inplace_release()
      function can also be simplified since the reservation structure
      will always be defined when it is called, and the only remaining
      task is to unlock the rgrp if required. It can also now be
      called unconditionally too, resulting in a further simplification.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      71f890f7
  4. 19 Jul, 2012 1 commit
    • Bob Peterson's avatar
      GFS2: Reduce file fragmentation · 8e2e0047
      Bob Peterson authored
      
      
      This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
      resulting improved on disk layout greatly speeds up operations in cases
      which would have resulted in interlaced allocation of blocks previously.
      A typical example of this is 10 parallel dd processes, each writing to a
      file in a common dirctory.
      
      The implementation uses an rbtree of reservations attached to each
      resource group (and each inode).
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      8e2e0047
  5. 08 Jun, 2012 1 commit
    • Benjamin Marzinski's avatar
      GFS2: Use lvbs for storing rgrp information with mount option · 90306c41
      Benjamin Marzinski authored
      
      
      Instead of reading in the resource groups when gfs2 is checking
      for free space to allocate from, gfs2 can store the necessary infromation
      in the resource group's lvb.  Also, instead of searching for unlinked
      inodes in every resource group that's checked for free space, gfs2 can
      store the number of unlinked but inodes in the lvb, and only check for
      unlinked inodes if it will find some.
      
      The first time a resource group is locked, the lvb must initialized.
      Since this involves counting the unlinked inodes in the resource group,
      this takes a little extra time.  But after that, if the resource group
      is locked with GL_SKIP, the buffer head won't be read in unless it's
      actually needed.
      
      Enabling the resource groups lvbs is done via the rgrplvb mount option.  If
      this option isn't set, the lvbs will still be set and updated, but they won't
      be verfied or used by the filesystem.  To safely turn on this option, all of
      the nodes mounting the filesystem must be running code with this patch, and
      the filesystem must have been completely unmounted since they were updated.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      90306c41
  6. 06 Jun, 2012 1 commit
  7. 02 May, 2012 2 commits
    • David Teigland's avatar
      dlm: fixes for nodir mode · 4875647a
      David Teigland authored
      
      
      The "nodir" mode (statically assign master nodes instead
      of using the resource directory) has always been highly
      experimental, and never seriously used.  This commit
      fixes a number of problems, making nodir much more usable.
      
      - Major change to recovery: recover all locks and restart
        all in-progress operations after recovery.  In some
        cases it's not possible to know which in-progess locks
        to recover, so recover all.  (Most require recovery
        in nodir mode anyway since rehashing changes most
        master nodes.)
      
      - Change the way nodir mode is enabled, from a command
        line mount arg passed through gfs2, into a sysfs
        file managed by dlm_controld, consistent with the
        other config settings.
      
      - Allow recovering MSTCPY locks on an rsb that has not
        yet been turned into a master copy.
      
      - Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages
        from a previous, aborted recovery cycle.  Base this
        on the local recovery status not being in the state
        where any nodes should be sending LOCK messages for the
        current recovery cycle.
      
      - Hold rsb lock around dlm_purge_mstcpy_locks() because it
        may run concurrently with dlm_recover_master_copy().
      
      - Maintain highbast on process-copy lkb's (in addition to
        the master as is usual), because the lkb can switch
        back and forth between being a master and being a
        process copy as the master node changes in recovery.
      
      - When recovering MSTCPY locks, flag rsb's that have
        non-empty convert or waiting queues for granting
        at the end of recovery.  (Rename flag from LOCKS_PURGED
        to RECOVER_GRANT and similar for the recovery function,
        because it's not only resources with purged locks
        that need grant a grant attempt.)
      
      - Replace a couple of unnecessary assertion panics with
        error messages.
      Signed-off-by: default avatarDavid Teigland <teigland@redhat.com>
      4875647a
    • Bob Peterson's avatar
      GFS2: eliminate log elements and simplify · c0752aa7
      Bob Peterson authored
      
      
      This patch eliminates the gfs2_log_element data structure and
      rolls its two components into the gfs2_bufdata. This makes the code
      easier to understand and makes it easier to migrate to a rbtree
      to keep the list sorted.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      c0752aa7
  8. 30 Apr, 2012 1 commit
  9. 24 Apr, 2012 2 commits
    • Steven Whitehouse's avatar
      GFS2: Remove bd_list_tr · c50b91c4
      Steven Whitehouse authored
      
      
      This is another clean up in the logging code. This per-transaction
      list was largely unused. Its main function was to ensure that the
      number of buffers in a transaction was correct, however that counter
      was only used to check the number of buffers in the bd_list_tr, plus
      an assert at the end of each transaction. With the assert now changed
      to use the calculated buffer counts, we can remove both bd_list_tr and
      its associated counter.
      
      This should make the code easier to understand as well as shrinking
      a couple of structures.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      c50b91c4
    • Steven Whitehouse's avatar
      GFS2: Clean up log write code path · e8c92ed7
      Steven Whitehouse authored
      
      
      Prior to this patch, we have two ways of sending i/o to the log.
      One of those is used when we need to allocate both the data
      to be written itself and also a buffer head to submit it. This
      is done via sb_getblk and friends. This is used mostly for writing
      log headers.
      
      The other method is used when writing blocks which have some
      in-place counterpart. This is the case for all the metadata
      blocks which are journalled, and when journaled data is in use,
      for unescaped journalled data blocks.
      
      This patch replaces both of those two methods, and about half
      a dozen separate i/o submission points with a single i/o
      submission function. We also go direct to bio rather than
      using buffer heads, since this allows us to build i/o
      requests of the maximum size for the block device in
      question. It also reduces the memory required for flushing
      the log, which can be very useful in low memory situations.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      e8c92ed7
  10. 05 Mar, 2012 1 commit
    • Bob Peterson's avatar
      GFS2: Eliminate sd_rindex_mutex · 6aad1c3d
      Bob Peterson authored
      
      
      Over time, we've slowly eliminated the use of sd_rindex_mutex.
      Up to this point, it was only used in two places: function
      gfs2_ri_total (which totals the file system size by reading
      and parsing the rindex file) and function gfs2_rindex_update
      which updates the rgrps in memory. Both of these functions have
      the rindex glock to protect them, so the rindex is unnecessary.
      Since gfs2_grow writes to the rindex via the meta_fs, the mutex
      is in the wrong order according to the normal rules. This patch
      eliminates the mutex entirely to avoid the problem.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      6aad1c3d
  11. 28 Feb, 2012 1 commit
    • Steven Whitehouse's avatar
      GFS2: glock statistics gathering · a245769f
      Steven Whitehouse authored
      
      
      The stats are divided into two sets: those relating to the
      super block and those relating to an individual glock. The
      super block stats are done on a per cpu basis in order to
      try and reduce the overhead of gathering them. They are also
      further divided by glock type.
      
      In the case of both the super block and glock statistics,
      the same information is gathered in each case. The super
      block statistics are used to provide default values for
      most of the glock statistics, so that newly created glocks
      should have, as far as possible, a sensible starting point.
      
      The statistics are divided into three pairs of mean and
      variance, plus two counters. The mean/variance pairs are
      smoothed exponential estimates and the algorithm used is
      one which will be very familiar to those used to calculation
      of round trip times in network code.
      
      The three pairs of mean/variance measure the following
      things:
      
       1. DLM lock time (non-blocking requests)
       2. DLM lock time (blocking requests)
       3. Inter-request time (again to the DLM)
      
      A non-blocking request is one which will complete right
      away, whatever the state of the DLM lock in question. That
      currently means any requests when (a) the current state of
      the lock is exclusive (b) the requested state is either null
      or unlocked or (c) the "try lock" flag is set. A blocking
      request covers all the other lock requests.
      
      There are two counters. The first is there primarily to show
      how many lock requests have been made, and thus how much data
      has gone into the mean/variance calculations. The other counter
      is counting queueing of holders at the top layer of the glock
      code. Hopefully that number will be a lot larger than the number
      of dlm lock requests issued.
      
      So why gather these statistics? There are several reasons
      we'd like to get a better idea of these timings:
      
      1. To be able to better set the glock "min hold time"
      2. To spot performance issues more easily
      3. To improve the algorithm for selecting resource groups for
      allocation (to base it on lock wait time, rather than blindly
      using a "try lock")
      Due to the smoothing action of the updates, a step change in
      some input quantity being sampled will only fully be taken
      into account after 8 samples (or 4 for the variance) and this
      needs to be carefully considered when interpreting the
      results.
      
      Knowing both the time it takes a lock request to complete and
      the average time between lock requests for a glock means we
      can compute the total percentage of the time for which the
      node is able to use a glock vs. time that the rest of the
      cluster has its share. That will be very useful when setting
      the lock min hold time.
      
      The other point to remember is that all times are in
      nanoseconds. Great care has been taken to ensure that we
      measure exactly the quantities that we want, as accurately
      as possible. There are always inaccuracies in any
      measuring system, but I hope this is as accurate as we
      can reasonably make it.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      a245769f
  12. 11 Jan, 2012 3 commits
  13. 22 Nov, 2011 1 commit
  14. 18 Nov, 2011 1 commit
  15. 21 Oct, 2011 6 commits
    • Steven Whitehouse's avatar
      GFS2: Remove two unused variables · 9ae32429
      Steven Whitehouse authored
      
      
      The two variables being initialised in gfs2_inplace_reserve
      to track the file & line number of the caller are never
      used, so we might as well remove them.
      
      If something does go wrong, then a stack trace is probably
      more useful anyway.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      9ae32429
    • Benjamin Marzinski's avatar
      GFS2: rewrite fallocate code to write blocks directly · 64dd153c
      Benjamin Marzinski authored
      
      
      GFS2's fallocate code currently goes through the page cache. Since it's only
      writing to the end of the file or to holes in it, it doesn't need to, and it
      was causing issues on low memory environments. This patch pulls in some of
      Steve's block allocation work, and uses it to simply allocate the blocks for
      the file, and zero them out at allocation time.  It provides a slight
      performance increase, and it dramatically simplifies the code.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      64dd153c
    • Steven Whitehouse's avatar
      GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Steven Whitehouse authored
      
      
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and this the contention on that
      data structure.
      
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      54335b1f
    • Steven Whitehouse's avatar
      GFS2: Make resource groups "append only" during life of fs · 8339ee54
      Steven Whitehouse authored
      
      
      Since we have ruled out supporting online filesystem shrink,
      it is possible to make the resource group list append only
      during the life of a super block. This gives several benefits:
      
      Firstly, we only need to read new rindex elements as they are added
      rather than needing to reread the whole rindex file each time one
      element is added.
      
      Secondly, the rindex glock can be held for much shorter periods of
      time, and is completely removed from the fast path for allocations.
      The lock is taken in shared mode only when updating the resource
      groups when the first allocation occurs, and after a grow has
      taken place.
      
      Thirdly, this results in a reduction in code size, and everything
      gets a lot simpler to understand in this area.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      8339ee54
    • Bob Peterson's avatar
      GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621
      Bob Peterson authored
      
      
      Here is an update of Bob's original rbtree patch which, in addition, also
      resolves the rather strange ref counting that was being done relating to
      the bitmap blocks.
      
      Originally we had a dual system for journaling resource groups. The metadata
      blocks were journaled and also the rgrp itself was added to a list. The reason
      for adding the rgrp to the list in the journal was so that the "repolish
      clones" code could be run to update the free space, and potentially send any
      discard requests when the log was flushed. This was done by comparing the
      "cloned" bitmap with what had been written back on disk during the transaction
      commit.
      
      Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
      until the journal had been flushed. For that reason, there was a rather
      complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
      both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
      count on the buffers.
      
      However, the journal maintains a reference count on the buffers anyway, since
      they are being journaled as metadata buffers. So by moving the code which deals
      with the post-journal accounting for bitmap blocks to the metadata journaling
      code, we can entirely dispense with the rather strange buffer ref counting
      scheme and also the requirement to journal the rgrps.
      
      The net result of all this is that the ->sd_rindex_spin is left to do exactly
      one job, and that is to look after the rbtree or rgrps.
      
      This patch is designed to be a stepping stone towards using RCU for the rbtree
      of resource groups, however the reduction in the number of uses of the
      ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
      anyway.
      
      The patch retains ->go_lock and ->go_unlock for rgrps, however these maybe also
      be removed in future in favour of calling the functions directly where required
      in the code. That will allow locking of resource groups without needing to
      actually read them in - something that could be useful in speeding up statfs.
      
      In the mean time though it is valid to dereference ->bi_bh only when the rgrp
      is locked. This is basically the same rule as before, modulo the references not
      being valid until the following journal flush.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
      7c9ca621
    • Steven Whitehouse's avatar
      GFS2: Fix inode allocation error path · 40ac218f
      Steven Whitehouse authored
      
      
      If we have got far enough through the inode allocation code
      path that an inode has already been allocated, then we must
      call iput to dispose of it, if an error occurs during a
      later part of the process. This will always be the final iput
      since there will be no other references to the inode.
      
      Unlike when the inode has been unlinked, its block state will
      be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
      to skip the test in ->evict_inode() for this one case in order
      to ensure that it will be deallocated correctly. This patch adds
      a new flag in order to ensure that this will happen correctly.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      40ac218f
  16. 15 Jul, 2011 2 commits
    • Bob Peterson's avatar
      GFS2: Automatically adjust glock min hold time · 7cf8dcd3
      Bob Peterson authored
      
      
      This patch is a performance improvement for GFS2 in a clustered
      environment. It makes the glock hold time self-adjusting.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      7cf8dcd3
    • Steven Whitehouse's avatar
      GFS2: Cache dir hash table in a contiguous buffer · 17d539f0
      Steven Whitehouse authored
      
      
      This patch adds a cache for the hash table to the directory code
      in order to help simplify the way in which the hash table is
      accessed. This is intended to be a first step towards introducing
      some performance improvements in the directory code.
      
      There are two follow ups that I'm hoping to see fairly shortly. One
      is to simplify the hash table reading code now that we always read the
      complete hash table, whether we want one entry or all of them. The
      other is to introduce readahead on the heads of the hash chains
      which are referred to from the table.
      
      The hash table is a maximum of 128k in size, so it is not worth trying
      to read it in small chunks.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      17d539f0
  17. 12 Jul, 2011 1 commit
    • Steven Whitehouse's avatar
      GFS2: Fix race during filesystem mount · 3942ae53
      Steven Whitehouse authored
      
      
      There is a potential race during filesystem mounting which has recently
      been reported. It occurs when the userland gfs_controld is able to
      process requests fast enough that it tries to use the sysfs interface
      before the lock module is properly initialised. This is a pretty
      unusual case as normally the lock module initialisation is very quick
      compared with gfs_controld.
      
      This patch adds an interruptible completion which is used to ensure that
      userland will wait for the initialisation of the lock module to
      complete.
      
      There are other potential solutions to this problem, but this is the
      quickest at this stage and has been tested both with and without
      mount.gfs2 present in the system.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Reported-by: default avatarDavid Booher <dbooher@adams.net>
      3942ae53
  18. 10 May, 2011 1 commit
  19. 20 Apr, 2011 3 commits
    • Steven Whitehouse's avatar
      GFS2: Make writeback more responsive to system conditions · 4667a0ec
      Steven Whitehouse authored
      
      
      This patch adds writeback_control to writing back the AIL
      list. This means that we can then take advantage of the
      information we get in ->write_inode() in order to set off
      some pre-emptive writeback.
      
      In addition, the AIL code is cleaned up a bit to make it
      a bit simpler to understand.
      
      There is still more which can usefully be done in this area,
      but this is a good start at least.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      4667a0ec
    • Steven Whitehouse's avatar
      GFS2: Optimise glock lru and end of life inodes · f42ab085
      Steven Whitehouse authored
      
      
      The GLF_LRU flag introduced in the previous patch can be
      used to check if a glock is on the lru list when a new
      holder is queued and if so remove it, without having first
      to get the lru_lock.
      
      The main purpose of this patch however is to optimise the
      glocks left over when an inode at end of life is being
      evicted. Previously such glocks were left with the GLF_LFLUSH
      flag set, so that when reclaimed, each one required a log flush.
      This patch resets the GLF_LFLUSH flag when there is nothing
      left to flush thus preventing later log flushes as glocks are
      reused or demoted.
      
      In order to do this, we need to keep track of the number of
      revokes which are outstanding, and also to clear the GLF_LFLUSH
      bit after a log commit when only revokes have been processed.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      f42ab085
    • Steven Whitehouse's avatar
      GFS2: Improve tracing support (adds two flags) · 627c10b7
      Steven Whitehouse authored
      
      
      This adds support for two new flags. One keeps track of whether
      the glock is on the LRU list or not. The other isn't really a
      flag as such, but an indication of whether the glock has an
      attached object or not. This indication is reported without
      any locking, which is ok since we do not dereference the object
      pointer but merely report whether it is NULL or not.
      
      Also, this fixes one place where a tracepoint was missing, which
      was at the point we remove deallocated blocks from the journal.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      627c10b7
  20. 11 Mar, 2011 1 commit
    • Dave Chinner's avatar
      GFS2: introduce AIL lock · d6a079e8
      Dave Chinner authored
      
      
      The log lock is currently used to protect the AIL lists and
      the movements of buffers into and out of them. The lists
      are self contained and no log specific items outside the
      lists are accessed when starting or emptying the AIL lists.
      
      Hence the operation of the AIL does not require the protection
      of the log lock so split them out into a new AIL specific lock
      to reduce the amount of traffic on the log lock. This will
      also reduce the amount of serialisation that occurs when
      the gfs2_logd pushes on the AIL to move it forward.
      
      This reduces the impact of log pushing on sequential write
      throughput.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      d6a079e8
  21. 09 Mar, 2011 1 commit
    • Abhijith Das's avatar
      GFS2: quota allows exceeding hard limit · 662e3a55
      Abhijith Das authored
      
      
      Immediately after being synced to disk, cached quotas are zeroed out and a
      subsequent access of the cached quotas results in incorrect zero values. This
      meant that gfs2 assumed the actual usage to be the zero (or near-zero) usage
      values it found in the cached quotas and comparison against warn/limits never
      triggered a quota violation.
      
      This patch adds a new flag QDF_REFRESH that is set after a sync so that the
      cached quotas are forcefully refreshed from disk on a subsequent access on
      seeing this flag set.
      
      Resolves: rhbz#675944
      Signed-off-by: default avatarAbhi Das <adas@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      662e3a55
  22. 21 Jan, 2011 1 commit
    • Steven Whitehouse's avatar
      GFS2: Use RCU for glock hash table · bc015cb8
      Steven Whitehouse authored
      
      
      This has a number of advantages:
      
       - Reduces contention on the hash table lock
       - Makes the code smaller and simpler
       - Should speed up glock dumps when under load
       - Removes ref count changing in examine_bucket
       - No longer need hash chain lock in glock_put() in common case
      
      There are some further changes which this enables and which
      we may do in the future. One is to look at using SLAB_RCU,
      and another is to look at using a per-cpu counter for the
      per-sb glock counter, since that is touched twice in the
      lifetime of each glock (but only used at umount time).
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      bc015cb8
  23. 10 Jan, 2011 1 commit