1. 01 Apr, 2012 3 commits
    • blkcg: introduce blkg_stat and blkg_rwstat · edcb0722
      Tejun Heo authored
      blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values
      on 32bit archs and some stat counters have subtypes to distinguish
      read/writes and sync/async IOs.  The stat code paths are confusing and
      involve a lot of going back and forth between blkcg core and specific
      policy implementations, and synchronization and subtype handling are
      open coded in blkcg core.
      This patch introduces struct blkg_stat and blkg_rwstat which, with
      accompanying operations, encapsulate stat updating and accessing with
      proper synchronization.
      blkg_stat is a simple u64 counter with 64bit read-access protection.
      blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw
      flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and
      replaces stat_sub_type indexed arrays.
      All counters in blkio_group_stats and blkio_group_stats_cpu are
      replaced with either blkg_stat or blkg_rwstat along with all users.
      This does add one u64_stats_sync per counter and increase stats_sync
      operations but they're empty/noops on 64bit archs and blkcg doesn't
      have too many counters, especially with DEBUG_BLK_CGROUP off.
      While the currently resulting code isn't necessarily simpler at the
      moment, this will enable further clean up of blkcg stats code.
      - blkg_stat_add() replaces blkio_add_stat() and
        blkio_check_and_dec_stat().  Note that BUG_ON() on underflow in the
        latter function no longer exists.  It's *way* better to have
        underflowed stat counters than oopsing.
      - blkio_group_stats->dequeue is now a proper u64 stat counter instead
        of ulong.
      - reset_stats() updated to clear each stat counter individually and
        BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed.
      - Some functions reconstruct rw flags from direction and sync
        booleans.  This will be removed by future patches.
      Signed-off-by: Tejun Heo <tj@kernel.org>
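The split between a plain counter and a read/write plus sync/async counter can be pictured with a small user-space sketch. Everything prefixed `sketch_`/`SKETCH_` is hypothetical and chosen for illustration; the flag bits stand in for %REQ_WRITE and %REQ_SYNC and do not reproduce their kernel values, and the 64bit read protection is left out.

```c
#include <stdint.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_REQ_WRITE (1u << 0)
#define SKETCH_REQ_SYNC  (1u << 1)

enum sketch_rwstat_idx {
    SKETCH_RWSTAT_READ,
    SKETCH_RWSTAT_WRITE,
    SKETCH_RWSTAT_SYNC,
    SKETCH_RWSTAT_ASYNC,
    SKETCH_RWSTAT_NR
};

/* Simple counter without subtypes. */
struct sketch_stat {
    uint64_t cnt;
};

/* Counter with rw and [a]sync subcounters. */
struct sketch_rwstat {
    uint64_t cnt[SKETCH_RWSTAT_NR];
};

static inline void sketch_stat_add(struct sketch_stat *stat, uint64_t val)
{
    stat->cnt += val;
}

/* Each update lands in one direction slot and one sync slot. */
static inline void sketch_rwstat_add(struct sketch_rwstat *rwstat,
                                     unsigned int rw, uint64_t val)
{
    if (rw & SKETCH_REQ_WRITE)
        rwstat->cnt[SKETCH_RWSTAT_WRITE] += val;
    else
        rwstat->cnt[SKETCH_RWSTAT_READ] += val;

    if (rw & SKETCH_REQ_SYNC)
        rwstat->cnt[SKETCH_RWSTAT_SYNC] += val;
    else
        rwstat->cnt[SKETCH_RWSTAT_ASYNC] += val;
}

/* READ + WRITE already covers every update, so the total skips SYNC/ASYNC. */
static inline uint64_t sketch_rwstat_total(const struct sketch_rwstat *rwstat)
{
    return rwstat->cnt[SKETCH_RWSTAT_READ] + rwstat->cnt[SKETCH_RWSTAT_WRITE];
}
```

Passing the flags straight through is what lets callers stop reconstructing direction and sync booleans by hand.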
    • blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcounters · 2aa4a152
      Tejun Heo authored
      BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters
      and is counted by blkio_group_stats_cpu->sectors; however, it still
      holds a member in blkio_group_stats_cpu->stat_arr_cpu.
      Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it
      for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have
      subcounters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: remove unused @pol and @plid parameters · aaec55a0
      Tejun Heo authored
      @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no
      longer necessary.  Drop them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 20 Mar, 2012 6 commits
    • blkcg: add blkcg->id · 9a9e8a26
      Tejun Heo authored
      Add 64bit unique id to blkcg.  This will be used by policies which
      want blkcg identity test to tell whether the associated blkcg has
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove blkio_group->stats_lock · edf1b879
      Tejun Heo authored
      With recent plug merge updates, all non-percpu stat updates happen
      under queue_lock making stats_lock unnecessary to synchronize stat
      updates.  The only synchronization necessary is stat reading, which
      can be done using u64_stats_sync instead.
      This patch removes blkio_group->stats_lock and adds
      blkio_group_stats->syncp for reader synchronization.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
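Reader-only synchronization of a 64bit value on a 32bit machine is usually done with a sequence counter. The sketch below is a user-space approximation with C11 atomics, not the kernel's u64_stats_sync; all `sketch_` names are hypothetical. It assumes a single updater at a time, which is what queue_lock provides in the commit above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A u64 stored as two 32bit halves, as a 32bit arch would see it. */
struct sketch_sync_u64 {
    atomic_uint seq;        /* odd while an update is in progress */
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
};

/* Updater side: callers must already be serialized against each other. */
static inline void sketch_sync_u64_add(struct sketch_sync_u64 *s, uint64_t val)
{
    uint64_t cur = ((uint64_t)atomic_load(&s->hi) << 32) | atomic_load(&s->lo);

    cur += val;
    atomic_fetch_add(&s->seq, 1);                 /* begin: seq becomes odd */
    atomic_store(&s->lo, (uint32_t)cur);
    atomic_store(&s->hi, (uint32_t)(cur >> 32));
    atomic_fetch_add(&s->seq, 1);                 /* end: seq is even again */
}

/* Reader side: retry until both halves come from the same update. */
static inline uint64_t sketch_sync_u64_read(struct sketch_sync_u64 *s)
{
    unsigned int start;
    uint32_t lo, hi;

    do {
        start = atomic_load(&s->seq);
        lo = atomic_load(&s->lo);
        hi = atomic_load(&s->hi);
    } while ((start & 1u) || atomic_load(&s->seq) != start);

    return ((uint64_t)hi << 32) | lo;
}
```

Readers never block the updater; they only repeat the read if an update overlapped it.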
    • blkcg: restructure blkio_get_stat() · c4c76a05
      Tejun Heo authored
      Restructure blkio_get_stat() to prepare for removal of stats_lock.
      * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
        subtypes instead of using BLKIO_STAT_QUEUED.
      * Separate out stat acquisition and printing.  After this, there are
        only two users of blkio_fill_stat().  Just open code it.
      * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1.  There's no
        need to subtract one.  Use MAX_KEY_LEN consistently.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: simplify stat reset · 997a026c
      Tejun Heo authored
      blkiocg_reset_stats() implements stat reset for blkio.reset_stats
      cgroupfs file.  This feature is very unconventional and something
      which shouldn't have been merged.  It's only useful when there's only
      one user or tool looking at the stats.  As soon as multiple users
      and/or tools are involved, it becomes useless as resetting disrupts
      other usages.  There are very good reasons why all other stats expect
      readers to read values at the start and end of a period and subtract
      to determine delta over the period.
      The implementation is rather complex - some fields shouldn't be
      cleared and it saves some fields, resets whole and restores for some
      reason.  Reset of percpu stats is also racy.  The comment points to
      64bit store atomicity for the reason but even without that stores for
      zero can simply race with other CPUs doing RMW and get clobbered.
      Simplify reset by
      * Clear selectively instead of resetting and restoring.
      * Grouping debug stat fields to be reset and using memset() over them.
      * Not caring about stats_lock.
      * Using memset() to reset percpu stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
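The read-twice-and-subtract convention mentioned above needs no reset support at all. A minimal sketch, with the hypothetical helper name `sketch_stat_delta`:

```c
#include <stdint.h>

/*
 * Delta of a monotonically increasing u64 counter between two samples.
 * Unsigned subtraction is modulo 2^64, so the result stays correct even
 * if the counter wrapped once between the two reads.
 */
static inline uint64_t sketch_stat_delta(uint64_t start, uint64_t end)
{
    return end - start;
}
```

Each tool keeps its own starting sample, so several readers can measure overlapping periods without disturbing each other.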
    • blkcg: don't use percpu for merged stats · 5fe224d2
      Tejun Heo authored
      With recent plug merge updates, merged stats are no longer called for
      plug merges and now only updated while holding queue_lock.  As
      stats_lock is scheduled to be removed, there's no reason to use percpu
      for merged stats.  Don't use percpu for merged stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: alloc per cpu stats from worker thread in a delayed manner · 1cd9e039
      Vivek Goyal authored
      Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in
      IO path there are times when we want GFP_NOIO semantics. As there is no
      way to pass the allocation flags to alloc_percpu(), this patch delays the
      allocation of stats using a worker thread.
      v2-> tejun suggested following changes. Changed the patch accordingly.
      	- move alloc_node location in structure
      	- reduce the size of names of some of the fields
      	- Reduce the scope of locking of alloc_list_lock
      	- Simplified stat_alloc_fn() by allocating stats for all
      	  policies in one go and then assigning these to a group.
      v3 -> Andrew suggested to put some comments in the code. Also raised
            concerns about trying to allocate infinitely in case of allocation
            failure. I have changed the logic to sleep for 10ms before retrying.
            That should take care of non-preemptible UP kernels.
      v4 -> Tejun had more suggestions.
      	- drop list_for_each_entry_all()
      	- instead of msleep() use queue_delayed_work()
      	- Some cleanups related to more compact coding.
      v5-> tejun suggested more cleanups leading to more compact code.
      tj: - Relocated pcpu_stats into blkio_stat_alloc_fn().
          - Minor comment update.
          - This also fixes suspicious RCU usage warning caused by invoking
            cgroup_path() from blkg_alloc() without holding RCU read lock.
            Now that blkg_alloc() doesn't require sleepable context, RCU
            read lock from blkg_lookup_create() is maintained throughout
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 06 Mar, 2012 20 commits
    • block: make block cgroup policies follow bio task association · 4f85cb96
      Tejun Heo authored
      Implement bio_blkio_cgroup() which returns the blkcg associated with
      the bio if exists or %current's blkcg, and use it in blk-throttle and
      cfq-iosched propio.  This makes both cgroup policies honor task
      association for the bio instead of always assuming %current.
      As nobody is using bio_set_task() yet, this doesn't introduce any
      behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: drop unnecessary RCU locking · c875f4d0
      Tejun Heo authored
      Now that blkg additions / removals are always done under both q and
      blkcg locks, the only places RCU locking is necessary are
      blkg_lookup[_create]() for lookup w/o blkcg lock.  This patch drops
      unnecessary RCU locking replacing it with plain blkcg locking as
      * blkiocg_pre_destroy() already performs proper locking and doesn't need
        RCU.  Dropped.
      * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
        lock.  This isn't a hot path.
      * Now unnecessary synchronize_rcu() from queue exit paths removed.
        This makes q->nr_blkgs unnecessary.  Dropped.
      * RCU annotation on blkg->q removed.
      -v2: Vivek pointed out that blkg_lookup_create() still needs to be
           called under rcu_read_lock().  Updated.
      -v3: After the update, stats_lock locking in blkio_read_blkg_stats()
           shouldn't be using _irq variant as it otherwise ends up enabling
           irq while blkcg->lock is locked.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: use double locking instead of RCU for blkg synchronization · 9f13ef67
      Tejun Heo authored
      blkgs are chained from both blkcgs and request_queues and thus
      subjected to two locks - blkcg->lock and q->queue_lock.  As both blkcg
      and q can go away anytime, locking during removal is tricky.  It's
      currently solved by wrapping removal inside RCU, which makes the
      synchronization complex.  There are three locks to worry about - the
      outer RCU, q lock and blkcg lock, and it leads to nasty subtle
      complications like conditional synchronize_rcu() on queue exit paths.
      For all other paths, blkcg lock is naturally nested inside q lock and
      the only exception is blkcg removal path, which is a very cold path
      and can be implemented as clumsy but conceptually-simple reverse
      double lock dancing.
      This patch updates blkg removal path such that blkgs are removed while
      holding both q and blkcg locks, which is trivial for request queue
      exit path - blkg_destroy_all().  The blkcg removal path,
      blkiocg_pre_destroy(), implements reverse double lock dancing
      essentially identical to ioc_release_fn().
      This simplifies blkg locking - no half-dead blkgs to worry about.  Now
      unnecessary RCU annotations will be removed by the next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
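The "reverse double lock dancing" can be sketched in user space as follows. This is not the kernel code: the spinlock is a bare C11 `atomic_flag`, and every `sketch_` type and function is hypothetical. The assumed normal order is queue lock first, then cgroup lock; the removal path starts from the cgroup, so it holds the inner lock and only try-locks the outer one.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sketch_spinlock {
    atomic_flag held;
};

static inline void sketch_spin_init(struct sketch_spinlock *l)
{
    atomic_flag_clear(&l->held);
}

static inline void sketch_spin_lock(struct sketch_spinlock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;   /* busy-wait until the holder releases it */
}

/* Returns nonzero if the lock was acquired. */
static inline int sketch_spin_trylock(struct sketch_spinlock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

static inline void sketch_spin_unlock(struct sketch_spinlock *l)
{
    atomic_flag_clear(&l->held);
}

struct sketch_queue {
    struct sketch_spinlock lock;    /* outer lock in the normal order */
    int nr_groups;
};

struct sketch_group {
    struct sketch_queue *q;
    struct sketch_group *next;      /* link on the cgroup's list */
};

struct sketch_cgroup {
    struct sketch_spinlock lock;    /* inner lock in the normal order */
    struct sketch_group *groups;
};

/* Unlink every group while holding both locks; returns how many were removed. */
static inline int sketch_cgroup_destroy_groups(struct sketch_cgroup *cg)
{
    int removed = 0;

    sketch_spin_lock(&cg->lock);
    while (cg->groups) {
        struct sketch_group *g = cg->groups;
        struct sketch_queue *q = g->q;

        if (sketch_spin_trylock(&q->lock)) {
            cg->groups = g->next;
            g->next = NULL;
            q->nr_groups--;
            removed++;
            sketch_spin_unlock(&q->lock);
        } else {
            /* Back off so the queue-lock holder can take cg->lock, then retry. */
            sketch_spin_unlock(&cg->lock);
            sketch_spin_lock(&cg->lock);
        }
    }
    sketch_spin_unlock(&cg->lock);
    return removed;
}
```

The back-off branch is what avoids an ABBA deadlock; it is clumsy but only runs on the cold removal path.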
    • blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo authored
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      This patch makes blkg's per cgroup-queue and lets them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: let blkcg core manage per-queue blkg list and counter · 03aa264a
      Tejun Heo authored
      With the previous patch to move blkg list heads and counters to
      request_queue and blkg, logic to manage them in both policies is
      almost identical and can be moved to blkcg core.
      This patch moves blkg link logic into blkg_lookup_create(), implements
      common blkg unlink code in blkg_destroy(), and updates
      blkg_destroy_all() so that it's policy specific and can skip root
      group.  The updated blkg_destroy_all() is now used to both clear queue
      for bypassing and elv switching, and release all blkgs on q exit.
      This patch introduces a race window where policy [de]registration may
      race against queue blkg clearing.  This can only be a problem on cfq
      unload and shouldn't be a real problem in practice (and we have many
      other places where this race already exists).  Future patches will
      remove these unlikely races.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: move per-queue blkg list heads and counters to queue and blkg · 4eef3049
      Tejun Heo authored
      Currently, specific policy implementations are responsible for
      maintaining list and number of blkgs.  This duplicates code
      unnecessarily, and hinders factoring common code and providing blkcg
      API with better defined semantics.
      After this patch, request_queue hosts list heads and counters and blkg
      has list nodes for both policies.  This patch only relocates the
      necessary fields and the next patch will actually move management code
      into blkcg core.
      Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
      have 2 elements.  This is to avoid include dependency and will be
      removed by the next patch.
      This patch doesn't introduce any behavior change.
      -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
           as pointed out by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: don't use blkg->plid in stat related functions · c1768268
      Tejun Heo authored
      blkg is scheduled to be unified for all policies and thus there won't
      be one-to-one mapping from blkg to policy.  Update stat related
      functions to take explicit @pol or @plid arguments and not use
      This is painful for now but most of specific stat interface functions
      will be replaced with a handful of generic helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: make blkg->pd an array and move configuration and stats into it · 549d3aa8
      Tejun Heo authored
      To prepare for unifying blkgs for different policies, make blkg->pd an
      array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats,
      and ->stats_cpu into blkg_policy_data.
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: move refcnt to blkcg core · 1adaf3dd
      Tejun Heo authored
      Currently, blkcg policy implementations manage blkg refcnt duplicating
      mostly identical code in both policies.  This patch moves refcnt to
      blkg and lets blkcg core handle refcnt and freeing of blkgs.
      * cfq blkgs now also get freed via RCU.
      * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free.  If
        necessary, we can add blkio_exit_group_fn() to resurrect this.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: let blkcg core handle policy private data allocation · 0381411e
      Tejun Heo authored
      Currently, blkg's are embedded in the blkcg policy private data
      structure and thus allocated and freed by policies.  This leads
      to duplicate code in policies, hinders implementing common part in
      blkcg core with strong semantics, and forces duplicate blkg's for the
      same cgroup-q association.
      This patch introduces struct blkg_policy_data which is a separate data
      structure chained from blkg.  Each policy specifies the amount of private
      data it needs in its blkio_policy_type->pdata_size and blkcg core
      takes care of allocating them along with blkg which can be accessed
      using blkg_to_pdata().  blkg can be determined from pdata using
      pdata_to_blkg().  blkio_alloc_group_fn() method is accordingly updated
      to blkio_init_group_fn().
      For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with
      blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in
      the reverse direction are added.
      Except that policy specific data now lives in a separate data
      structure from blkg, this patch doesn't introduce any functional
      This will be used to unify blkg's for different policies.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
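The relationship between a group, its chained policy data, and the two mapping directions can be sketched in user space. All `sketch_` names, the two-policy array size, and the use of `calloc()` are assumptions made for illustration; only the idea of a `pdata_size` field and of mapping in both directions comes from the commit message.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NR_POLICIES 2

struct sketch_blkg;

/* Separate structure chained from the group; private data trails it. */
struct sketch_policy_data {
    struct sketch_blkg *blkg;       /* back pointer to the owning group */
    unsigned long long pdata[];     /* policy-private area, 8-byte aligned */
};

struct sketch_blkg {
    struct sketch_policy_data *pd[SKETCH_NR_POLICIES];
};

struct sketch_policy_type {
    int plid;                       /* index into sketch_blkg.pd[] */
    size_t pdata_size;              /* bytes of private data the policy needs */
};

/* Core-side allocation: returns 0 on success, -1 on allocation failure. */
static inline int sketch_blkg_alloc_pd(struct sketch_blkg *blkg,
                                       const struct sketch_policy_type *pol)
{
    struct sketch_policy_data *pd = calloc(1, sizeof(*pd) + pol->pdata_size);

    if (!pd)
        return -1;
    pd->blkg = blkg;
    blkg->pd[pol->plid] = pd;
    return 0;
}

/* Group to policy-private data. */
static inline void *sketch_blkg_to_pdata(struct sketch_blkg *blkg,
                                         const struct sketch_policy_type *pol)
{
    if (!blkg || !blkg->pd[pol->plid])
        return NULL;
    return blkg->pd[pol->plid]->pdata;
}

/* Policy-private data back to its group, by stepping back to the header. */
static inline struct sketch_blkg *sketch_pdata_to_blkg(void *pdata)
{
    struct sketch_policy_data *pd;

    if (!pdata)
        return NULL;
    pd = (struct sketch_policy_data *)
        ((char *)pdata - offsetof(struct sketch_policy_data, pdata));
    return pd->blkg;
}
```

Because the core owns the allocation, a policy only has to describe its size and initialize the area it is handed.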
    • blkcg: add blkcg_{init|drain|exit}_queue() · 5efd6113
      Tejun Heo authored
      Currently block core calls directly into blk-throttle for init, drain
      and exit.  This patch adds blkcg_{init|drain|exit}_queue() which wraps
      the blk-throttle functions.  This is to give more control and
      visibility to blkcg core layer for proper layering.  Further patches
      will add logic common to blkcg policies to the functions.
      While at it, collapse blk_throtl_release() into blk_throtl_exit().
      There's no reason to keep them separate.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: let blkio_group point to blkio_cgroup directly · 7ee9c562
      Tejun Heo authored
      Currently, blkg points to the associated blkcg via its css_id.  This
      unnecessarily complicates dereferencing blkcg.  Let blkg hold a
      reference to the associated blkcg and point directly to it and disable
      css_id on blkio_subsys.
      This change requires splitting blkiocg_destroy() into
      blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be
      destroyed and all the blkcg references held by them dropped during
      cgroup removal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: kill the mind-bending blkg->dev · 7a4dd281
      Tejun Heo authored
      blkg->dev is dev_t recording the device number of the block device for
      the associated request_queue.  It is used to identify the associated
      block device when printing out configuration or stats.
      This is redundant to begin with.  A blkg is an association between a
      cgroup and a request_queue and it of course is possible to reach
      request_queue from blkg and synchronization conventions are in place
      for safe q dereferencing, so this shouldn't be necessary from the
      beginning.  Furthermore, it's initialized by sscanf()ing the device
      name of backing_dev_info.  The mind boggles.
      Anyways, if blkg is visible under rcu lock, we *know* that the
      associated request_queue hasn't gone away yet and its bdi is
      registered and alive - blkg can't be created for request_queue which
      hasn't been fully initialized and it can't go away before blkg is
      Let stat and conf read functions get device name from
      blkg->q->backing_dev_info.dev and pass it down to printing functions
      and remove blkg->dev.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: kill blkio_policy_node · 4bfd482e
      Tejun Heo authored
      Now that blkcg configuration lives in blkg's, blkio_policy_node is no
      longer necessary.  Kill it.
      blkio_policy_parse_and_set() now fails if invoked for missing device
      and functions to print out configurations are updated to print from
      cftype_blkg_same_policy() is dropped along with other policy functions
      for consistency.  Its one line is open coded in the only user -
      -v2: Update to reflect the retry-on-bypass logic change of the
           previous patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: don't allow or retain configuration of missing devices · e56da7e2
      Tejun Heo authored
      blkcg is very peculiar in that it allows setting and remembering
      configurations for non-existent devices by maintaining separate data
      structures for configuration.
      This behavior is completely out of the usual norms and outright
      confusing; furthermore, it uses dev_t number to match the
      configuration to devices, which is unpredictable to begin with and
      becomes completely unusable if EXT_DEVT is fully used.
      It is wholly unnecessary - we already have fully functional userland
      mechanism to program devices being hotplugged which has full access to
      device identification, connection topology and filesystem information.
      Add a new struct blkio_group_conf which contains all blkcg
      configurations to blkio_group and let blkio_group, which can be
      created iff the associated device exists and is removed when the
      associated device goes away, carry all configurations.
      Note that, after this patch, all newly created blkg's will always have
      the default configuration (unlimited for throttling and blkcg's weight
      for propio).
      This patch makes blkio_policy_node meaningless but doesn't remove it.
      The next patch will.
      -v2: Updated to retry after short sleep if blkg lookup/creation failed
           due to the queue being temporarily bypassed as indicated by
           -EBUSY return.  Pointed out by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: factor out blkio_group creation · cd1604fa
      Tejun Heo authored
      Currently both blk-throttle and cfq-iosched implement their own
      blkio_group creation code in throtl_get_tg() and cfq_get_cfqg().  This
      patch factors out the common code into blkg_lookup_create(), which
      returns ERR_PTR value so that transitional failures due to queue
      bypass can be distinguished from other failures.
      * New blkio_policy_ops methods blkio_alloc_group_fn() and
        blkio_link_group_fn() are added.  Both are transitional and will be
        removed once the blkg management code is fully moved into
      * blkio_alloc_group_fn() allocates policy-specific blkg which is
        usually a larger data structure with blkg as the first entry and
        initializes it.  Note that initialization of blkg proper, including
        percpu stats, is responsibility of blk-cgroup proper.
        Note that default config (weight, bps...) initialization is done
        from this method; otherwise, we end up violating locking order
        between blkcg and q locks via blkcg_get_CONF() functions.
      * blkio_link_group_fn() is called under queue_lock and responsible for
        linking the blkg to the queue.  blkcg side is handled by blk-cgroup
      * The common blkg creation function is named blkg_lookup_create() and
        blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
        Also, throtl / cfq related functions are similarly [re]named for
      This simplifies blkcg policy implementations and enables further
      -v2: Vivek noticed that blkg_lookup_create() incorrectly tested
           blk_queue_dead() instead of blk_queue_bypass(), leading a user of
           the function to end up creating a new blkg on a bypassing queue.
           This is a bug introduced while relocating bypass patches before
           this one.  Fixed.
      -v3: ERR_PTR patch folded into this one.  @for_root added to
           blkg_lookup_create() to allow creating root group on a bypassed
           queue during elevator switch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
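Returning an error encoded in the pointer is what lets a caller tell a transient bypass failure from a permanent one. The sketch below re-creates the encoding in user space; the `sketch_` helpers mirror the idea behind the kernel's ERR_PTR() family but are not the kernel macros, and the choice of -EINVAL for a dead queue is an assumption of this example (-EBUSY for a bypassed queue is taken from the log above).

```c
#include <errno.h>
#include <stdint.h>

/* Assumed: the top 4095 addresses are never valid pointers. */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_queue {
    int bypass;     /* temporarily bypassed, e.g. during elevator switch */
    int dead;       /* going away for good */
};

struct sketch_group {
    int id;
};

/*
 * Hypothetical stand-in for a lookup-or-create helper: hands back the
 * given group on success, or an encoded errno on failure.
 */
static inline struct sketch_group *
sketch_lookup_create(const struct sketch_queue *q, struct sketch_group *grp)
{
    if (q->dead)
        return sketch_err_ptr(-EINVAL);     /* permanent failure */
    if (q->bypass)
        return sketch_err_ptr(-EBUSY);      /* transient, caller may retry */
    return grp;
}
```

A caller that sees -EBUSY can sleep briefly and retry, while any other error is reported as is.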
    • blkcg: add blkio_policy[] array and allow one policy per policy ID · 035d10b2
      Tejun Heo authored
      Block cgroup policies are maintained in a linked list and,
      theoretically, multiple policies sharing the same policy ID are
      This patch temporarily restricts one policy per plid and adds
      blkio_policy[] array which indexes registered policy types by plid.
      Both the restriction and blkio_policy[] array are transitional and
      will be removed once API cleanup is complete.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: use q and plid instead of opaque void * for blkio_group association · ca32aefc
      Tejun Heo authored
      blkio_group is an association between a block cgroup and a queue for a
      given policy.  Using opaque void * for association makes things
      confusing and hinders factoring of common code.  Use request_queue *
      and, if necessary, policy id instead.
      This will help block cgroup API cleanup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Tejun Heo authored
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that apply to them, as an interim step, all per-policy blkg
      data will be shot down.
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      * blk-throtl doesn't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have persistent root group and not shooting
        down root_tg for now eases transition to that point by avoiding
        having to update td->root_tg and is safe as blk-throtl can never be
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
           blk-throtl no longer shoots down root_tg to avoid breaking
           Also, nest queue_lock inside blkio_list_lock, not the other way
           around, to avoid introducing a possible deadlock via blkcg lock.
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: make CONFIG_BLK_CGROUP bool · 32e380ae
      Tejun Heo authored
      Block cgroup core can be built as module; however, it isn't too useful
      as blk-throttle can only be built-in and cfq-iosched is usually the
      default built-in scheduler.  Scheduled blkcg cleanup requires calling
      into blkcg from block core.  To simplify that, disallow building blkcg
      as module by making CONFIG_BLK_CGROUP bool.
      If building blkcg core as module really matters, which I doubt, we can
      revisit it after blkcg API cleanup.
      -v2: Vivek pointed out that IOSCHED_CFQ was incorrectly updated to
           depend on BLK_CGROUP.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 24 Oct, 2011 1 commit
  5. 23 May, 2011 1 commit
  6. 20 May, 2011 3 commits
  7. 16 May, 2011 1 commit
    • blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup · 70087dc3
      Vivek Goyal authored
      Currently we first map the task to cgroup and then cgroup to
      blkio_cgroup. There is a more direct way to get to blkio_cgroup
      from task using task_subsys_state(). Use that.
      The real reason for the fix is that it also avoids a race in generic
      cgroup code. During remount/umount rebind_subsystems() is called and
      it can do the following without any RCU protection.
      cgrp->subsys[i] = NULL;
      That means if somebody got hold of cgroup under rcu and then it tried
      to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
      is wrong. I was running into this race condition with ltp running on an
      upstream derived kernel and that led to a crash.
      So ideally we should also fix cgroup generic code to wait for rcu
      grace period before setting pointer to NULL. Li Zefan is not very keen
      on introducing synchronize_wait() as he thinks it will slow
      down mount/remount/umount operations.
      So for the time being at least fix the kernel crash by taking a more
      direct route to blkio_cgroup.
      One tester had reported a crash while running LTP on a derived kernel
      and with this fix the crash is no longer seen while the test has been
      running for over 6 days.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  8. 12 Mar, 2011 1 commit
  9. 08 Mar, 2011 1 commit
  10. 01 Oct, 2010 2 commits
    • blkio-throttle: limit max iops value to UINT_MAX · 9355aede
      Vivek Goyal authored
      - Limit max iops value to UINT_MAX and return error to user if value is more
        than that instead of accepting bigger values and truncating implicitly.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
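Rejecting an out-of-range value instead of truncating it could look like the user-space sketch below. `sketch_parse_iops` is a hypothetical helper, not the kernel's parsing code, and returning -EINVAL is an assumption; the commit only says an error is returned to the user.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal iops limit. On success store it in *iops and return 0;
 * on malformed or too-large input return -EINVAL and leave *iops untouched.
 */
static inline int sketch_parse_iops(const char *buf, unsigned int *iops)
{
    char *end;
    unsigned long long val;

    /* strtoull() would accept leading whitespace and signs; refuse them. */
    if (*buf < '0' || *buf > '9')
        return -EINVAL;

    errno = 0;
    val = strtoull(buf, &end, 10);
    if (errno || *end != '\0')
        return -EINVAL;

    /* Parse into a wider type first so the range check is meaningful. */
    if (val > UINT_MAX)
        return -EINVAL;

    *iops = (unsigned int)val;
    return 0;
}
```

Doing the comparison in the wider type is the point: assigning straight into an `unsigned int` would silently wrap the value.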
    • blkio: Recalculate the throttled bio dispatch time upon throttle limit change · fe071437
      Vivek Goyal authored
      o Currently any cgroup throttle limit changes are processed asynchronously and
        the change does not take effect till a new bio is dispatched from same group.
      o It might happen that a user sets a ridiculously low limit on throttling.
        Say 1 byte per second on reads. In such cases simple operations like mounting
        a disk can wait for a very long time.
      o Once bio is throttled, there is no easy way to come out of that wait even if
        user increases the read limit later.
      o This patch fixes it. Now if a user changes the cgroup limits, we recalculate
        the bio dispatch time according to new limits.
      o Can't take queue lock under blkcg_lock, hence after the change I wake
        up the dispatch thread again which recalculates the time. So there are some
        variables being synchronized across two threads without lock and I had to
        make use of barriers. Hoping I have used barriers correctly. Any review of
        memory barrier code especially will help.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  11. 16 Sep, 2010 1 commit