1. 09 Jul, 2015 2 commits
    • Tejun Heo's avatar
      blkcg: fix blkcg_policy_data allocation bug · 06b285bd
      Tejun Heo authored
       ("block, cgroup: implement policy-specific per-blkcg
      data") updated per-blkcg policy data to be dynamically allocated.
      When a policy is registered, its policy data aren't created.  Instead,
      when the policy is activated on a queue, the policy data are allocated
      if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
      This is buggy.  Consider the following scenario.
      1. A blkcg is created.  No blkg's attached yet.
      2. The policy is registered.  No policy data is allocated.
      3. The policy is activated on a queue.  As the above blkcg doesn't
         have any blkg's, it won't allocate the matching blkcg_policy_data.
      4. An IO is issued from the blkcg and blkg is created and the blkcg
         still doesn't have the matching policy data allocated.
      With cfq-iosched, this leads to an oops.
      It also doesn't free policy data on policy unregistration assuming
      that freeing of all policy data on blkcg destruction should take care
      of it; however, this also is incorrect.
      1. A blkcg has policy data.
      2. The policy gets unregistered but the policy data remains.
      3. Another policy gets registered on the same slot.
      4. Later, the new policy tries to allocate policy data on the previous
         blkcg but the slot is already occupied and gets skipped.  The
         policy ends up operating on the policy data of the previous policy.
      There's no reason to manage blkcg_policy_data lazily.  The reason we
      do lazy allocation of blkg's is that the number of all possible blkg's
      is the product of cgroups and block devices which can reach a
      surprising level.  blkcg_policy_data is contrained by the number of
      cgroups and shouldn't be a problem.
      This patch makes blkcg_policy_data to be allocated for all existing
      blkcg's on policy registration and freed on unregistration and removes
      blkcg_policy_data handling from policy [de]activation paths.  This
      makes that blkcg_policy_data are created and removed with the policy
      they belong to and fixes the above described problems.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: e48453c3
       ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Tejun Heo's avatar
      blkcg: implement all_blkcgs list · 7876f930
      Tejun Heo authored
      Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
      protected by blkcg_pol_mutex.  This will be used to fix
      blkcg_policy_data allocation bug.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  2. 04 Jul, 2015 4 commits
  3. 03 Jul, 2015 1 commit
  4. 02 Jul, 2015 4 commits
    • Joel Porquet's avatar
      irqchip: Move IRQCHIP_DECLARE macro to include/linux/irqchip.h · 91e20b50
      Joel Porquet authored
      At the moment the IRQCHIP_DECLARE macro is only declared locally in
      drivers/irqchip/irqchip.h. It prevents from using it directly in arch/*
      directories whenever irqchip drivers only exist there, which happens in a few
      cases (e.g. arc, arm, microblaze and mips).
      This patch makes the macro to be globally defined, i.e. in
      include/linux/irqchip.h, and thus usable for arch-specific declarations of
      irqchip drivers. In this way, it is very similar to what clocksource does (ie
      CLOCKSOURCE_OF_DECLARE is defined in include/linux/clocksource.h).
      For now, this patch only moves the declaration of the macro
      IRQCHIP_DECLARE to the global header 'include/linux/irqchip.h' and make
      'drivers/irqchip/irqchip.h' include 'include/linux/irqchip.h'. Later, other
      patches will get rid of 'drivers/irqchip/irqchip.h' and modify all the impacted
      irqchip drivers.
      Signed-off-by: default avatarJoel Porquet <joel@porquet.org>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Link: http://lkml.kernel.org/r/1435865565-14114-1-git-send-email-joel@porquet.org
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    • Tejun Heo's avatar
      writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Tejun Heo authored
       ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Tested-by: default avatarJon Christopherson <jon@jons.org>
      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Allen Hubbe's avatar
      NTB: Move files in preparation for NTB abstraction · ec110bc7
      Allen Hubbe authored
      This patch only moves files to their new locations, before applying the
      next two patches adding the NTB Abstraction layer.  Splitting this patch
      from the next is intended make distinct which code is changed only due
      to moving the files, versus which are substantial code changes in adding
      the NTB Abstraction layer.
      Signed-off-by: default avatarAllen Hubbe <Allen.Hubbe@emc.com>
      Signed-off-by: default avatarJon Mason <jdmason@kudzu.us>
    • Nikolay Borisov's avatar
      bufferhead: Add _gfp version for sb_getblk() · bd7ade3c
      Nikolay Borisov authored
      sb_getblk() is used during ext4 (and possibly other FSes) writeback
      paths. Sometimes such path require allocating memory and guaranteeing
      that such allocation won't block. Currently, however, there is no way
      to provide user flags for sb_getblk which could lead to deadlocks.
      This patch implements a sb_getblk_gfp with the only difference it can
      accept user-provided GFP flags.
      Signed-off-by: default avatarNikolay Borisov <kernel@kyup.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  5. 01 Jul, 2015 16 commits
  6. 29 Jun, 2015 2 commits
  7. 28 Jun, 2015 1 commit
  8. 27 Jun, 2015 1 commit
  9. 26 Jun, 2015 9 commits
    • Thomas Gleixner's avatar
      genirq: Implement irq_set_handler_locked()/irq_set_chip_handler_name_locked() · bbc9d21f
      Thomas Gleixner authored
      The main use case for the exisiting __irq_set_*_locked() inlines is to
      replace the handler [,chip and name] of an interrupt from a region
      which has the irq descriptor lock held, e.g. from the irq_set_type()
      callback. The first argument is the irq number, so the functions need
      so perform a pointless lookup of the interrupt descriptor for those
      cases which have the irq_data pointer handy.
      Provide new functions which take an irq_data pointer instead of the
      interrupt number, so the lookup of the interrupt descriptor can be
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
    • Jiang Liu's avatar
      genirq: Introduce helper irq_desc_get_irq() · 304adf8a
      Jiang Liu authored
      Introduce helper irq_desc_get_irq() to retrieve the irq number from
      the irq descriptor.
      Signed-off-by: default avatarJiang Liu <jiang.liu@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/1433391238-19471-17-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
    • Ross Zwisler's avatar
      arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952
      Ross Zwisler authored
      Based on an original patch by Ross Zwisler [1].
      Writes to persistent memory have the potential to be posted to cpu
      cache, cpu write buffers, and platform write buffers (memory controller)
      before being committed to persistent media.  Provide apis,
      memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
      pmem and assert that it is durable in PMEM (a persistent linear address
      range).  A '__pmem' attribute is added so sparse can track proper usage
      of pointers to pmem.
      This continues the status quo of pmem being x86 only for 4.2, but
      reworks to ioremap, and wider implementation of memremap() will enable
      other archs in 4.3.
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      [djbw: various reworks]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Toshi Kani's avatar
      libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3
      Toshi Kani authored
      Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
      under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.
      An example of numa_node values on a 2-socket system with a single
      NVDIMM range on each socket is shown below.
        |-- btt0.0/numa_node:0
        |-- btt1.0/numa_node:1
        |-- btt1.1/numa_node:1
        |-- namespace0.0/numa_node:0
        |-- namespace1.0/numa_node:1
        |-- region0/numa_node:0
        |-- region1/numa_node:1
      These numa_node files are then linked under the block class of
      their device names.
      This enables numactl(8) to accept 'block:' and 'file:' paths of
      pmem and btt devices as shown in the examples below.
        numactl --preferred block:pmem0 --show
        numactl --preferred file:/dev/pmem1s --show
      Signed-off-by: default avatarToshi Kani <toshi.kani@hp.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Toshi Kani's avatar
      libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani authored
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: default avatarToshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Toshi Kani's avatar
      acpi: Add acpi_map_pxm_to_online_node() · 99759869
      Toshi Kani authored
      The kernel initializes CPU & memory's NUMA topology from ACPI
      SRAT table.  Some other ACPI tables, such as NFIT and DMAR, also
      contain proximity IDs for their device's NUMA topology.  This
      information can be used to improve performance of these devices.
      This patch introduces acpi_map_pxm_to_online_node(), which is
      similar to acpi_map_pxm_to_node(), but always returns an online
      node.  When the mapped node from a given proximity ID is offline,
      it looks up the node distance table and returns the nearest
      online node.
      ACPI device drivers, which are called after the NUMA initialization
      has completed in the kernel, can call this interface to obtain their
      device NUMA topology from ACPI tables.  Such drivers do not have to
      deal with offline nodes.  A node may be offline when a device
      proximity ID is unique, SRAT memory entry does not exist, or NUMA is
      disabled, ex. "numa=off" on x86.
      This patch also moves the pxm range check from acpi_get_node() to
      Signed-off-by: default avatarToshi Kani <toshi.kani@hp.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com&gt;>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Dan Williams's avatar
      libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams authored
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Ross Zwisler's avatar
      libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler authored
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
    • Vishal Verma's avatar
      nd_btt: atomic sector updates · 5212e11f
      Vishal Verma authored
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      The BTT works as a stacked blocked device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to use the cpu number we're scheduled on to and hash it to
      a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: default avatarVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>