1. 28 Apr, 2008 37 commits
    • mempolicy: add bitmap_onto() and bitmap_fold() operations · 7ea931c9
      Paul Jackson authored
      
      
      The following adds two more bitmap operators, bitmap_onto() and bitmap_fold(),
      with the usual cpumask and nodemask wrappers.
      
      The bitmap_onto() operator computes one bitmap relative to another.  If the
      n-th bit in the origin mask is set, then the m-th bit of the destination mask
      will be set, where m is the position of the n-th set bit in the relative mask.
      
      The bitmap_fold() operator folds a bitmap into a second that has bit m set iff
      the input bitmap has some bit n set, where m == n mod sz, for the specified sz
      value.
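
      A minimal sketch of the bitmap_onto() mapping described above, reduced to a
      single 64-bit word for illustration.  The name onto64() and the one-word
      representation are assumptions made here; the kernel operator works on
      multi-word bitmaps with an explicit bit count.

```c
#include <stdint.h>

/*
 * Hypothetical single-word analogue of bitmap_onto(): if bit n of 'orig'
 * is set, set bit m of the result, where m is the position of the n-th
 * (counting from 0) set bit of 'relmap'.  Bits of 'orig' beyond the
 * weight of 'relmap' have nothing to map onto and are dropped.
 */
static uint64_t onto64(uint64_t orig, uint64_t relmap)
{
	uint64_t dst = 0;
	unsigned int n = 0;	/* ordinal of the current set bit in relmap */
	unsigned int m;

	for (m = 0; m < 64; m++) {
		if (!(relmap & (UINT64_C(1) << m)))
			continue;
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << m;
		n++;
	}
	return dst;
}
```

      With relmap holding bits 30-39 and orig holding bits 1, 3, 5, 7, 9 and 11,
      the result holds bits 31, 33, 35, 37 and 39; bit 11 of orig is dropped
      because relmap has only ten set bits.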
      
      There are two substantive changes between this patch and its
      predecessor bitmap_relative:
       1) Renamed bitmap_relative() to be bitmap_onto().
       2) Added bitmap_fold().
      
      The essential motivation for bitmap_onto() is to provide a mechanism for
      converting a cpuset-relative CPU or Node mask to an absolute mask.  Cpuset
      relative masks are written as if the current task were in a cpuset whose CPUs
      or Nodes were just the consecutive ones numbered 0..N-1, for some N.  The
      bitmap_onto() operator is provided in anticipation of adding support for the
      first such cpuset relative mask, by the mbind() and set_mempolicy() system
      calls, using a planned flag of MPOL_F_RELATIVE_NODES.  These bitmap operators
      (and their nodemask wrappers, in particular) will be used in code that
      converts the user specified cpuset relative memory policy to a specific system
      node numbered policy, given the current mems_allowed of the task's cpuset.
      
      Such cpuset relative mempolicies will address two deficiencies
      of the existing interface between cpusets and mempolicies:
       1) A task cannot at present reliably establish a cpuset
          relative mempolicy because there is an essential race
          condition, in that the task's cpuset may be changed in
          between the time the task can query its cpuset placement,
          and the time the task can issue the applicable mbind or
          set_mempolicy system call.
       2) A task cannot at present establish what cpuset relative
          mempolicy it would like to have, if it is in a smaller
          cpuset than it might have mempolicy preferences for,
          because the existing interface only allows specifying
          mempolicies for nodes currently allowed by the cpuset.
      
      Cpuset relative mempolicies are useful for tasks that don't distinguish
      particularly between one CPU or Node and another, but only between how many of
      each are allowed, and the proper placement of threads and memory pages on the
      various CPUs and Nodes available.
      
      The motivation for the added bitmap_fold() can be seen in the following
      example.
      
      Let's say an application has specified some mempolicies that presume 16 memory
      nodes, including say a mempolicy that specified MPOL_F_RELATIVE_NODES (cpuset
      relative) nodes 12-15.  Then let's say that application is crammed into a
      cpuset that only has 8 memory nodes, 0-7.  If one just uses bitmap_onto(),
      this mempolicy, mapped to that cpuset, would ignore the requested relative
      nodes above 7, leaving it empty of nodes.  That's not good; better to fold the
      higher nodes down, so that some nodes are included in the resulting mapped
      mempolicy.  In this case, the mempolicy nodes 12-15 are taken modulo 8 (the
      weight of the mems_allowed of the confining cpuset), resulting in a mempolicy
      specifying nodes 4-7.
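
      The fold step in this example can be sketched on a single 64-bit word.
      The name fold64() and the one-word representation are assumptions for
      illustration only; the kernel operator works on multi-word bitmaps.

```c
#include <stdint.h>

/*
 * Hypothetical single-word analogue of bitmap_fold(): bit m of the
 * result is set iff some bit n of 'orig' is set with m == n mod sz.
 */
static uint64_t fold64(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;
	unsigned int n;

	if (sz == 0 || sz > 64)
		return 0;	/* no meaningful fold; a choice made for this sketch */

	for (n = 0; n < 64; n++)
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << (n % sz);
	return dst;
}
```

      Folding relative nodes 12-15 (0xF000) with sz 8 yields nodes 4-7 (0xF0),
      which can then be mapped onto the cpuset's eight allowed nodes.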
      
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: <kosaki.motohiro@jp.fujitsu.com>
      Cc: <ray-lk@madrabbit.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      David Rientjes authored
      
      
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	};
      
      that stores the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side-effect in that it allows the mempolicy
      "intent" to lie dormant and unaffected until it has access to the node(s) that
      it desires.  For example, if you currently ask for an interleaved policy over
      a set of nodes that you do not have access to, the mempolicy is not created
      and the task continues to use the previous policy.  With this change, however,
      it is possible to create the same mempolicy; it only takes effect when access
      to nodes in the nodemask is acquired.
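
      The rebind rule for a static-nodes policy can be sketched as a plain
      intersection.  The function names and the single 64-bit word standing in
      for nodemask_t are hypothetical, chosen only to illustrate the "dormant
      intent" behavior described above.

```c
#include <stdint.h>

/*
 * Hypothetical: nodes a static-nodes policy uses after a rebind, i.e.
 * the user's original nodemask intersected with the currently allowed
 * nodes.  The user's mask itself is never remapped.
 */
static uint64_t ex_static_effective_nodes(uint64_t user_nodemask,
					  uint64_t allowed)
{
	return user_nodemask & allowed;
}

/* Hypothetical: the policy lies dormant while the intersection is empty. */
static int ex_static_policy_dormant(uint64_t user_nodemask, uint64_t allowed)
{
	return ex_static_effective_nodes(user_nodemask, allowed) == 0;
}
```

      A policy created for nodes 1-3 (0xE) while only node 0 is allowed is
      dormant; once the allowed set becomes nodes 2-3 (0xC), it applies to
      exactly those two nodes.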
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
      	mount -o remount,mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new() since it
      is now obsoleted.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: support optional mode flags · 028fec41
      David Rientjes authored
      
      
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	};
      
      The splitting of the 'int' actual passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
      there are additional flags, and storing it in the new 'flags' member of struct
      mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
      the 'policy' member of the struct and all current users of pol->policy remain
      unchanged.
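
      A rough sketch of that split, written as a standalone helper.  The EX_
      constants, the bit position of the flag, and the helper name are
      assumptions for illustration; they are not taken from the patch itself.

```c
#include <errno.h>

/* Illustrative stand-ins for the mode enum and the optional flag mask. */
enum {
	EX_MPOL_DEFAULT,
	EX_MPOL_PREFERRED,
	EX_MPOL_BIND,
	EX_MPOL_INTERLEAVE,
	EX_MPOL_MAX,		/* one past the last valid mode */
};
#define EX_MPOL_F_STATIC_NODES	(1 << 15)	/* bit chosen for this sketch */
#define EX_MPOL_MODE_FLAGS	(EX_MPOL_F_STATIC_NODES)

/*
 * Split the user's 'int policy' into a mode and optional mode flags.
 * Any bit outside EX_MPOL_MODE_FLAGS stays in the mode part, so an
 * unknown flag makes the mode invalid and the call is rejected.
 */
static int ex_split_policy(int policy, unsigned short *mode,
			   unsigned short *flags)
{
	unsigned int raw = (unsigned int)policy;
	unsigned int f = raw & EX_MPOL_MODE_FLAGS;
	unsigned int m = raw & ~(unsigned int)EX_MPOL_MODE_FLAGS;

	if (m >= EX_MPOL_MAX)
		return -EINVAL;

	*mode = (unsigned short)m;
	*flags = (unsigned short)f;
	return 0;
}
```

      Userspace that starts passing optional flags can recover the bare mode
      the same way, by masking the flags off before a switch or comparison.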
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such applications would need to use optional mode flags when
      calling set_mempolicy() or mbind() for these previously implemented statements
      to stop working.  If an application does start using optional mode flags, it
      will need to mask the optional flags off the policy in switch and conditional
      statements that only test mode.
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: convert MPOL constants to enum · a3b51e01
      David Rientjes authored
      
      
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values are those in the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move cache_line_size() to <linux/cache.h> · 1b27d05b
      Pekka Enberg authored
      
      
      Not all architectures define cache_line_size() so, as suggested by Andrew, move
      the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages · 19fc3f0a
      Adam Litke authored
      
      
      To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
      pages, scan the page list in two parts.  First, transfer the needed pages to
      the hugetlb pool.  Then drop the lock and free the remaining pages back to the
      buddy allocator.
      
      In the common case there are zero excess pages and no lock operations are
      required.
      
      Thanks to Mel Gorman for this improvement.
      
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: try both endianess when checking for endianess · 797df574
      Chris Dearman authored
      
      
      When checking for the swap header try byteswapping the endianness-dependent
      fields to allow the swap partition to be shared between big & little endian
      systems.
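
      A sketch of the idea under stated assumptions: the struct layout, the
      field names, and the use of a version value of 1 as the validity check
      are illustrative choices here, not details confirmed by this message.

```c
#include <stdint.h>

/* Hypothetical subset of the endianness-dependent swap header fields. */
struct ex_swap_info {
	uint32_t version;
	uint32_t last_page;
	uint32_t nr_badpages;
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t ex_swab32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/*
 * If the version only makes sense after a byteswap, the header was
 * written by a system of the opposite endianness, so swap every
 * endianness-dependent field in place.  Returns 1 if the header is
 * usable afterwards, 0 otherwise.
 */
static int ex_fix_swap_header(struct ex_swap_info *info)
{
	if (ex_swab32(info->version) == 1) {
		info->version = ex_swab32(info->version);
		info->last_page = ex_swab32(info->last_page);
		info->nr_badpages = ex_swab32(info->nr_badpages);
	}
	return info->version == 1;
}
```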
      
      Signed-off-by: Chris Dearman <chris@mips.com>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman authored
      
      
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
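
      The combined filtering can be sketched as a walk over a distance-ordered
      list that skips entries failing either test.  The struct, the function
      name, and the one-word nodemask are hypothetical simplifications, not
      the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zonelist entry: which node it is on and its zone index. */
struct ex_zoneref {
	int node;	/* must be < 64 in this sketch */
	int zone_idx;
};

/*
 * Return the first entry of a distance-ordered zonelist whose zone index
 * does not exceed 'highest_zoneidx' and whose node is in 'nodemask'.
 * A NULL nodemask means every node is acceptable.
 */
static const struct ex_zoneref *
ex_first_allowed(const struct ex_zoneref *zl, size_t n,
		 int highest_zoneidx, const uint64_t *nodemask)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (zl[i].zone_idx > highest_zoneidx)
			continue;
		if (nodemask && !(*nodemask & (UINT64_C(1) << zl[i].node)))
			continue;
		return &zl[i];
	}
	return NULL;
}
```

      Because the list itself stays in distance order, the first match is also
      the closest allowed zone, which is the benefit described above.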
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman authored
      
      
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a subtraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if accessing zone->node turns out to be a
      significant cost, which may be the case on workloads where nodemasks are
      heavily used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman authored
      
      
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro called for_each_zone_zonelist() is introduced that iterates
      through each zone allowed by the GFP flags in the selected zonelist.
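
      One way such a filtering iterator could look, as a simplified sketch.
      The ex_ types, the NULL-terminated pointer array, and the helper are
      assumptions for illustration; the real macro's arguments and internals
      may differ.

```c
#include <stddef.h>

/* Hypothetical simplified zone and zonelist types. */
struct ex_zone {
	int idx;	/* zone type index */
	int node;
};

struct ex_zonelist {
	struct ex_zone *zones[8];	/* NULL-terminated, ordered by distance */
};

/* Advance to the next entry whose zone index is allowed, or to the NULL end. */
static struct ex_zone **ex_next_zone(struct ex_zone **z, int highidx)
{
	while (*z && (*z)->idx > highidx)
		z++;
	return z;
}

/* Visit each zone in 'zlist' whose index does not exceed 'highidx'. */
#define ex_for_each_zone_zonelist(zone, z, zlist, highidx)		\
	for ((z) = ex_next_zone((zlist)->zones, (highidx)), (zone) = *(z); \
	     (zone);							\
	     (z) = ex_next_zone((z) + 1, (highidx)), (zone) = *(z))

/* Usage example: count the zones an allocation may use. */
static int ex_count_allowed(struct ex_zonelist *zl, int highidx)
{
	struct ex_zone **z;
	struct ex_zone *zone;
	int count = 0;

	ex_for_each_zone_zonelist(zone, z, zl, highidx)
		count++;
	return count;
}
```

      Keeping the skip logic in a helper lets one shared zonelist serve every
      GFP zone type, which is what allows the per-zone-type lists to go away.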
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman authored
      
      
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman authored
      
      
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman authored
      
      
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The fourth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the ranges of performance losses/gains when running against
      2.6.24-rc4-mm1.  The test machines are a mix of i386, x86_64 and ppc64,
      both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make swap_pte_to_pagemap_entry() static · 9d02dbc8
      Adrian Bunk authored
      
      
      Make the needlessly global swap_pte_to_pagemap_entry() static.
      
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove nopage · 3c18ddd1
      Nick Piggin authored
      
      
      Nothing in the tree uses nopage any more.  Remove support for it in the
      core mm code and documentation (and a few stray references to it in
      comments).
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmap_region: cleanup the final vma_merge() related code · 4d3d5b41
      Oleg Nesterov authored
      
      
      The "if (!file || !vma_merge())" code is not easy to understand; turn it
      into "if (file && vma_merge())".  This makes it immediately obvious that
      the subsequent "if (file)" is superfluous.
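
      The two conditions are logical negations of each other (De Morgan), so
      swapping the branches preserves behavior.  A small sketch with
      hypothetical helpers, where 'merged' stands in for the result of
      vma_merge() and the return value says whether the new vma gets linked:

```c
#include <stdbool.h>

/* Old shape: link the new vma unless a file-backed merge succeeded. */
static bool ex_links_vma_old(bool file, bool merged)
{
	if (!file || !merged)
		return true;	/* keep and link the new vma */
	else
		return false;	/* merged: drop the new vma */
}

/* New shape: the merge case comes first and 'file' is known non-NULL in it. */
static bool ex_links_vma_new(bool file, bool merged)
{
	if (file && merged)
		return false;	/* merged: drop the new vma */
	return true;		/* keep and link the new vma */
}
```

      In the rewritten form the merge branch is only entered with a non-NULL
      file, which is why an inner "if (file)" test there can never fail.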
      
      As Hugh Dickins pointed out, we can also factor out the ->i_writecount
      corrections, and add a small comment about that.
      
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fix invalidate_inode_pages2_range() to not clear ret · 0dd1334f
      Hisashi Hifumi authored
      
      
      DIO invalidates page cache through invalidate_inode_pages2_range().
      invalidate_inode_pages2_range() sets ret=-EIO when
      invalidate_complete_page2() fails, but this ret is cleared if
      do_launder_page() succeeds on a page at a later index.
      
      In this case, DIO is carried out even if invalidate_complete_page2() fails
      on some pages.
      
      This can cause inconsistency between memory and blocks on HDD because the
      page cache still exists.
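
      The bug pattern and one way to avoid it can be sketched with simulated
      per-page outcomes.  The struct and function names are hypothetical, and
      the second function shows only the general "keep the error sticky"
      approach; it is not claimed to be the exact code of this fix.

```c
#include <errno.h>
#include <stddef.h>

/* Simulated outcome of processing one page (hypothetical). */
struct ex_page {
	int launder_err;	/* 0 or negative errno, stands in for do_launder_page() */
	int invalidate_ok;	/* nonzero if invalidate_complete_page2() would succeed */
};

/* Buggy shape: 'ret' is overwritten on every page, so an earlier -EIO is lost. */
static int ex_invalidate_range_buggy(const struct ex_page *pages, size_t n)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = pages[i].launder_err;
		if (ret == 0 && !pages[i].invalidate_ok)
			ret = -EIO;
	}
	return ret;
}

/* Sticky shape: use a per-page result and only ever copy failures into 'ret'. */
static int ex_invalidate_range_sticky(const struct ex_page *pages, size_t n)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int ret2 = pages[i].launder_err;

		if (ret2 == 0 && !pages[i].invalidate_ok)
			ret2 = -EIO;
		if (ret2 < 0)
			ret = ret2;
	}
	return ret;
}
```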
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove sparse warning for mmzone.h · ddc81ed2
      Harvey Harrison authored
      
      
      include/linux/mmzone.h:640:22: warning: potentially expensive pointer subtraction
      
      Calculate the offset into the node_zones array rather than the index
      using casts to (char *) and comparing against the index * sizeof(struct zone).
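
      A sketch of the comparison style described above.  The ex_ types are
      hypothetical; the 2048-byte zone size is chosen only because it matches
      the shift by 0xb and the 0x1000/0x1800 constants in the disassembly
      below.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real structs are far larger and richer. */
struct ex_zone {
	char pad[2048];
};

struct ex_pgdat {
	struct ex_zone node_zones[4];
};

/*
 * Test whether 'zone' is entry 'idx' of pgdat->node_zones by comparing
 * byte offsets.  Subtracting struct pointers would divide the byte
 * difference by sizeof(struct ex_zone); comparing the byte offset with
 * idx * sizeof(struct ex_zone) needs only a constant on the right side.
 */
static int ex_zone_is_index(const struct ex_pgdat *pgdat,
			    const struct ex_zone *zone, size_t idx)
{
	size_t off = (size_t)((const char *)zone -
			      (const char *)pgdat->node_zones);

	return off == idx * sizeof(struct ex_zone);
}
```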
      
      On X86_32 this saves a sar, but code size increases by one byte per
      is_highmem() use due to 32-bit cmps rather than 16 bit cmps.
      
      Before:
       207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
       20d:   c1 f8 0b                sar    $0xb,%eax
       210:   83 f8 02                cmp    $0x2,%eax
       213:   74 16                   je     22b <kmap_atomic_prot+0x144>
       215:   83 f8 03                cmp    $0x3,%eax
       218:   0f 85 8f 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
       21e:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
       225:   0f 85 82 00 00 00       jne    2ad <kmap_atomic_prot+0x1c6>
       22b:   64 a1 00 00 00 00       mov    %fs:0x0,%eax
      
      After:
       207:   2b 80 8c 07 00 00       sub    0x78c(%eax),%eax
       20d:   3d 00 10 00 00          cmp    $0x1000,%eax
       212:   74 18                   je     22c <kmap_atomic_prot+0x145>
       214:   3d 00 18 00 00          cmp    $0x1800,%eax
       219:   0f 85 8f 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
       21f:   83 3d 00 00 00 00 02    cmpl   $0x2,0x0
       226:   0f 85 82 00 00 00       jne    2ae <kmap_atomic_prot+0x1c7>
       22c:   64 a1 00 00 00 00       mov    %fs:0x0,%eax
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove set_migrateflags() · 488514d1
      Christoph Lameter authored
      
      
      Migrate flags must be set on slab creation as agreed upon when the antifrag
      logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
      unmovable and others in the reclaimable section depending on which flag was
      active when a new slab page was allocated.
      
      This likely slid in somehow when antifrag was merged. Remove it.
      
      The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
      SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
      effect there.
      
      Radix tree allocations are not directly reclaimable but they are allocated
      with __GFP_RECLAIMABLE set on each allocation.  We now set
      SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
      tree slabs are consistently placed in the reclaimable section.  Radix tree
      slabs will also be accounted as such.
      
      There is then no user of set_migrateflags() left, so remove it.
      
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • aio: io_getevents() should return if io_destroy() is invoked · e92adcba
      Jeff Moyer authored
      
      
      This patch wakes up a thread waiting in io_getevents if another thread
      destroys the context.  This was tested using a small program that spawns a
      thread to wait in io_getevents while the parent thread destroys the io context
      and then waits for the getevents thread to exit.  Without this patch, the
      program hangs indefinitely.  With the patch, the program exits as expected.
      
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Christopher Smith <x@xman.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hotplug-memory: make online_page() common · 180c06ef
      Jeremy Fitzhardinge authored
      
      
      All architectures use an effectively identical definition of online_page(), so
      just make it common code.  x86-64, ia64, powerpc and sh are actually
      identical; x86-32 is slightly different.
      
      x86-32's differences arise because it puts its hotplug pages in the highmem
      zone.  We can handle this in the generic code by inspecting the page to see if
      it's in highmem, and updating the totalhigh_pages count appropriately.  This
      leaves init_32.c:free_new_highpage with a single caller, so I folded it into
      add_one_highpage_init.
      
      I also removed an incorrect comment referring to the NUMA case; any NUMA
      details have already been dealt with by the time online_page() is called.
      
      [akpm@linux-foundation.org: fix indenting]
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
      Tested-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      180c06ef
    • Badari Pulavarty's avatar
      hotplug memory remove: generic __remove_pages() support · ea01ea93
      Badari Pulavarty authored
      
      
      Generic helper function to remove section mappings and sysfs entries for the
      section of the memory we are removing.  offline_pages() correctly adjusted
      zone and marked the pages reserved.
      
      TODO: Yasunori Goto is working on patches to free up allocations from bootmem.
      
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea01ea93
    • Harvey Harrison's avatar
      rtc: replace remaining __FUNCTION__ occurrences · 2a4e2b87
      Harvey Harrison authored
      
      
      __FUNCTION__ is gcc-specific; use __func__ instead.
      
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a4e2b87
    • Julia Lawall's avatar
      drivers/char/rtc.c: use time_before, time_before_eq, etc · dca03a51
      Julia Lawall authored
      The functions time_before, time_before_eq, time_after, and time_after_eq
      are more robust for comparing jiffies against other values.
      
      A simplified version of the semantic patch making this change is as follows:
      (http://www.emn.fr/x-info/coccinelle/)
      
      // <smpl>
      @ change_compare_np @
      expression E;
      @@
      
      (
      - jiffies <= E
      + time_before_eq(jiffies,E)
      |
      - jiffies >= E
      + time_after_eq(jiffies,E)
      |
      - jiffies < E
      + time_before(jiffies,E)
      |
      - jiffies > E
      + time_after(jiffies,E)
      )
      
      @ include depends on change_compare_np @
      @@
      
      #include <linux/jiffies.h>
      
      @ no_include depends on !include && change_compare_np @
      @@
      
        #include <linux/...>
      + #include <linux/jiffies.h>
      // </smpl>
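
      Why the helper functions are more robust can be seen in a small user-space
      sketch.  The sketch_* functions below are hypothetical stand-ins for the
      kernel macros, not the kernel definitions themselves; they assume the usual
      approach of subtracting in unsigned arithmetic and reading the difference as
      a signed value, which stays correct when the counter wraps around.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-ins for time_after() and friends.
 * The unsigned subtraction is well defined on wraparound; the result is
 * then interpreted as a signed distance between the two tick counts.
 */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;      /* true if a is later than b */
}

static bool sketch_time_before(unsigned long a, unsigned long b)
{
    return sketch_time_after(b, a);
}

static bool sketch_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;     /* true if a is later than or equal to b */
}

static bool sketch_time_before_eq(unsigned long a, unsigned long b)
{
    return sketch_time_after_eq(b, a);
}

/*
 * Returns 1 when the naive comparison gives the wrong answer near the
 * wrap point while the wrap-safe comparison gives the right one.
 */
static int wrap_demo(void)
{
    unsigned long now = ULONG_MAX - 5;   /* counter just before wrapping */
    unsigned long timeout = now + 10;    /* ten ticks later, wraps to 4 */

    assert(timeout == 4);
    return !(now < timeout) && sketch_time_before(now, timeout);
}
```

      With a counter just below its maximum and a timeout ten ticks later, the
      plain "now < timeout" test reports that the timeout has already passed,
      while the wrap-safe comparison does not.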
      
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dca03a51
    • Zhao Yakui's avatar
      rtc: add the support for alarm time relative to current time in sysfs · c116bc2a
      Zhao Yakui authored
      
      
      In the current kernel, setting the alarm time requires writing the absolute
      time (in seconds relative to 1970-01-01 00:00:00) into
      /sys/class/rtc/rtc0/wakealarm.  This is not convenient.
      
      It is more reasonable to add support for an alarm time relative to the
      current RTC time (the unit is seconds).
      
      For example:
      If the RTC is required to generate an alarm 2 minutes from now, either of
      the following will work:
      	echo +120 > /sys/class/rtc/rtc0/wakealarm
      or      echo +0x78 > /sys/class/rtc/rtc0/wakealarm
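
      The parsing rule described above can be sketched in user space as follows.
      The function name sketch_parse_wakealarm() and its signature are
      hypothetical and only illustrate the idea: a leading '+' marks the value as
      relative to the current RTC time, and base 0 lets both decimal and 0x-prefixed
      hexadecimal input be accepted.  This is not the code from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert the string written to wakealarm into an
 * absolute alarm time in seconds, given the current RTC time `now`
 * (also in seconds).
 */
static unsigned long sketch_parse_wakealarm(const char *buf, unsigned long now)
{
    int relative = 0;
    unsigned long value;

    assert(buf != NULL);

    /* A leading '+' means "seconds from now" rather than an absolute time. */
    if (*buf == '+') {
        buf++;
        relative = 1;
    }

    /* Base 0 accepts decimal ("120") as well as hexadecimal ("0x78"). */
    value = strtoul(buf, NULL, 0);

    return relative ? now + value : value;
}
```

      Under these assumptions, "+120" and "+0x78" both yield the current time
      plus 120 seconds, while a value without the '+' is taken as an absolute
      time.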
      
      Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
      Signed-off-by: Zhang Rui <rui.zhang@intel.com>
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c116bc2a
    • Paul Mundt's avatar
      rtc: rtc-rs5c372: fix up NULL name in transfer error path · e2bfe342
      Paul Mundt authored
      
      
      rs5c_get_regs() currently uses rs5c->rtc->name for its debug printk when
      i2c_transfer() fails, though it is used several times before the rtc dev
      has been registered. The earliest we can get at the symbolic name is via
      the i2c client's struct device, which can be handled by moving the first
      rs5c_get_regs() until after the client pointer is assigned.
      
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2bfe342
    • David Brownell's avatar
      kerneldoc for <linux/clk.h> · e275ac47
      David Brownell authored
      
      
      Add <linux/clk.h> to the generated kerneldoc, with some overview
      to go along with those per-function descriptions.
      
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e275ac47
    • Adrian Bunk's avatar
      make ds1511_rtc_{read,set}_time() static · a3ed107e
      Adrian Bunk authored
      
      
      Make the needlessly global ds1511_rtc_{read,set}_time() static.
      
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3ed107e
    • Sam Ravnborg's avatar
      rtc: silence section mismatch warning in rtc-test · c4646528
      Sam Ravnborg authored
      
      
      Fix the following warning:
      WARNING: vmlinux.o(.data+0x253e28): Section mismatch in reference from the variable test_drv to the function .devexit.text:test_remove()
      
      Fix by renaming the platform_driver variable from *_drv to *_driver
      so modpost ignores the reference to a __devexit section.
      
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4646528
    • Alessandro Zummo's avatar
      rtc-x1205: new style conversion · 4edac2b4
      Alessandro Zummo authored
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4edac2b4
    • Alessandro Zummo's avatar
      rtc-pcf8563: new style conversion · e5fc9cc0
      Alessandro Zummo authored
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5fc9cc0
    • Alessandro Zummo's avatar
      rtc-isl1208: new style conversion and minor bug fixes · 9edae7bc
      Alessandro Zummo authored
      
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Herbert Valerio Riedel <hvr@gnu.org>
      Cc: David Brownell <david-b@pacbell.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9edae7bc
    • David Brownell's avatar
      rtc: avoid legacy drivers with generic framework · c7500900
      David Brownell authored
      
      
      Kconfig tweaks to help reduce RTC configuration bugs, by avoiding
      legacy RTC drivers when the generic RTC framework is enabled:
      
       - If rtc-cmos is selected, disable the legacy rtc driver;
      
       - When using generic RTC on x86, enable rtc-cmos by default;
      
       - In the old "chardev RTC" section of Kconfig, add a comment
         warning people off these (seven) legacy RTC drivers when
         the generic framework is in use.
      
      People can still use the legacy drivers if they want (or need) to.
      
      This doesn't fix the broken dependencies for the legacy "CMOS" RTC driver.
      Ideally it would be a full list of platforms where it works, not a partial
      list of ones where it won't.  Or better yet, it would depend on a
      "HAVE_CMOS_RTC" flag defined by various platforms ...  surely there's a
      Kconfig style guideline lurking there.
      
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Acked-by: Alessandro Zummo <a.zummo@towertech.it>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7500900
    • David Brownell's avatar
      rtc-pcf8583 build fix · 77459b05
      David Brownell authored
      
      
      Fix bogus #include in rtc-pcf8583, so it compiles on platforms that
      don't support PC clone RTCs.  (Original issue noted by Adrian Bunk.)
      
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Adrian Bunk <bunk@kernel.org>
      Acked-by: Alessandro Zummo <a.zummo@towertech.it>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77459b05
    • Roel Kluin's avatar
      dz: test after postfix decrement fails in dz_console_putchar() · 1ecf0d0c
      Roel Kluin authored
      
      
      When loops reaches 0 the postfix decrement still subtracts, so the subsequent
      test fails.
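
      The pitfall can be reproduced with a small stand-alone sketch.  The two
      functions below are hypothetical and model only the loop counter of a
      polling loop whose device never becomes ready; they are not the driver code.
      The point is that a postfix decrement in the loop condition leaves the
      counter at -1 on timeout, so a later test against 0 does not detect it.

```c
#include <assert.h>

/*
 * Hypothetical model of a bounded polling loop using postfix decrement.
 * The condition is evaluated one last time when loops is 0, and the
 * decrement still happens, so the counter ends at -1 on timeout.
 */
static int sketch_final_count_postfix(int loops)
{
    assert(loops > 0);
    while (loops--) {
        /* device never becomes ready in this model */
    }
    return loops;
}

/*
 * Same loop with prefix decrement: the counter ends at exactly 0 on
 * timeout, so a following "did we run out of retries" test on 0 works.
 */
static int sketch_final_count_prefix(int loops)
{
    assert(loops > 0);
    do {
        /* device never becomes ready in this model */
    } while (--loops);
    return loops;
}
```

      In this model a check such as "if (!loops)" after the postfix loop never
      fires on timeout, whereas the same check after the prefix loop does.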
      
      Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
      Acked-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ecf0d0c
    • Johannes Weiner's avatar
      mm: fix possible off-by-one in walk_pte_range() · 556637cd
      Johannes Weiner authored
      
      
      After the loop in walk_pte_range() pte might point to the first address after
      the pmd it walks.  The pte_unmap() is then applied to something bad.
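
      The off-by-one can be illustrated with a simplified sketch that tracks only
      the cursor position.  Both functions are hypothetical models using array
      indices in place of pte pointers and a count n in place of the address range;
      the loop shapes are assumptions chosen to show the problem, not a copy of the
      kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of a walk that advances the cursor before testing for the end.
 * After visiting n entries the cursor is at index n, one past the last
 * entry that was actually visited.
 */
static size_t sketch_walk_advance_then_test(size_t n)
{
    size_t cur = 0, pos = 0;

    assert(n > 0);
    do {
        /* visit entry at index cur */
    } while (cur++, ++pos != n);
    return cur;
}

/*
 * Model of a walk that tests for the end before advancing the cursor.
 * The cursor stops on the last visited entry, so it is still a valid
 * argument for a cleanup call that expects a visited entry.
 */
static size_t sketch_walk_test_then_advance(size_t n)
{
    size_t cur = 0, pos = 0;

    assert(n > 0);
    for (;;) {
        /* visit entry at index cur */
        if (++pos == n)
            break;
        cur++;
    }
    return cur;
}
```

      When the walked range ends exactly at the end of the table, the first
      shape leaves the cursor outside the table, which is the situation the
      commit message describes for pte_unmap().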
      
      Spotted by Roel Kluin and Andreas Schwab.
      
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Cc: Roel Kluin <12o3l@tiscali.nl>
      Cc: Andreas Schwab <schwab@suse.de>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Acked-by: Mikael Pettersson <mikpe@it.uu.se>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      556637cd
    • Ingo Molnar's avatar
      x86: PAT fix · f022bfd5
      Ingo Molnar authored
      Adrian Bunk noticed the following Coverity report:
      
      > Commit e7f260a2
      > (x86: PAT use reserve free memtype in mmap of /dev/mem)
      > added the following gem to arch/x86/mm/pat.c:
      >
      > <--  snip  -->
      >
      > ...
      > int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
      >                                 unsigned long size, pgprot_t *vma_prot)
      > {
      >         u64 offset = ((u64) pfn) << PAGE_SHIFT;
      >         unsigned long flags = _PAGE_CACHE_UC_MINUS;
      >         unsigned long ret_flags;
      > ...
      > ...  (nothing that touches ret_flags)
      > ...
      >         if (flags != _PAGE_CACHE_UC_MINUS) {
      >                 retval = reserve_memtype(offset, offset + size, flags, NULL);
      >         } else {
      >                 retval = reserve_memtype(offset, offset + size, -1, &ret_flags);
      >         }
      >
      >         if (retval < 0)
      >                 return 0;
      >
      >         flags = ret_flags;
      >
      >         if (pfn <= max_pfn_mapped &&
      >             ioremap_change_attr((unsigned long)__va(offset), size, flags) < 0) {
      >                 free_memtype(offset, offset + size);
      >                 printk(KERN_INFO
      >                 "%s:%d /dev/mem ioremap_change_attr failed %s for %Lx-%Lx\n",
      >                         current->comm, current->pid,
      >                         cattr_name(flags),
      >                         offset, offset + size);
      >                 return 0;
      >         }
      >
      >         *vma_prot = __pgprot((pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) |
      >                              flags);
      >         return 1;
      > }
      >
      > <--  snip  -->
      >
      > If (flags != _PAGE_CACHE_UC_MINUS) we pass garbage from the stack to
      > ioremap_change_attr() and/or __pgprot().
      >
      > Spotted by the Coverity checker.
      
      The fix simplifies the code by getting rid of the 'ret_flags'
      complication.
      
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f022bfd5
  2. 27 Apr, 2008 3 commits