1. 02 Jun, 2011 3 commits
  2. 31 May, 2011 1 commit
  3. 30 May, 2011 4 commits
  4. 29 May, 2011 14 commits
    • mm: Fix boot crash in mm_alloc() · 6345d24d
      Linus Torvalds authored
      Thomas Gleixner reports that we now have a boot crash triggered by
          BUG: unable to handle kernel NULL pointer dereference at   (null)
          IP: [<c11ae035>] find_next_bit+0x55/0xb0
          Call Trace:
           [<c11addda>] cpumask_any_but+0x2a/0x70
           [<c102396b>] flush_tlb_mm+0x2b/0x80
           [<c1022705>] pud_populate+0x35/0x50
           [<c10227ba>] pgd_alloc+0x9a/0xf0
           [<c103a3fc>] mm_init+0xec/0x120
           [<c103a7a3>] mm_alloc+0x53/0xd0
      which was introduced by commit de03c72c ("mm: convert
      mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
      mm_init() vs mm_init_cpumask.
      Thomas wrote a patch to just fix the ordering of initialization, but I
      hate the new double allocation in the fork path, so I ended up instead
      doing some more radical surgery to clean it all up.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
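      The ordering problem and the double allocation can be illustrated in
      user space. This is only a sketch under assumptions: `toy_mm` and every
      `toy_*` helper are hypothetical names, not kernel code. It shows one way
      to avoid a second allocation, namely keeping the mask storage inside the
      same object, and it sets up the mask before anything reads it.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_NR_CPUS 64

/* Hypothetical stand-in for mm_struct: the mask storage lives inside the
 * same allocation, so there is no second allocation to order or to fail. */
struct toy_mm {
	unsigned long *cpu_vm_mask;  /* points at mask_storage once initialized */
	unsigned long mask_storage[TOY_NR_CPUS / (8 * sizeof(unsigned long)) + 1];
};

/* Must run before anything dereferences mm->cpu_vm_mask. */
static void toy_mm_init_cpumask(struct toy_mm *mm)
{
	mm->cpu_vm_mask = mm->mask_storage;
	memset(mm->mask_storage, 0, sizeof(mm->mask_storage));
}

/* Stand-in for a path that reads the mask during initialization. */
static int toy_mask_is_empty(const struct toy_mm *mm)
{
	size_t i, n = sizeof(mm->mask_storage) / sizeof(mm->mask_storage[0]);

	for (i = 0; i < n; i++)
		if (mm->cpu_vm_mask[i])
			return 0;
	return 1;
}

static struct toy_mm *toy_mm_alloc(void)
{
	struct toy_mm *mm = calloc(1, sizeof(*mm));

	if (!mm)
		return NULL;
	toy_mm_init_cpumask(mm);        /* mask first ... */
	if (!toy_mask_is_empty(mm)) {   /* ... then code that consults it */
		free(mm);
		return NULL;
	}
	return mm;
}
```

      If the two steps in toy_mm_alloc() were swapped, toy_mask_is_empty()
      would dereference a NULL pointer, which mirrors the reported crash.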
    • pnfs: layoutreturn · cbe82603
      Benny Halevy authored
      NFSv4.1 LAYOUTRETURN implementation
      Currently, does not support layout-type payload encoding.
      Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
      Signed-off-by: Andy Adamson <andros@citi.umich.edu>
      Signed-off-by: Andy Adamson <andros@netapp.com>
      Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
      Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
      Signed-off-by: Fred Isaman <iisaman@netapp.com>
      Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
      Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
      [call pnfs_return_layout right before pnfs_destroy_layout]
      [remove assert_spin_locked from pnfs_clear_lseg_list]
      [remove wait parameter from the layoutreturn path.]
      [remove return_type field from nfs4_layoutreturn_args]
      [remove range from nfs4_layoutreturn_args]
      [no need to send layoutcommit from _pnfs_return_layout]
      [don't wait on sync layoutreturn]
      [fix layout stateid in layoutreturn args]
      [fixed NULL deref in _pnfs_return_layout]
      [removed reclaim member of nfs4_layoutreturn_args]
      Signed-off-by: Benny Halevy <bhalevy@panasas.com>
    • pnfs: support for non-rpc layout drivers · d20581aa
      Benny Halevy authored
      Non-rpc layout drivers, such as those for objects and blocks,
      implement their own I/O path and error handling logic.
      Therefore bypass NFS-based error handling for these layout drivers.
      [fix lseg ref-count bugs, and null de-refs]
      [Fall out from: non-rpc layout drivers]
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      [get rid of PNFS_USE_RPC_CODE]
      [get rid of __nfs4_write_done_cb]
      [revert useless change in nfs4_write_done_cb]
      Signed-off-by: Benny Halevy <bhalevy@panasas.com>
    • pnfs-obj: pnfs_osd XDR definitions · 38b7c401
      Benny Halevy authored
      * Add the pnfs_osd_xdr.h header
      * Define the pnfs_osd_layout structure, including all its
        sub-types and constants.
      * Declare the pnfs_osd_xdr_decode_layout API + all needed
        inline helpers.
      * Define the pnfs_osd_deviceaddr structure and all its subtypes.
      * Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.
      * Define the pnfs_osd_ioerr structure, its substructures and constants.
      * Declare API for encoding of a pnfs_osd_ioerr into XDR stream.
      * Define the pnfs_osd_layoutupdate structure and its substructures.
      * Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.
      [Remove server definitions]
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: Benny Halevy <bhalevy@panasas.com>
    • SUNRPC: introduce xdr_init_decode_pages · f7da7a12
      Benny Halevy authored
      Initialize xdr_stream and xdr_buf using an array of page pointers
      and length of buffer.
      Signed-off-by: Benny Halevy <bhalevy@panasas.com>
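      The idea of initializing a decode stream from an array of page pointers
      plus a length can be sketched in user space. Everything here is an
      assumption for illustration: the `toy_xdr_*` names, the 4096-byte page
      size and the struct layout are hypothetical and are not the SUNRPC API.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */

/* Hypothetical cursor over a buffer that is scattered across pages. */
struct toy_xdr_stream {
	unsigned char **pages;  /* array of page pointers */
	size_t len;             /* total number of valid bytes */
	size_t pos;             /* current decode offset */
};

/* Set up the stream from a page array and the buffer length. */
static void toy_xdr_init_decode_pages(struct toy_xdr_stream *xdr,
				      unsigned char **pages, size_t len)
{
	xdr->pages = pages;
	xdr->len = len;
	xdr->pos = 0;
}

/* Decode one big-endian 32-bit word, which may straddle a page boundary.
 * Returns 0 on success, -1 if fewer than four bytes remain. */
static int toy_xdr_decode_u32(struct toy_xdr_stream *xdr, uint32_t *out)
{
	uint32_t v = 0;
	int i;

	if (xdr->len - xdr->pos < 4)
		return -1;
	for (i = 0; i < 4; i++) {
		size_t off = xdr->pos + (size_t)i;

		v = (v << 8) | xdr->pages[off / TOY_PAGE_SIZE][off % TOY_PAGE_SIZE];
	}
	xdr->pos += 4;
	*out = v;
	return 0;
}
```

      Keeping the length in the stream lets every decode step bounds-check
      itself, so callers do not need to track how much of the last page is
      valid.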
    • dm kcopyd: return client directly and not through a pointer · fa34ce73
      Mikulas Patocka authored
      Return client directly from dm_kcopyd_client_create, not through a
      parameter, making it consistent with dm_io_client_create.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
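      The difference between the two calling conventions can be sketched as
      follows. This is a user-space imitation, not the dm code: `toy_client`
      and the `toy_*` helpers are hypothetical, and the pointer-encoded errno
      helpers only mimic the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idiom.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space imitation of the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int toy_is_err(const void *p)
{
	/* The top TOY_MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_client { int nr_jobs; };

/* Old shape: status code returned, object handed back via a parameter. */
static int toy_client_create_old(struct toy_client **result)
{
	struct toy_client *c = calloc(1, sizeof(*c));

	if (!c)
		return -ENOMEM;
	*result = c;
	return 0;
}

/* New shape: the object, or an encoded errno, is the return value. */
static struct toy_client *toy_client_create(void)
{
	struct toy_client *c = calloc(1, sizeof(*c));

	if (!c)
		return toy_err_ptr(-ENOMEM);
	return c;
}
```

      With the second shape a caller writes `c = toy_client_create();` and
      then checks `toy_is_err(c)`, so there is no output parameter that could
      be left uninitialized on the error path.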
    • dm kcopyd: reserve fewer pages · 5f43ba29
      Mikulas Patocka authored
      Reserve just the minimum number of pages needed to process one job.
      Because we allocate pages from the page allocator, we don't need to reserve
      a large number of pages.  The maximum job size is SUB_JOB_SIZE and we
      calculate the number of reserved pages based on this.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
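      The arithmetic behind "pages needed for one job" can be sketched like
      this. The numbers are assumptions chosen for illustration (512-byte
      sectors, 4096-byte pages, a maximum job of 128 sectors), and the
      `TOY_*` names are hypothetical; the commit text does not state the
      actual values.

```c
#define TOY_SECTOR_SHIFT 9       /* assumed: 512-byte sectors */
#define TOY_PAGE_SIZE    4096u   /* assumed page size in bytes */
#define TOY_SUB_JOB_SIZE 128u    /* assumed maximum job size, in sectors */

#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of whole pages needed to hold a job of `sectors` sectors. */
static unsigned int toy_pages_for_job(unsigned int sectors)
{
	return TOY_DIV_ROUND_UP(sectors << TOY_SECTOR_SHIFT, TOY_PAGE_SIZE);
}
```

      Under these assumptions a maximum-size job is 128 * 512 = 65536 bytes,
      which is exactly 16 pages, so the reserve stays small and bounded.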
    • dm io: use fixed initial mempool size · bda8efec
      Mikulas Patocka authored
      Replace the arbitrary calculation of an initial io struct mempool size
      with a constant.
      The code calculated the number of reserved structures based on the request
      size and used a "magic" multiplication constant of 4.  This patch changes
      it to reserve a fixed number - itself still chosen quite arbitrarily.
      Further testing might show if there is a better number to choose.
      Note that if there is no memory pressure, we can still allocate an
      arbitrary number of "struct io" structures.  One structure is enough to
      process the whole request.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm table: allow targets to support discards internally · 4c259327
      Mike Snitzer authored
      Permit a target to support discards regardless of whether or not all its
      underlying devices do.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • [S390] mm: fix storage key handling · a43a9d93
      Heiko Carstens authored
      page_get_storage_key() and page_set_storage_key() expect a page address
      and not its page frame number. This got inconsistent with 2d42552d
      "[S390] merge page_test_dirty and page_clear_dirty".
      Result is that we read/write storage keys from random pages and do not
      have a working dirty bit tracking at all.
      E.g. SetPageUptodate() doesn't clear the dirty bit of requested pages, which
      for example ext4 doesn't like very much and panics after a while.
      Unable to handle kernel paging request at virtual user address (null)
      Modules linked in:
      CPU: 1 Not tainted 2.6.39-07551-g139f37f5-dirty #152
      Process flush-94:0 (pid: 1576, task: 000000003eb34538, ksp: 000000003c287b70)
      Krnl PSW : 0704c00180000000 0000000000316b12 (jbd2_journal_file_inode+0x10e/0x138)
                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0700000000000000
                 0000000000316a62 000000003eb34cd0 0000000000000025 000000003c287b88
                 0000000000000001 000000003c287a70 000000003f1ec678 000000003f1ec000
                 0000000000000000 000000003e66ec00 0000000000316a62 000000003c287988
      Krnl Code: 0000000000316b04: f0a0000407f4       srp     4(11,%r0),2036,0
                 0000000000316b0a: b9020022           ltgr    %r2,%r2
                 0000000000316b0e: a7740015           brc     7,316b38
                >0000000000316b12: e3d0c0000024       stg     %r13,0(%r12)
                 0000000000316b18: 4120c010           la      %r2,16(%r12)
                 0000000000316b1c: 4130d060           la      %r3,96(%r13)
                 0000000000316b20: e340d0600004       lg      %r4,96(%r13)
                 0000000000316b26: c0e50002b567       brasl   %r14,36d5f4
      Call Trace:
      ([<0000000000316a62>] jbd2_journal_file_inode+0x5e/0x138)
       [<00000000002da13c>] mpage_da_map_and_submit+0x2e8/0x42c
       [<00000000002daac2>] ext4_da_writepages+0x2da/0x504
       [<00000000002597e8>] writeback_single_inode+0xf8/0x268
       [<0000000000259f06>] writeback_sb_inodes+0xd2/0x18c
       [<000000000025a700>] writeback_inodes_wb+0x80/0x168
       [<000000000025aa92>] wb_writeback+0x2aa/0x324
       [<000000000025abde>] wb_do_writeback+0xd2/0x274
       [<000000000025ae3a>] bdi_writeback_thread+0xba/0x1c4
       [<00000000001737be>] kthread+0xa6/0xb0
       [<000000000056c1da>] kernel_thread_starter+0x6/0xc
       [<000000000056c1d4>] kernel_thread_starter+0x0/0xc
      INFO: lockdep is turned off.
      Last Breaking-Event-Address:
       [<0000000000316a8a>] jbd2_journal_file_inode+0x86/0x138
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
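      The address-versus-frame-number mix-up described above can be shown
      with a toy model. The key table, the 4 KiB page shift and all `toy_*`
      names are assumptions made for this sketch; real storage keys are
      accessed through architecture instructions, not an array.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define TOY_NR_FRAMES  16

/* Hypothetical per-frame key table standing in for the hardware keys. */
static uint8_t toy_keys[TOY_NR_FRAMES];

/* Like the accessors described above, these expect a page *address*. */
static void toy_set_storage_key(uint64_t addr, uint8_t key)
{
	toy_keys[(addr >> TOY_PAGE_SHIFT) % TOY_NR_FRAMES] = key;
}

static uint8_t toy_get_storage_key(uint64_t addr)
{
	return toy_keys[(addr >> TOY_PAGE_SHIFT) % TOY_NR_FRAMES];
}

/* Convert a page frame number into the address the accessors want. */
static uint64_t toy_pfn_to_addr(uint64_t pfn)
{
	return pfn << TOY_PAGE_SHIFT;
}
```

      Passing a raw frame number such as 3 to toy_get_storage_key() reads the
      key of frame 0 instead of frame 3, which is the kind of "random page"
      access the message describes.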
    • ACPI: Add D3 cold state · 28c2103d
      Lin Ming authored
      _SxW returns an Integer containing the lowest D-state supported in state
      Sx. If OSPM has not indicated that it supports _PR3, then the value “3”
      corresponds to D3.  If it has indicated _PR3 support, the value “3”
      represents D3hot and the value “4” represents D3cold.
      Linux does set _OSC._PR3, so we should fix it to expect that _SxW can
      return 4.
      Signed-off-by: Lin Ming <ming.m.lin@intel.com>
      Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
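      The interpretation rule for _SxW values quoted above can be written
      down as a small decoder. This is a sketch only: the enum and
      `toy_decode_sxw` are hypothetical names, and treating the value 4
      without _PR3 support as invalid is an assumption of this example.

```c
enum toy_dstate {
	TOY_D0,
	TOY_D1,
	TOY_D2,
	TOY_D3,        /* undifferentiated D3 (no _PR3 support indicated) */
	TOY_D3_HOT,
	TOY_D3_COLD,
	TOY_D_INVALID
};

/* Decode an _SxW-style integer, given whether _PR3 support was indicated. */
static enum toy_dstate toy_decode_sxw(unsigned int value, int pr3_supported)
{
	switch (value) {
	case 0:
		return TOY_D0;
	case 1:
		return TOY_D1;
	case 2:
		return TOY_D2;
	case 3:
		return pr3_supported ? TOY_D3_HOT : TOY_D3;
	case 4:
		/* Assumption: 4 is only meaningful when _PR3 is supported. */
		return pr3_supported ? TOY_D3_COLD : TOY_D_INVALID;
	default:
		return TOY_D_INVALID;
	}
}
```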
    • ACPI: processor: fix processor_physically_present in UP kernel · 932df741
      Lin Ming authored
      Usually, there are multiple processors defined in the ACPI table, for example:
          Scope (_PR)
              Processor (CPU0, 0x00, 0x00000410, 0x06) {}
              Processor (CPU1, 0x01, 0x00000410, 0x06) {}
              Processor (CPU2, 0x02, 0x00000410, 0x06) {}
              Processor (CPU3, 0x03, 0x00000410, 0x06) {}
      processor_physically_present(...) will be called to check whether those
      processors are physically present.
      Currently we have the following code in processor_physically_present():
          cpuid = acpi_get_cpuid(...);
          if ((cpuid == -1) && (num_possible_cpus() > 1))
                  return false;
          return true;
      In a UP kernel, acpi_get_cpuid(...) always returns -1 and
      num_possible_cpus() always returns 1, so
      processor_physically_present(...) always returns true for all passed-in
      processor handles.
      This is wrong for a UP processor or an SMP processor running a UP kernel.
      This patch removes the !SMP version of acpi_get_cpuid(), so both UP and
      SMP kernel use the same acpi_get_cpuid function.
      And for UP kernel, only processor 0 is valid.
      Tested-by: Anton Kochkov <anton.kochkov@gmail.com>
      Tested-by: Ambroz Bizjak <ambrop7@gmail.com>
      Signed-off-by: Lin Ming <ming.m.lin@intel.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
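      The logic flaw can be reproduced with the inputs made explicit. This is
      a user-space sketch with hypothetical `toy_*` names: the first function
      mirrors the check quoted in the message, while the lookup and the
      stricter check are only one possible shape of a fix, assumed here for
      illustration.

```c
#include <stdbool.h>

/* The check quoted above, with its two inputs passed in explicitly. */
static bool toy_present_old(int cpuid, unsigned int possible_cpus)
{
	if (cpuid == -1 && possible_cpus > 1)
		return false;
	return true;
}

/* Hypothetical UP lookup in which only processor 0 is valid. */
static int toy_get_cpuid_up(int acpi_id)
{
	return acpi_id == 0 ? 0 : -1;
}

/* Assumed stricter check: a failed lookup always means "not present". */
static bool toy_present_fixed(int cpuid)
{
	return cpuid != -1;
}
```

      With one possible CPU, toy_present_old(-1, 1) returns true for every
      handle, which is the behaviour the message calls wrong; the lookup that
      accepts only processor 0 makes the result depend on the handle again.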
    • idle governor: Avoid lock acquisition to read pm_qos before entering idle · 333c5ae9
      Tim Chen authored
      Thanks to the reviews and comments by Rafael, James, Mark and Andi.
      Here's version 2 of the patch incorporating your comments and also some
      update to my previous patch comments.
      I noticed that before entering idle state, the menu idle governor will
      look up the current pm_qos target value according to the list of qos
      requests received.  This look up currently needs the acquisition of a
      lock to access the list of qos requests to find the qos target value,
      slowing down the entrance into idle state due to contention by multiple
      cpus to access this list.  The contention is severe when there are a lot
      of cpus waking and going into idle.  For example, for a simple workload
      that has 32 pairs of processes ping-ponging messages to each other, where
      64 cpu cores are active in the test system, I see the following profile with
      37.82% of cpu cycles spent in contention of pm_qos_lock:
      -     37.82%          swapper  [kernel.kallsyms]          [k]
         - _raw_spin_lock_irqsave
            - 95.65% pm_qos_request
               - cpu_idle
                    99.98% start_secondary
      A better approach will be to cache the updated pm_qos target value so
      reading it does not require lock acquisition as in the patch below.
      With this patch the contention for pm_qos_lock is removed and I saw a
      2.2X increase in throughput for my message passing workload.
      cc: stable@kernel.org
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: James Bottomley <James.Bottomley@suse.de>
      Acked-by: mark gross <markgross@thegnar.org>
      Signed-off-by: Len Brown <len.brown@intel.com>
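      The caching approach described above can be sketched with C11 atomics.
      The struct, the fixed-size request array, the choice of "minimum" as
      the aggregate and all `toy_*` names are assumptions of this example,
      not the pm_qos implementation; locking of the slow path is only
      indicated in comments.

```c
#include <stdatomic.h>

#define TOY_MAX_REQUESTS 8

/* Hypothetical QoS object: a request list (lock-protected in real code)
 * plus a cached target value that readers can load without the lock. */
struct toy_qos {
	int requests[TOY_MAX_REQUESTS];
	int nr_requests;
	atomic_int target;   /* cached aggregate of all requests */
};

static void toy_qos_init(struct toy_qos *q, int default_value)
{
	q->nr_requests = 0;
	atomic_init(&q->target, default_value);
}

/* Slow path: a real implementation would hold the list lock here.
 * Recompute the aggregate (assumed: minimum) and publish it. */
static int toy_qos_add_request(struct toy_qos *q, int value)
{
	int i, min;

	if (q->nr_requests == TOY_MAX_REQUESTS)
		return -1;
	q->requests[q->nr_requests++] = value;
	min = q->requests[0];
	for (i = 1; i < q->nr_requests; i++)
		if (q->requests[i] < min)
			min = q->requests[i];
	atomic_store(&q->target, min);
	return 0;
}

/* Fast path (e.g. on idle entry): one atomic load, no lock taken. */
static int toy_qos_read(struct toy_qos *q)
{
	return atomic_load(&q->target);
}
```

      The cost of recomputing moves to the rare update path, while the
      frequent read path no longer contends on a shared lock.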
  5. 28 May, 2011 4 commits
    • ns: Wire up the setns system call · 7b21fddd
      Eric W. Biederman authored
      32bit and 64bit on x86 are tested and working.  The rest I have looked
      at closely and I can't find any problems.
      setns is an easy system call to wire up.  It just takes two ints so I
      don't expect any weird architecture porting problems.
      While doing this I have noticed that we have some architectures that are
      very slow to get new system calls.  cris seems to be the slowest where
      the last system calls wired up were preadv and pwritev.  avr32 is weird
      in that recvmmsg was wired up but never declared in unistd.h.  frv is
      behind with perf_event_open being the last syscall wired up.  On h8300
      the last system call wired up was epoll_wait.  On m32r the last system
      call wired up was fallocate.  mn10300 has recvmmsg as the last system
      call wired up.  The rest seem to at least have syncfs wired up which was
      new in 2.6.39.
      v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
      v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
      v4: Moved wiring up of the system call to another patch
      v5: ported to v2.6.39-rc6
      v6: rebased onto parisc-next and net-next to avoid syscall conflicts.
      v7: ported to Linus's latest post 2.6.39 tree.
      >  arch/blackfin/include/asm/unistd.h     |    3 ++-
      >  arch/blackfin/mach-common/entry.S      |    1 +
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Oh - ia64 wiring looks good.
      Acked-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Cache xattr security drop check for write v2 · 69b45732
      Andi Kleen authored
      Some recent benchmarking on btrfs showed that a major scaling bottleneck
      on large systems on btrfs is currently the xattr lookup on every write.
      Why xattr lookup on every write I hear you ask?
      write wants to drop suid and security related xattrs that could set
      capabilities for executables.  To do that it currently looks up
      security.capability on EVERY write (even for non executables) to decide
      whether to drop it or not.
      In btrfs this causes an additional tree walk, hitting some per file system
      locks and quite bad scalability. In a simple read workload on an 8S
      system I saw over 90% CPU time in spinlocks related to that.
      Chris Mason tells me this is also a problem in ext4, where it hits
      the global mbcache lock.
      This patch adds a simple per-inode flag to avoid this problem.  We only
      do the lookup once per file and then, if there is no xattr, cache
      the decision. All xattr changes clear the flag.
      I also used the same flag to avoid the suid check, although
      that one is pretty cheap.
      A file system can also set this flag when it creates the inode,
      if it has a cheap way to do so.  This is done for some common file systems
      in follow-on patches.
      With this patch a major part of the lock contention disappears
      for btrfs. Some testing on smaller systems didn't show significant
      performance changes, but at least it helps the larger systems
      and is generally more efficient.
      v2: Rename is_sgid. add file system helper.
      Cc: chris.mason@oracle.com
      Cc: josef@redhat.com
      Cc: viro@zeniv.linux.org.uk
      Cc: agruen@linbit.com
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
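      The per-inode caching scheme described above can be sketched as
      follows. The struct fields, the lookup counter and all `toy_*` names
      are hypothetical and exist only to make the behaviour observable; this
      is not the VFS code.

```c
#include <stdbool.h>

struct toy_inode {
	bool has_security_xattr;  /* what the expensive lookup would find */
	bool no_sec_cached;       /* cached "nothing to drop" decision */
	unsigned int lookups;     /* counts expensive lookups, for illustration */
};

/* Stand-in for the costly per-write xattr lookup. */
static bool toy_expensive_xattr_lookup(struct toy_inode *inode)
{
	inode->lookups++;
	return inode->has_security_xattr;
}

/* Called on every write: returns true if a security xattr must be dropped. */
static bool toy_write_needs_drop(struct toy_inode *inode)
{
	if (inode->no_sec_cached)
		return false;              /* fast path: decision already cached */
	if (toy_expensive_xattr_lookup(inode))
		return true;
	inode->no_sec_cached = true;       /* remember the negative result */
	return false;
}

/* Any xattr change clears the flag so the next write looks again. */
static void toy_setxattr(struct toy_inode *inode, bool security)
{
	inode->has_security_xattr = security;
	inode->no_sec_cached = false;
}
```

      Only the negative result is cached in this sketch, so a file that does
      carry the xattr is still checked on every write until it is dropped.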
    • atomic: Add atomic_or() · 55c2945a
      Paul E. McKenney authored
      An atomic_or() function is needed by TREE_RCU to avoid deadlock, so
      add a generic version.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
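      A generic atomic OR can be built from a compare-and-swap retry loop.
      The sketch below uses C11 atomics in user space; `toy_atomic_or` is a
      hypothetical name and this is an illustration of the technique, not the
      kernel's implementation.

```c
#include <stdatomic.h>

/* OR `mask` into *v using a compare-and-swap retry loop. */
static void toy_atomic_or(int mask, atomic_int *v)
{
	int old = atomic_load(v);

	/* On failure, `old` is refreshed with the current value of *v and
	 * the new value `old | mask` is recomputed on the next iteration. */
	while (!atomic_compare_exchange_weak(v, &old, old | mask))
		;
}
```

      C11 also offers atomic_fetch_or(), which does the same in one call; the
      loop form is shown because it works wherever only compare-and-swap is
      available.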
    • cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed · 1e1b6c51
      KOSAKI Motohiro authored
      The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
      tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
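      The rule stated above, that the count must follow the mask, is easiest
      to keep when both are changed through one helper. The sketch below is a
      user-space illustration with hypothetical names (`toy_task`,
      `toy_set_cpus_allowed`) and a 64-bit mask assumed for simplicity.

```c
#include <stdint.h>

struct toy_task {
	uint64_t cpus_allowed;         /* one bit per CPU */
	unsigned int nr_cpus_allowed;  /* must equal the number of set bits */
};

/* Count the set bits in a mask. */
static unsigned int toy_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)  /* clear the lowest set bit each pass */
		n++;
	return n;
}

/* Single update point: the mask and its cached weight change together. */
static void toy_set_cpus_allowed(struct toy_task *t, uint64_t mask)
{
	t->cpus_allowed = mask;
	t->nr_cpus_allowed = toy_weight(mask);
}
```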
  6. 27 May, 2011 14 commits