1. 18 Jan, 2013 1 commit
    • workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h · ea138446
      Tejun Heo authored

      Workqueue wants to expose more of its interface internally to kernel/.
      Instead of adding a new header file, repurpose kernel/workqueue_sched.h:
      rename it to workqueue_internal.h and add an include protector.
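      For reference, the protector added is of the usual kernel form; a minimal
      sketch, with the guard macro name assumed from the file name:

         /* kernel/workqueue_internal.h -- sketch of the include protector;
          * the guard name follows kernel convention and is assumed here. */
         #ifndef _KERNEL_WORKQUEUE_INTERNAL_H
         #define _KERNEL_WORKQUEUE_INTERNAL_H

         /* declarations shared among kernel/ internals go here */

         #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */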
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      ea138446
  2. 11 Dec, 2012 5 commits
    • mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
      Mel Gorman authored

      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      3105b86a
    • mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman authored

      This patch adds Kconfig options and kernel parameters to allow enabling
      and disabling of automatic NUMA balancing. The existence of such a switch
      was, and is, very important when debugging problems related to
      transparent hugepages, and we should have the same for automatic NUMA
      placement.
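      As a rough illustration of the kernel-parameter side, a boot switch of
      this kind is typically wired up with __setup(); a minimal sketch, where
      set_numabalancing_state() is an assumed helper name:

         /* Sketch: parse a numa_balancing= boot parameter. */
         static int __init setup_numabalancing(char *str)
         {
                 if (!str)
                         return 0;
                 if (!strcmp(str, "enable")) {
                         set_numabalancing_state(true);   /* assumed helper */
                         return 1;
                 }
                 if (!strcmp(str, "disable")) {
                         set_numabalancing_state(false);
                         return 1;
                 }
                 return 0;   /* unrecognized value */
         }
         __setup("numa_balancing=", setup_numabalancing);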
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      1a687c2e
    • mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman authored

      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement. Ideally, a proper
      policy would detect whether a workload was properly placed, then schedule
      and adjust the PTE scanning rate accordingly. We do not track the
      information necessary to do that, but we at least know whether we
      migrated or not.
      
      With this patch, scanning slows down if a page was not migrated as the
      result of a NUMA hinting fault, up to sysctl_numa_balancing_scan_period_max,
      which is now higher than the previous default. Once every minute the
      scanner is reset in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
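      The backoff can be pictured as follows; a sketch only, with the per-task
      field name and the doubling factor assumed rather than taken from the
      patch:

         /* Sketch: on a hinting fault that did not migrate, slow the
          * per-task scan period, clamped to the (raised) maximum. */
         void task_numa_fault(int node, int pages, bool migrated)
         {
                 struct task_struct *p = current;

                 if (!migrated)
                         p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
                                                   p->numa_scan_period * 2);
         }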
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      b8593bfd
    • mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra authored

      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and had better stick to the node they
      were started on. As tasks mature and rebalance to other CPUs and
      nodes, their NUMA placement has to change as well, and it starts to
      matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward; most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms
      tunable knob.
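      The delay check itself amounts to very little; a sketch under assumed
      names, with mm->numa_next_scan holding the earliest allowed scan time:

         /* Sketch: defer the first working set scan by the tunable delay. */
         if (mm->numa_next_scan == 0)
                 mm->numa_next_scan = jiffies +
                         msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

         if (time_before(jiffies, mm->numa_next_scan))
                 return;   /* task too young: skip working set sampling */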
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      4b96a29b
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra authored

      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler; when such a PTE is faulted later, the page
      will be faulted onto the node the faulting CPU is running on. In itself
      this does nothing useful, but any placement policy will fundamentally
      depend on receiving hints on placement from fault context and doing
      something intelligent about them.
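      The fault-side flow can be sketched as below; the handler name and the
      details are illustrative, not lifted from the patch:

         /* Sketch: a fault on a pte_numa PTE makes the page accessible
          * again and reports the faulting node as a placement hint. */
         static int do_numa_page(struct mm_struct *mm, unsigned long addr,
                                 pte_t pte, pte_t *ptep)
         {
                 int node = numa_node_id();    /* node of the faulting CPU */

                 pte = pte_mknonnuma(pte);     /* clear the NUMA marker */
                 set_pte_at(mm, addr, ptep, pte);

                 task_numa_fault(node, 1);     /* hint for the scheduler */
                 return 0;
         }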
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
  3. 30 Nov, 2012 1 commit
    • context_tracking: New context tracking subsystem · 91d1aa43
      Frederic Weisbecker authored

      Create a new subsystem that probes kernel boundaries to keep track of
      transitions between context levels, with two basic initial contexts:
      user or kernel.

      This is an abstraction of some RCU code that uses such tracking to
      implement its userspace extended quiescent state.

      We need to pull this up from RCU into this new level of indirection
      because the tracking is also going to be used to implement "on demand"
      generic virtual cputime accounting, a necessary step to shut down the
      tick while still accounting the cputime.
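      A sketch of the boundary hooks, assuming per-CPU state with two values
      (IN_KERNEL/IN_USER) as described above:

         /* Sketch: flip the per-CPU context state at the boundary and
          * drive RCU's userspace extended quiescent state from it. */
         void user_enter(void)
         {
                 if (__this_cpu_read(context_tracking.active)) {
                         __this_cpu_write(context_tracking.state, IN_USER);
                         rcu_user_enter();   /* CPU quiescent for RCU */
                 }
         }

         void user_exit(void)
         {
                 if (__this_cpu_read(context_tracking.state) == IN_USER) {
                         __this_cpu_write(context_tracking.state, IN_KERNEL);
                         rcu_user_exit();    /* back under RCU's watch */
                 }
         }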
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  4. 28 Nov, 2012 1 commit
  5. 20 Nov, 2012 1 commit
    • userns: Kill task_user_ns · 4c44aaaf
      Eric W. Biederman authored

      The task_user_ns function hides the fact that it is getting the user
      namespace from the struct cred on the task. The struct cred may go away
      as soon as the RCU lock is released, which leads to a race where we can
      dereference a stale user namespace pointer.

      To make it obvious that a struct cred is involved, kill task_user_ns.

      To kill the race, modify the users of task_user_ns to only reference
      the user namespace while the RCU lock is held.
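      The fixed usage pattern looks like this; a sketch, with the uid lookup
      as a made-up example consumer:

         /* Sketch: only touch the user namespace while the RCU read
          * lock pins the task's cred. */
         rcu_read_lock();
         user_ns = __task_cred(task)->user_ns;
         uid = from_kuid_munged(user_ns, task_uid(task));
         rcu_read_unlock();
         /* user_ns must not be dereferenced past this point */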
      
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morris <james.l.morris@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      4c44aaaf
  6. 19 Nov, 2012 1 commit
  7. 16 Nov, 2012 1 commit
    • sched: Mark RCU reader in sched_show_task() · 4e79752c
      Paul E. McKenney authored

      When sched_show_task() is invoked from try_to_freeze_tasks(), there is
      no RCU read-side critical section, resulting in the following splat:
      
      [  125.780730] ===============================
      [  125.780766] [ INFO: suspicious RCU usage. ]
      [  125.780804] 3.7.0-rc3+ #988 Not tainted
      [  125.780838] -------------------------------
      [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
      [  125.780946]
      [  125.780946] other info that might help us debug this:
      [  125.780946]
      [  125.781031]
      [  125.781031] rcu_scheduler_active = 1, debug_locks = 0
      [  125.781087] 4 locks held by s2ram/4211:
      [  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
      [  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
      [  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
      [  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
      [  125.781543]
      [  125.781543] stack backtrace:
      [  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
      [  125.781632] Call Trace:
      [  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
      [  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
      [  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
      [  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
      [  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
      [  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
      [  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
      [  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
      [  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
      [  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
      [  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b
      
      This commit therefore adds the needed RCU read-side critical section.
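      The shape of the fix, sketched (the parent dereference is the reported
      offender; the surrounding printk is elided):

         /* Sketch: do the parent-task dereference inside an explicit
          * RCU read-side critical section. */
         rcu_read_lock();
         ppid = task_pid_nr(rcu_dereference(p->real_parent));
         rcu_read_unlock();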
      Reported-by: default avatar"Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4e79752c
  8. 24 Oct, 2012 4 commits
  9. 23 Oct, 2012 2 commits
  10. 05 Oct, 2012 2 commits
    • sched: Update sched_domains_numa_masks[][] when new cpus are onlined · 301a5cba
      Tang Chen authored

      Once the array sched_domains_numa_masks[][] is defined, it is never updated.

      When a new cpu on a new node is onlined, the corresponding member in
      sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
      As a result, build_overlap_sched_groups() will initialize a NULL
      sched_group for the new cpu on the new node, which leads to a kernel panic:
      
      [ 3189.403280] Call Trace:
      [ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
      [ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
      [ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
      [ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
      [ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
      [ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
      [ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
      [ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
      [ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
      [ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
      [ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
      [ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
      [ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
      [ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
      [ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
      [ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
      [ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
      [ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
      [ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
      [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
      [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      
      This patch registers a new notifier on the cpu hotplug notify chain, and
      updates sched_domains_numa_masks every time a cpu is onlined or offlined.
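      The notifier can be sketched as follows; the set/clear helper names are
      assumptions of this sketch:

         /* Sketch: keep sched_domains_numa_masks[][] in sync with hotplug. */
         static int sched_domains_numa_masks_update(struct notifier_block *nb,
                                                    unsigned long action, void *hcpu)
         {
                 int cpu = (long)hcpu;

                 switch (action & ~CPU_TASKS_FROZEN) {
                 case CPU_ONLINE:
                         sched_domains_numa_masks_set(cpu);   /* add cpu's bits */
                         break;
                 case CPU_DEAD:
                         sched_domains_numa_masks_clear(cpu); /* drop them */
                         break;
                 }
                 return NOTIFY_OK;
         }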
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      [ fixed compile warning ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      301a5cba
    • sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions · 5f7865f3
      Tang Chen authored

      We should temporarily reset 'sched_domains_numa_levels' to 0 while
      sched_init_numa() initializes the masks, and only set it back to 'level'
      at the end. If it fails to allocate memory for the array
      sched_domains_numa_masks[][], the array will contain fewer than 'level'
      members. This could be dangerous when we use it to iterate over
      sched_domains_numa_masks[][] in other functions.

      This patch sets sched_domains_numa_levels to 0 before initializing the
      sched_domains_numa_masks[][] array, and resets it to 'level' once
      sched_domains_numa_masks[][] is fully initialized.
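      The ordering can be sketched like this (error paths elided):

         /* Sketch: publish 'level' only after every mask is allocated. */
         sched_domains_numa_levels = 0;        /* masks not yet usable */

         sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
         if (!sched_domains_numa_masks)
                 return;                       /* iteration stays bounded at 0 */

         /* ... allocate and fill each sched_domains_numa_masks[i][j] ... */

         sched_domains_numa_levels = level;    /* now safe for other users */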
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f7865f3
  11. 01 Oct, 2012 1 commit
    • sanitize tsk_is_polling() · 16a80163
      Al Viro authored

      Make the default just return 0. The current default (checking
      TIF_POLLING_NRFLAG) is moved to the architectures that need it;
      ones that don't do polling in their idle threads don't need to
      define TIF_POLLING_NRFLAG at all.

      ia64 defined both TS_POLLING (used by its tsk_is_polling())
      and TIF_POLLING_NRFLAG (not used at all).  Killed the latter...
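      The generic fallback then reduces to a sketch like:

         /* Sketch: default for architectures that don't poll in their
          * idle threads; polling archs override this with a flag test. */
         #ifndef tsk_is_polling
         #define tsk_is_polling(t) 0
         #endif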
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      16a80163
  12. 26 Sep, 2012 3 commits
    • rcu: Exit RCU extended QS on user preemption · 20ab65e3
      Frederic Weisbecker authored

      When an exception or an irq is about to resume userspace, if
      the task needs to be rescheduled, the arch low level code
      calls schedule() directly.
      
      If we call it, it is because we have the TIF_RESCHED flag:
      
      - It can be set after random local calls to set_need_resched()
        (RCU, drm, ...)
      
      - A wake up happened and the CPU needs preemption. This can
        happen in several ways:
      
          * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
            wakee an IPI to schedule the new task.
          * Remotely enqueued: the remote waking CPU sends an IPI to the target
            and the wake up is made by the target.
          * Locally: waking CPU == wakee CPU and the wakeup is done locally.
            set_need_resched() is called without IPI.
      
      In the case of local and remotely enqueued wake ups, the tick can
      be restarted when we enqueue the new task, and RCU can exit the
      extended quiescent state at the same time. Then by the time we reach
      the irq exit path and call schedule(), we are no longer in RCU user mode.

      But if we call schedule() only because something called set_need_resched(),
      RCU may still be in user mode when we reach schedule().

      Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
      flag and call schedule() while the IPI has not yet arrived to restart
      the tick and exit RCU user mode.
      
      We need to manually protect against these corner cases.
      
      Create a new API schedule_user() that calls schedule() inside
      rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
      will need to rely on it now to implement user preemption safely.
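      Sketched under the names given above, the helper is small:

         /* Sketch: bracket schedule() with the RCU user-mode hooks so
          * the scheduler never runs inside the extended QS. */
         asmlinkage void __sched schedule_user(void)
         {
                 rcu_user_exit();    /* leave userspace extended QS */
                 schedule();
                 rcu_user_enter();   /* re-enter before resuming user code */
         }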
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      20ab65e3
    • rcu: Exit RCU extended QS on kernel preemption after irq/exception · 90a340ed
      Frederic Weisbecker authored

      When an exception or an irq exits, and we are going to resume into
      interrupted kernel code, the low level architecture code calls
      preempt_schedule_irq() if there is a need to reschedule.
      
      If the interrupt/exception occurred between a call to rcu_user_enter()
      (from syscall exit, exception exit, do_notify_resume exit, ...) and the
      real resume to userspace (iret, ...), preempt_schedule_irq() can be
      called while RCU thinks we are in userspace. But preempt_schedule_irq()
      is going to run kernel code, possibly including RCU read-side critical
      sections, so we must exit the userspace extended quiescent state before
      we call it.
      
      To solve this, just call rcu_user_exit() at the beginning of
      preempt_schedule_irq().
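      Sketched, the fix is a single hook at the top of the function:

         asmlinkage void __sched preempt_schedule_irq(void)
         {
                 /* Sketch: RCU may still think we are in userspace here. */
                 rcu_user_exit();

                 /* ... existing preemption loop calling __schedule() ... */
         }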
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      90a340ed
    • rcu: Switch task's syscall hooks on context switch · 04e7e951
      Frederic Weisbecker authored

      Clear the syscalls hook of a task when it's scheduled out so that if
      the task migrates, it doesn't run the syscall slow path on a CPU
      that might not need it.
      
      Also set the syscalls hook on the next task if needed.
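      A sketch of the hand-off at context switch; the TIF_NOHZ flag name is
      an assumption of this sketch:

         /* Sketch: move the syscall slow-path flag from prev to next so
          * a migrating task doesn't leave the hook on the wrong CPU. */
         static inline void rcu_user_hooks_switch(struct task_struct *prev,
                                                  struct task_struct *next)
         {
                 if (test_tsk_thread_flag(prev, TIF_NOHZ)) {
                         clear_tsk_thread_flag(prev, TIF_NOHZ);
                         set_tsk_thread_flag(next, TIF_NOHZ);
                 }
         }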
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      04e7e951
  13. 25 Sep, 2012 1 commit
    • cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Frederic Weisbecker authored

      Use a naming scheme based on vtime as a prefix for the virtual
      cputime accounting APIs:

      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()

      This makes it easier to allow for further variants such as
      vtime_account_system(), vtime_account_idle(), ... if we want to
      find out, from generic code, the context we account to.

      This also makes it clearer which subsystem these APIs belong to.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bf9fae9f
  14. 23 Sep, 2012 1 commit
    • sched: Fix load avg vs cpu-hotplug · 5d180232
      Peter Zijlstra authored

      Rakib and Paul reported two different issues related to the same few
      lines of code.

      Rakib's issue is that the nr_uninterruptible migration code is wrong, in
      that he sees artifacts due to it (Rakib, please do expand in more
      detail).

      Paul's issue is that this code, as it stands, relies on us using
      stop_machine() for unplug; we would all like to remove this assumption
      so that eventually we can remove the stop_machine() usage altogether.

      The only reason we'd have to migrate nr_uninterruptible is so that we
      could use for_each_online_cpu() loops in favour of
      for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
      only such loop and it's using possible, let's not bother at all.

      The problem Rakib sees is (probably) caused by the fact that by
      migrating nr_uninterruptible we screw up rq->calc_load_active for both
      rqs involved.

      So don't bother with fancy migration schemes (meaning we now have to
      keep using for_each_possible_cpu()) and instead fold any nr_active delta
      after we migrate all tasks away, to make sure we don't have any skewed
      nr_active accounting.
      
      [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
      miscounting noted by Rakib. ]
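      The fold itself is tiny; a sketch reusing the scheduler's existing
      load-accounting helpers:

         /* Sketch: once the dead CPU's tasks are migrated away, fold its
          * leftover nr_active delta into the global calc_load count. */
         static void calc_load_migrate(struct rq *rq)
         {
                 long delta = calc_load_fold_active(rq);

                 if (delta)
                         atomic_long_add(delta, &calc_load_tasks);
         }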
      Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      5d180232
  15. 16 Sep, 2012 1 commit
  16. 13 Sep, 2012 2 commits
  17. 04 Sep, 2012 4 commits
  18. 20 Aug, 2012 2 commits
    • cputime: Consolidate vtime handling on context switch · baa36046
      Frederic Weisbecker authored

      The archs that implement virtual cputime accounting all flush the
      cputime of a task when it gets descheduled, and sometimes set up some
      initial ground state for accounting the next task.

      These archs all put their own hooks in their context switch callbacks
      and handle the off-case themselves.

      Consolidate this by creating a new account_switch_vtime() callback,
      called in generic code right after a context switch, that these archs
      must implement to flush the prev task's cputime and initialize the next
      task's cputime-related state.
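      The generic call site can be pictured as below; a sketch, with the
      exact placement inside finish_task_switch() assumed:

         /* Sketch: one hook right after the switch, implemented per-arch. */
         static void finish_task_switch(struct rq *rq, struct task_struct *prev)
         {
                 /* ... existing post-switch bookkeeping ... */
                 account_switch_vtime(prev);   /* flush prev, prime next */
                 /* ... */
         }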
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      baa36046
    • sched: Move cputime code to its own file · 73fbec60
      Frederic Weisbecker authored

      Extract the cputime code from the giant sched/core.c and put it in
      its own file. This makes it easier to deal with this particular area
      and de-bloats core.c a bit more.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      73fbec60
  19. 13 Aug, 2012 4 commits
  20. 26 Jul, 2012 2 commits