- 12 Jan, 2014 1 commit
-
-
Rik van Riel authored
Thomas Hellstrom bisected a regression where erratic 3D performance is experienced on virtual machines as measured by glxgears. It identified commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") as the problem which had modified the behaviour of effective_load. Effective load calculates the difference to the system-wide load if a scheduling entity was moved to another CPU. The task group is not heavier as a result of the move but overall system load can increase/decrease as a result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA node") changed effective_load to make it suitable for calculating if a particular NUMA node was compute overloaded. To reduce the cost of the function, it assumed that a current sched entity weight of 0 was uninteresting but that is not the case. wake_affine() uses a weight of 0 for sync wakeups on the grounds that it is assuming the waking task will sleep and not contribute to load in the near future. In this case, we still want to calculate the effective load of the sched entity hierarchy. As effective_load is no longer used by task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide search to find swap/migration candidates), this patch simply restores the historical behaviour. Reported-and-tested-by:
Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by:
Rik van Riel <riel@redhat.com> [ Wrote changelog] Signed-off-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 19 Dec, 2013 1 commit
-
-
Mel Gorman authored
Inaccessible VMA should not be trapping NUMA hint faults. Skip them. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 11 Dec, 2013 1 commit
-
-
Peter Zijlstra authored
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This results in him occasionally seeing time going backwards - which crashes the scheduler ... Most of our time accounting can actually handle that except the most common one; the tick time update of sched_fair. There is a further problem with that code; previously we assumed that because we get a tick every TICK_NSEC our time delta could never exceed 32 bits and math was simpler. However, ever since Frederic managed to get NO_HZ_FULL merged; this is no longer the case since now a task can run for a long time indeed without getting a tick. It only takes about 4.3 seconds to overflow our u32 in nanoseconds. This means we not only need to better deal with time going backwards; but also means we need to be able to deal with large deltas. This patch reworks the entire code and uses mul_u64_u32_shr() as proposed by Andy a long while ago. We express our virtual time scale factor in a u32 multiplier and shift right and the 32bit mul_u64_u32_shr() implementation reduces to a single 32x32->64 multiply if the time delta is still short (common case). For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128. Reported-and-Tested-by:
Christian Engelmayer <cengelma@gmx.at> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: fweisbec@gmail.com Cc: Paul Turner <pjt@google.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net Signed-off-by:
Ingo Molnar <mingo@kernel.org>
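The split multiply described in the changelog above can be illustrated in user space. This is a minimal sketch, not the kernel's code: the function name `mul_u64_u32_shr_sketch` is hypothetical, and it assumes a shift between 0 and 32 so that no 128-bit type is needed. For reference, a u32 nanosecond counter wraps after 2^32 ns, which is about 4.29 seconds.

```c
#include <stdint.h>

/*
 * Illustrative (a * mul) >> shift for a 64-bit delta and a 32-bit
 * multiplier. Assumption for this sketch: 0 <= shift <= 32.
 */
uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul, unsigned int shift)
{
	uint32_t ah = (uint32_t)(a >> 32);	/* high 32 bits of the delta */
	uint32_t al = (uint32_t)a;		/* low 32 bits of the delta */
	uint64_t ret;

	/* Short delta (common case): one 32x32->64 multiply is enough. */
	ret = ((uint64_t)al * mul) >> shift;

	/* The high half only contributes once the delta exceeds 32 bits. */
	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);

	return ret;
}
```

With a delta of 5,000,000,000 ns (more than a u32 can hold), a multiplier of 3 and a shift of 1, the sketch returns 7,500,000,000, whereas truncating the delta to 32 bits first would silently lose the high half.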
-
- 19 Nov, 2013 1 commit
-
-
Srikar Dronamraju authored
After commit 863bffc8 ("sched/fair: Fix group power_orig computation"), we can dereference rq->sd before it is set. Fix this by falling back to power_of() in this case and add a comment explaining things. Signed-off-by:
Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Added comment and tweaked patch. ] Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131113151718.GN21461@twins.programming.kicks-ass.net Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 13 Nov, 2013 3 commits
-
-
Michal Nazarewicz authored
sa->runnable_avg_sum is of type u32 but after shifting it by NICE_0_SHIFT bits it is promoted to u64. This of course makes no sense, since the result will never be more than 32 bits long. Casting sa->runnable_avg_sum to u64 before it is shifted fixes this problem. Reviewed-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Michal Nazarewicz <mina86@mina86.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384112521-25177-1-git-send-email-mpn@google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
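The difference between widening after the shift and widening before it can be shown with a small sketch. The function names are hypothetical and the shift value of 10 is an assumption chosen for illustration, not taken from the changelog.

```c
#include <stdint.h>

#define NICE_0_SHIFT_SKETCH 10	/* assumed value, for illustration only */

/* The shift is done in 32 bits; the conversion to u64 comes too late. */
uint64_t contrib_truncated(uint32_t runnable_avg_sum)
{
	return runnable_avg_sum << NICE_0_SHIFT_SKETCH;
}

/* Cast first, so the shift itself is carried out in 64 bits. */
uint64_t contrib_widened(uint32_t runnable_avg_sum)
{
	return (uint64_t)runnable_avg_sum << NICE_0_SHIFT_SKETCH;
}
```

For an input of 0x00500000 the first variant loses the top bits and yields 0x40000000, while the second keeps the full value 0x140000000.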
-
Peter Zijlstra authored
Because we're completely unserialized against hotplug it's well possible to try and generate numa stats for an offlined node. Bail out early (and avoid a /0) in this case. The resulting stats are all 0 which should result in an undesirable balance target -- not to mention that actually trying to migrate to an offline CPU will fail. Reported-by:
Prarit Bhargava <prarit@redhat.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/n/tip-orja0qylcvyhxfsuebcyL5sI@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
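The early bail-out described above might look roughly like the following. This is a sketch under stated assumptions: the struct, the function name, and the scale constant are hypothetical, and every online CPU is assumed to report a non-zero power.

```c
#include <string.h>

#define POWER_SCALE_SKETCH 1024UL	/* assumed scale factor */

/* Hypothetical per-node statistics. */
struct node_stats_sketch {
	unsigned long load;	/* load, scaled by power once computed */
	unsigned long power;	/* summed compute power of the node */
};

void update_node_stats_sketch(struct node_stats_sketch *ns,
			      const unsigned long *cpu_load,
			      const unsigned long *cpu_power,
			      int nr_online_cpus)
{
	int i;

	memset(ns, 0, sizeof(*ns));
	for (i = 0; i < nr_online_cpus; i++) {
		ns->load += cpu_load[i];
		ns->power += cpu_power[i];
	}

	/*
	 * Offlined node: no CPU contributed, so power is 0. Return with
	 * all-zero stats instead of dividing by zero below.
	 */
	if (nr_online_cpus == 0)
		return;

	ns->load = ns->load * POWER_SCALE_SKETCH / ns->power;
}
```

A caller that sees all-zero stats can then treat the node as an undesirable balance target.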
-
Rik van Riel authored
The cpusets code can split up the scheduler's domain tree into smaller domains. Some of those smaller domains may not cross NUMA nodes at all, leading to a NULL pointer dereference on the per-cpu sd_numa pointer. Tasks cannot be migrated out of their domain, so the patch also sets p->numa_preferred_nid to wherever they are, to prevent the migration from being retried over and over again. Reported-by:
Prarit Bhargava <prarit@redhat.com> Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/n/tip-oosqomw0Jput0Jkvoowhrqtu@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 06 Nov, 2013 2 commits
-
-
Preeti U Murthy authored
nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set. Therefore instead of updating nr_busy_cpus at every level of sched domain, since it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain. In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the highest domain which has SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set and trigger nohz_idle_balancing if any of the levels have more than one busy cpu. sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set so that it can be queried directly when required. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain where it is currently used. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: svaidy@linux.vnet.ibm.com Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Cc: peterz@infradead.org Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Vaidyanathan Srinivasan authored
Asymmetric scheduling within a core is a scheduler loadbalancing feature that is triggered when SD_ASYM_PACKING flag is set. The goal for the load balancer is to move tasks to lower order idle SMT threads within a core on a POWER7 system. In nohz_kick_needed(), we intend to check if our sched domain (core) is completely busy or we have idle cpu. The following check for SD_ASYM_PACKING: (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu) already covers the case of checking if the domain has an idle cpu, because cpumask_first_and() will not yield any set bits if this domain has no idle cpu. Hence, nr_busy check against group weight can be removed. Reported-by:
Michael Neuling <michael.neuling@au1.ibm.com> Signed-off-by:
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by:
Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by:
Michael Neuling <mikey@neuling.org> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
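Why the SD_ASYM_PACKING check already covers the "no idle cpu" case can be seen in a small model that uses 64-bit masks in place of cpumasks. The helper names and the 64-CPU limit are assumptions for this sketch; like cpumask_first_and(), the helper returns an out-of-range index when the intersection is empty.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 64	/* assumed limit for this model */

/* Index of the first bit set in (a & b), or NR_CPUS_SKETCH if none. */
int first_and_sketch(uint64_t a, uint64_t b)
{
	uint64_t m = a & b;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (m & ((uint64_t)1 << cpu))
			return cpu;
	return NR_CPUS_SKETCH;
}

/* Kick only if a lower-numbered cpu in this domain is idle. */
int asym_kick_needed_sketch(uint64_t idle_mask, uint64_t domain_span, int cpu)
{
	return first_and_sketch(idle_mask, domain_span) < cpu;
}
```

If the domain has no idle cpu, the helper returns NR_CPUS_SKETCH, which is never less than a valid cpu number, so the comparison fails without a separate busy-count check.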
-
- 29 Oct, 2013 5 commits
-
-
Ben Segall authored
throttle_cfs_rq() doesn't check to make sure that period_timer is running, and while update_curr/assign_cfs_runtime does, a concurrently running period_timer on another cpu could cancel itself between this cpu's update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running in the tg to restart the timer, this causes the cfs_rq to be stranded forever. Fix this by calling __start_cfs_bandwidth() in throttle if the timer is inactive. (Also add some sched_debug lines for cfs_bandwidth.) Tested: make a run/sleep task in a cgroup, loop switching the cgroup between 1ms/100ms quota and unlimited, checking for timer_active=0 and throttled=1 as a failure. With the throttle_cfs_rq() change commented out this fails, with the full patch it passes. Signed-off-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Paul Turner authored
Currently, group entity load-weights are initialized to zero. This admits some races with respect to the first time they are re-weighted in early use. ( Let g[x] denote the se for "g" on cpu "x". ) Suppose that we have root->a and that a enters a throttled state, immediately followed by a[0]->t1 (the only task running on cpu[0]) blocking: put_prev_task(group_cfs_rq(a[0]), t1) put_prev_entity(..., t1) check_cfs_rq_runtime(group_cfs_rq(a[0])) throttle_cfs_rq(group_cfs_rq(a[0])) Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first time: enqueue_task_fair(rq[0], t2) enqueue_entity(group_cfs_rq(b[0]), t2) enqueue_entity_load_avg(group_cfs_rq(b[0]), t2) account_entity_enqueue(group_cfs_rq(b[0]), t2) update_cfs_shares(group_cfs_rq(b[0])) < skipped because b is part of a throttled hierarchy > enqueue_entity(group_cfs_rq(a[0]), b[0]) ... We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0 which violates invariants in several code-paths. Eliminate the possibility of this by initializing group entity weight. Signed-off-by:
Paul Turner <pjt@google.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Ben Segall authored
__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock, waiting for the hrtimer to finish. However, if sched_cfs_period_timer runs for another loop iteration, the hrtimer can attempt to take rq->lock, resulting in deadlock. Fix this by ensuring that cfs_b->timer_active is cleared only if the _latest_ call to do_sched_cfs_period_timer is returning as idle. Then __start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for that to succeed or timer_active == 1. Signed-off-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Ben Segall authored
hrtimer_expires_remaining does not take internal hrtimer locks and thus must be guarded against concurrent __hrtimer_start_range_ns (but returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe. Signed-off-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Ben Segall authored
When we transition cfs_bandwidth_used to false, any currently throttled groups will incorrectly return false from cfs_rq_throttled. While tg_set_cfs_bandwidth will unthrottle them eventually, currently running code (including at least dequeue_task_fair and distribute_cfs_runtime) will cause errors. Fix this by turning off cfs_bandwidth_used only after unthrottling all cfs_rqs. Tested: toggle bandwidth back and forth on a loaded cgroup. Caused crashes in minutes without the patch, hasn't crashed with it. Signed-off-by:
Ben Segall <bsegall@google.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 16 Oct, 2013 1 commit
-
-
Peter Zijlstra authored
There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap places with task T, on CPU B. Task P: - call migrate_swap Task T: - go to sleep, removing itself from the runqueue Task P: - double lock the runqueues on CPU A & B Task T: - get woken up, place itself on the runqueue of CPU C Task P: - see that task T is on a runqueue, and pretend to remove it from the runqueue on CPU B Now CPUs B & C both have corrupted scheduler data structures. This patch fixes it, by holding the pi_lock for both of the tasks involved in the migrate swap. This prevents task T from waking up, and placing itself onto another runqueue, until after migrate_swap has released all locks. This means that, when migrate_swap checks, task T will be either on the runqueue where it was originally seen, or not on any runqueue at all. Migrate_swap deals correctly with both of those cases. Tested-by:
Joe Mario <jmario@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: hannes@cmpxchg.org Cc: aarcange@redhat.com Cc: srikar@linux.vnet.ibm.com Cc: tglx@linutronix.de Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 14 Oct, 2013 1 commit
-
-
Kamalesh Babulal authored
- 'load_icx' => 'load_idx' - 'calculcate_imbalance' => 'calculate_imbalance' Signed-off-by:
Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1381685775-3544-1-git-send-email-kamalesh@linux.vnet.ibm.com [ Also, don't capitalize 'idle' unnecessarily. ] Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 12 Oct, 2013 1 commit
-
-
Ramkumar Ramachandra authored
The balance parameter was removed by 23f0d209 ("sched: Factor out code to should_we_balance()", 2013-08-06). Signed-off-by:
Ramkumar Ramachandra <artagnon@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381400433-2030-1-git-send-email-artagnon@gmail.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 09 Oct, 2013 23 commits
-
-
Peter Zijlstra authored
Reflow the function a bit because GCC gets confused: kernel/sched/fair.c: In function ‘task_numa_fault’: kernel/sched/fair.c:1448:3: warning: ‘my_grp’ may be used uninitialized in this function [-Wmaybe-uninitialized] kernel/sched/fair.c:1463:27: note: ‘my_grp’ was declared here Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-6ebt6x7u64pbbonq1khqu2z9@git.kernel.org Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Short spikes of CPU load can lead to a task being migrated away from its preferred node for temporary reasons. It is important that the task is migrated back to where it belongs, in order to avoid migrating too much memory to its new location, and generally disturbing a task's NUMA location. This patch fixes NUMA placement for 4 specjbb instances on a 4 node system. Without this patch, things take longer to converge, and processes are not always completely on their own node. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-64-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
As Peter says "If you're going to hold locks you can also do away with all that atomic_long_*() nonsense". Lock acquisition moved slightly to protect the updates. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-63-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Shared faults can lead to lots of unnecessary page migrations, slowing down the system, and causing private faults to hit the per-pgdat migration ratelimit. This patch adds sysctl numa_balancing_migrate_deferred, which specifies how many shared page migrations to skip unconditionally, after each page migration that is skipped because it is a shared fault. This reduces the number of page migrations back and forth in shared fault situations. It also gives a strong preference to the tasks that are already running where most of the memory is, and to moving the other tasks to near the memory. Testing this with a much higher scan rate than the default still seems to result in fewer page migrations than before. Memory seems to be somewhat better consolidated than previously, with multi-instance specjbb runs on a 4 node system. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
With the scan rate code working (at least for multi-instance specjbb), the large hammer that is "sched: Do not migrate memory immediately after switching node" can be replaced with something smarter. Revert the temporary migration disabling and all traces of numa_migrate_seq. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
With scan rate adaptions based on whether the workload has properly converged or not there should be no need for the scan period reset hammer. Get rid of it. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Adjust numa_scan_period in task_numa_placement, depending on how much useful work the numa code can do. The more local faults there are in a given scan window the longer the period (and hence the slower the scan rate) during the next window. If there are excessive shared faults then the scan period will decrease with the amount of scaling depending on the ratio of shared/private faults. If the preferred node changes then the scan rate is reset to recheck if the task is properly placed. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
Scan rate is altered based on whether shared/private faults dominated. task_numa_group() may detect false sharing but that information is not taken into account when adapting the scan rate. Take it into account. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-58-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Due to the way the pid is truncated, and tasks are moved between CPUs by the scheduler, it is possible for the current task_numa_fault to group together tasks that do not actually share memory together. This patch adds a few easy sanity checks to task_numa_fault, joining tasks together if they share the same tsk->mm, or if the fault was on a page with an elevated mapcount, in a shared VMA. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
This patch classifies scheduler domains and runqueues into types depending on the number of tasks that care about their NUMA placement and the number that are currently running on their preferred node. The types are regular: There are tasks running that do not care about their NUMA placement. remote: There are tasks running that care about their placement but are currently running on a node remote to their ideal placement. all: No distinction. To implement this the patch tracks the number of tasks that are optimally NUMA placed (rq->nr_preferred_running) and the number of tasks running that care about their placement (nr_numa_running). The load balancer uses this information to avoid migrating ideally placed NUMA tasks as long as better options for load balancing exist. For example, it will not consider balancing between a group whose tasks are all perfectly placed and a group with remote tasks. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
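The three types described above can be derived from the two counters with plain comparisons. The following is a sketch of one plausible classification, assuming a runqueue counts as "regular" as soon as any running task does not care about NUMA placement; the struct, enum, and function names are hypothetical.

```c
/* Hypothetical per-runqueue counters, named after the changelog. */
struct rq_sketch {
	unsigned int nr_running;		/* all runnable tasks */
	unsigned int nr_numa_running;		/* tasks that care about placement */
	unsigned int nr_preferred_running;	/* tasks on their preferred node */
};

enum rq_type_sketch { RQ_REGULAR, RQ_REMOTE, RQ_ALL };

enum rq_type_sketch classify_rq_sketch(const struct rq_sketch *rq)
{
	/* Some task here does not care about its NUMA placement. */
	if (rq->nr_running > rq->nr_numa_running)
		return RQ_REGULAR;
	/* All tasks care, but at least one runs on a remote node. */
	if (rq->nr_running > rq->nr_preferred_running)
		return RQ_REMOTE;
	/* Every task is ideally placed: no distinction to make. */
	return RQ_ALL;
}
```

A balancer built on this could prefer pulling from RQ_REGULAR runqueues first, then RQ_REMOTE, and touch RQ_ALL only when nothing better is available.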
-
Rik van Riel authored
This patch separately considers task and group affinities when searching for swap candidates during NUMA placement. If tasks are part of the same group, or no group at all, the task weights are considered. Some hysteresis is added to prevent tasks within one group from getting bounced between NUMA nodes due to tiny differences. If tasks are part of different groups, the code compares group weights, in order to favor grouping task groups together. The patch also changes the group weight multiplier to be the same as the task weight multiplier, since the two are no longer added up like before. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-55-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
This patch separately considers task and group affinities when searching for swap candidates during task NUMA placement. If tasks are not part of a group or the same group then the task weights are considered. Otherwise the group weights are compared. Signed-off-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-54-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Ingo Molnar authored
Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1381141781-10992-53-git-send-email-mgorman@suse.de
-
Mel Gorman authored
Having multiple tasks in a group go through task_numa_placement simultaneously can lead to a task picking a wrong node to run on, because the group stats may be in the middle of an update. This patch avoids parallel updates by holding the numa_group lock during placement decisions. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-52-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
It is possible for a task in a numa group to call exec, and have the new (unrelated) executable inherit the numa group association from its former self. This has the potential to break numa grouping, and is trivial to fix. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-51-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
This patch uses the fraction of faults on a particular node for both task and group, to figure out the best node to place a task. If the task and group statistics disagree on what the preferred node should be then a full rescan will select the node with the best combined weight. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-50-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
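Selecting a node from per-node fault fractions can be sketched as follows. This is an illustration only: the node count, the per-mille scaling, and the function names are assumptions, and the real code may weigh task and group statistics differently.

```c
#define NR_NODES_SKETCH 4	/* assumed node count */

/* Fraction of faults on @nid, in per-mille of all recorded faults. */
unsigned long weight_sketch(const unsigned long *faults, int nid)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < NR_NODES_SKETCH; i++)
		total += faults[i];
	if (!total)
		return 0;	/* no faults recorded yet */
	return 1000 * faults[nid] / total;
}

/* Full rescan: pick the node with the best combined task + group weight. */
int best_node_sketch(const unsigned long *task_faults,
		     const unsigned long *group_faults)
{
	unsigned long w, best_w = 0;
	int nid, best = 0;

	for (nid = 0; nid < NR_NODES_SKETCH; nid++) {
		w = weight_sketch(task_faults, nid) +
		    weight_sketch(group_faults, nid);
		if (w > best_w) {
			best_w = w;
			best = nid;
		}
	}
	return best;
}
```

Using fractions rather than raw counts keeps a group with many faults from drowning out the task's own statistics: a task with faults {10, 30, 0, 0} prefers node 1 on its own, but combined with group faults {500, 100, 0, 0} the rescan settles on node 0.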
-
Peter Zijlstra authored
And here's a little something to make sure not the whole world ends up in a single group. As while we don't migrate shared executable pages, we do scan/fault on them. And since everybody links to libc, everybody ends up in the same group. Suggested-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
It is desirable to model from userspace how the scheduler groups tasks over time. This patch adds an ID to the numa_group and reports it via /proc/PID/status. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-45-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
While parallel applications tend to align their data on the cache boundary, they tend not to align on the page or THP boundary. Consequently tasks that partition their data can still "false-share" pages, presenting a problem for optimal NUMA placement. This patch uses NUMA hinting faults to chain tasks together into numa_groups. As well as storing the NID a task was running on when accessing a page, a truncated representation of the faulting PID is stored. If subsequent faults are from different PIDs it is reasonable to assume that those two tasks share a page and are candidates for being grouped together. Note that this patch makes no scheduling decisions based on the grouping information. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
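Storing a node ID together with a truncated PID can be modelled as simple bit packing. The 8-bit PID width and all names below are assumptions for this sketch; the changelog does not state the actual layout.

```c
#include <stdint.h>

#define PID_BITS_SKETCH 8	/* assumed number of PID bits kept */
#define PID_MASK_SKETCH ((1u << PID_BITS_SKETCH) - 1)

/* Pack the node ID and the low bits of the faulting PID into one value. */
uint32_t nidpid_pack_sketch(int nid, int pid)
{
	return ((uint32_t)nid << PID_BITS_SKETCH) |
	       ((uint32_t)pid & PID_MASK_SKETCH);
}

int nidpid_to_nid_sketch(uint32_t nidpid)
{
	return (int)(nidpid >> PID_BITS_SKETCH);
}

int nidpid_to_pid_sketch(uint32_t nidpid)
{
	return (int)(nidpid & PID_MASK_SKETCH);
}

/* A fault looks shared if the last recorded (truncated) PID differs. */
int fault_looks_shared_sketch(uint32_t last_nidpid, int pid)
{
	return nidpid_to_pid_sketch(last_nidpid) !=
	       (int)((uint32_t)pid & PID_MASK_SKETCH);
}
```

Because only the low bits survive, two different PIDs such as 0x1234 and 0x5634 are indistinguishable in this model, so any grouping built on the comparison has to tolerate occasional false matches.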
-
Peter Zijlstra authored
Change the per page last fault tracking to use cpu,pid instead of nid,pid. This will allow us to try and lookup the alternate task more easily. Note that even though it is the cpu that is stored in the page flags, the mpol_misplaced decision is still based on the node. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de [ Fixed build failure on 32-bit systems. ] Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
The load balancer will spread workloads across multiple NUMA nodes, in order to balance the load on the system. This means that sometimes a task's preferred node has available capacity, but moving the task there will not succeed, because that would create too large an imbalance. In that case, other NUMA nodes need to be considered. Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-42-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
A task's preferred node is selected based on the number of faults recorded for a node but the actual task_numa_migrate() conducts a global search regardless of the preferred nid. This patch checks if the preferred nid has capacity and if so, searches for a CPU within that node. This avoids a global search when the preferred node is not overloaded. Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-41-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
This patch implements a system-wide search for swap/migration candidates based on total NUMA hinting faults. It has a balance limit, however it doesn't properly consider total node balance. In the old scheme a task selected a preferred node based on the highest number of private faults recorded on the node. In this scheme, the preferred node is based on the total number of faults. If the preferred node for a task changes then task_numa_migrate will search the whole system looking for tasks to swap with that would improve both the overall compute balance and minimise the expected number of remote NUMA hinting faults. Note there is no guarantee that the node the source task is placed on by task_numa_migrate() has any relationship to the newly selected task->numa_preferred_nid due to compute overloading. Signed-off-by:
Mel Gorman <mgorman@suse.de> [ Do not swap with tasks that cannot run on source cpu] Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Fixed compiler warning on UP. ] Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-