- 03 Apr, 2019 40 commits
-
Philippe Gerum authored
This is the basic code enabling alternate control of tasks between the regular kernel and an embedded co-kernel. The changes cover the following aspects:
- extend the per-thread information block with a private area usable by the co-kernel for storing additional state information
- provide the API enabling a scheduler exchange mechanism, so that tasks can run under the control of either kernel alternatively. This includes a service to move the current task to the head domain under the control of the co-kernel, and the converse service to re-enter the root domain once the co-kernel has released such a task.
- ensure the generic context switching code can be used from any domain, serializing execution as required.
These changes have to be paired with arch-specific code further enabling context switching from the head domain.
-
Philippe Gerum authored
Announce all clock event chips as they are registered to the out-of-band tick device infrastructure, so that we can interpose on key handlers in their descriptors.
-
Philippe Gerum authored
Add the core API for enabling (regular) kernel event notifications to a co-kernel running over the head domain. For instance, such a co-kernel may need to know when a task is about to be resumed upon signal receipt, or when it gets an access fault trap. This commit adds the client-side API for enabling such notification for class of events, but does not provide the notification points per se, which comes later.
-
Philippe Gerum authored
A raw output handler (.write_raw) is added to the console descriptor for writing (short) text output unmodified, without any logging, header or preparation whatsoever, usable from any pipeline domain. The dedicated raw_printk() variant formats the output message then passes it on to the handler holding a hard spinlock, irqs off. This is a very basic debug channel for situations when resorting to the fairly complex printk() handling is not an option. Unlike early consoles, regular consoles can provide a raw output service past the boot sequence. Raw output handlers are typically provided by serial console devices.
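As an illustration, an 8250-style serial console could provide the raw handler roughly as sketched below. This is a minimal sketch, not code from this series: the port accessors (serial8250_get_port(), serial_in(), serial_out()) are borrowed from the 8250 driver internals for the sake of the example, the regular .write handler is a made-up name, and no locking or logging is done here since raw_printk() already holds a hard spinlock with IRQs off around the call.

  /* Sketch only: busy-wait on the transmitter, then push each byte unmodified. */
  static void myserial_console_write_raw(struct console *co, const char *s,
                                         unsigned int count)
  {
          struct uart_8250_port *up = serial8250_get_port(co->index);

          while (count--) {
                  while (!(serial_in(up, UART_LSR) & UART_LSR_THRE))
                          cpu_relax();            /* never sleeps, never logs */
                  serial_out(up, UART_TX, *s++);
          }
  }

  static struct console myserial_console = {
          .name           = "ttyS",
          .write          = myserial_console_write,       /* usual buffered path, not shown */
          .write_raw      = myserial_console_write_raw,   /* added by this commit */
          .device         = uart_console_device,
          .flags          = CON_PRINTBUFFER,
          .index          = -1,
  };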
-
Philippe Gerum authored
When dumping a stack backtrace, we neither need nor want to disable root stage IRQs over the head stage, where CPU migration can't happen. Conversely, we neither need nor want to disable hard IRQs from the head stage, so that latency won't skyrocket either.
-
Philippe Gerum authored
Just like printk(), dev_printk() cannot run from the head domain and/or with hard IRQs disabled. In such a case, log the output directly into the staging buffer we use for printk(). NOTE: when redirected to the buffer, the output does not include the dev_printk() header text but is merely sent as-is to the log.
-
Philippe Gerum authored
The printk() machinery cannot immediately invoke the console driver(s) when called from the head domain, since such driver code belongs to the root domain and cannot be shared between domains. Output issued from the head domain is formatted then logged into a staging buffer, and a dedicated virtual IRQ is posted to the root domain for notification. When the virtual IRQ handler runs, the contents of the staging buffer are flushed to the printk() interface anew, which may eventually pass the output on to the console drivers from such a context.
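The deferral pattern can be sketched with the classic I-pipe virtual IRQ calls. The sketch below assumes ipipe_alloc_virq(), ipipe_request_irq() and ipipe_raise_irq() are available as in earlier I-pipe releases, and the staging-buffer helpers (stage_vprintk(), flush_staged_output()) are made-up names standing in for the internals this commit adds:

  static unsigned int printk_virq;

  /* Runs over the root domain once the virtual IRQ gets delivered. */
  static void printk_flush_handler(unsigned int irq, void *cookie)
  {
          flush_staged_output();          /* hand the staged text to printk() anew */
  }

  static int __init init_deferred_printk(void)
  {
          printk_virq = ipipe_alloc_virq();
          return ipipe_request_irq(ipipe_root_domain, printk_virq,
                                   printk_flush_handler, NULL, NULL);
  }

  /* Called by printk() when invoked from the head domain. */
  void defer_printk(const char *fmt, va_list args)
  {
          stage_vprintk(fmt, args);       /* format into the staging buffer */
          ipipe_raise_irq(printk_virq);   /* notify the root domain */
  }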
-
Philippe Gerum authored
We must not allow out-of-band activity to resume while we are busy suspending the devices in the system, until the PM sleep state has been fully entered. Pair existing virtual IRQ disabling calls which only apply to the root domain with hard ones.
-
Philippe Gerum authored
We might have out-of-band code calling try_module_get() from the head domain, or from a section covered by a hard spinlock where the root domain must not reschedule. This requires the preemption management calls in try_module_get() (and the converse module_put()) to be converted to their hard variant. REVISIT: using try_module_get() from such contexts is questionable, client domains should be fixed.
-
Philippe Gerum authored
Make the KGDB stub runnable over the head domain since we may take traps and interrupts from that context too, by converting the locks to hard spinlocks.
-
Philippe Gerum authored
Context tracking is a prerequisite for FULL_NOHZ, so that the RCU subsystem can detect CPU idleness without relying on the (regular) timer tick. Out-of-band activity running over the head domain should by definition not be involved in such detection logic, as the root domain has no knowledge of what happens - and when - on the head domain whatsoever.
-
Philippe Gerum authored
There can be no CPU migration from the head stage; however, the out-of-band code currently running smp_processor_id() might have preempted the regular kernel code from within a preemptible section, which would cause false positives. These are the two reasons why we neither need nor want to do the preemption check in that case.
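A minimal sketch of the resulting early exit in lib/smp_processor_id.c, assuming the pipeline's ipipe_root_p predicate (true when running over the root stage) from the rest of this series:

  notrace unsigned int check_preemption_disabled(const char *what1,
                                                 const char *what2)
  {
          int this_cpu = raw_smp_processor_id();

          /*
           * Over the head stage no CPU migration can happen, and we may
           * have preempted in-band code from a preemptible section, so
           * the check below would only yield false positives: skip it.
           */
          if (!ipipe_root_p)
                  goto out;

          if (likely(preempt_count()))
                  goto out;
          /* ... the existing checks and printout follow ... */
  out:
          return this_cpu;
  }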
-
Philippe Gerum authored
Because of the virtualization of interrupt masking for the regular kernel code when the pipeline is enabled, atomic helpers relying on common interrupt disabling helpers such as local_irq_save/restore pairs would not be atomic anymore, leading to data corruption. This commit restores true atomicity for the atomic helpers that would be otherwise affected by interrupt virtualization.
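The conversion boils down to using the pipeline's CPU-level masking helpers in the generic fallbacks; a sketch of the idea on the asm-generic atomic fallback, assuming the hard_local_irq_save()/hard_local_irq_restore() pair introduced with the pipeline core (not the literal patch):

  #define ATOMIC_OP(op, c_op)                                             \
  static inline void atomic_##op(int i, atomic_t *v)                      \
  {                                                                       \
          unsigned long flags;                                            \
                                                                          \
          /* Mask the CPU for real, not just the virtual flag. */         \
          flags = hard_local_irq_save();                                  \
          v->counter = v->counter c_op i;                                 \
          hard_local_irq_restore(flags);                                  \
  }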
-
Philippe Gerum authored
Some inner code of the interrupt pipeline may have to traverse regular kernel code which manipulates the preemption count, expecting full serialization including with out-of-band contexts. The hard_preempt_*() variants are substituted for the original preempt_{disable,enable}() calls in these cases.
-
Philippe Gerum authored
As described in Documentation/ipipe.rst, irq_chip drivers need to be specifically adapted for dealing with interrupt pipelining safely. The basic issue to address is proper serialization between some irq_chip handlers which may be called from out-of-band context immediately upon receipt of an IRQ, and the rest of the driver which may access the same data / IO registers from the regular - in-band - context on the same CPU. This commit converts the generic irq_chip lock to a hard spinlock, which ensures such serialization.
-
Philippe Gerum authored
The latency tracer is a variant of ftrace's 'function' tracer providing detailed information about the current interrupt state at each function entry (i.e. the virtual interrupt flag and the CPU interrupt disable bit). This commit introduces the generic tracer code, which builds upon the regular ftrace API. The arch-specific code should provide ipipe_read_tsc(), a helper routine returning a 64bit monotonic time value for timestamping purposes. HAVE_IPIPE_TRACER_SUPPORT should be selected by the arch-specific code for enabling the tracer, which in turn makes CONFIG_IPIPE_TRACE available from the Kconfig interface.
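The contract expected from the arch side is modest: a cheap, monotonic 64bit timestamp. On x86, for instance, ipipe_read_tsc() could plausibly boil down to the sketch below; the actual arch implementations in this series may differ.

  #include <asm/msr.h>            /* rdtsc() */

  static inline unsigned long long ipipe_read_tsc(void)
  {
          /* Cheap, monotonic 64bit timestamp for the tracer. */
          return rdtsc();
  }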
-
Philippe Gerum authored
The out-of-band tick device manages the timer hardware by interposing on selected clockevent handlers transparently, so that a client domain (e.g. a co-kernel) eventually controls such hardware for scheduling the high-precision timer events it needs. Those events are delivered to out-of-band activities running on the head stage, unimpeded by the (only virtually) interrupt-free sections of the regular kernel code. This commit introduces the generic API for controlling the out-of-band tick device from a co-kernel. It also provides the internal API clock event chip drivers should use for enabling high-precision timing for their hardware.
-
Philippe Gerum authored
Hard spinlocks manipulate the CPU interrupt mask, without affecting the kernel preemption state in locking/unlocking operations. This type of spinlock is useful for implementing a critical section serializing concurrent accesses from both in-band and out-of-band contexts, i.e. from the root and head stages. Hard spinlocks exclusively depend on the pre-existing arch-specific bits which implement regular spinlocks. They can be seen as basic spinlocks which still affect the CPU's interrupt state, while all other spinlock types only deal with the virtual interrupt flag managed by the pipeline core - i.e. only disable interrupts for the regular in-band kernel activity.
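Conceptually, a hard lock/unlock pair boils down to the sketch below (illustrative names, not the actual implementation): the CPU interrupt mask is manipulated directly, the preemption count is left untouched, and the pre-existing arch spinlock bits are reused as-is.

  static inline unsigned long hard_lock_irqsave(arch_spinlock_t *lock)
  {
          unsigned long flags = hard_local_irq_save();    /* real CPU masking */

          arch_spin_lock(lock);           /* pre-existing arch-specific bits */
          return flags;
  }

  static inline void hard_unlock_irqrestore(arch_spinlock_t *lock,
                                            unsigned long flags)
  {
          arch_spin_unlock(lock);
          hard_local_irq_restore(flags);  /* no preempt_enable() involved */
  }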
-
Philippe Gerum authored
This commit provides the arch-independent bits for implementing the interrupt pipeline core, a lightweight layer introducing a separate, high-priority execution stage for handling all IRQs in pseudo-NMI mode, which cannot be delayed by the regular kernel code. See Documentation/ipipe.rst for details about interrupt pipelining. Architectures which support interrupt pipelining should select HAVE_IPIPE_SUPPORT, along with implementing the required arch-specific code. In such a case, CONFIG_IPIPE becomes available to the user via the Kconfig interface for enabling the feature.
-
Greg Kroah-Hartman authored
-
Heikki Krogerus authored
commit 148b0aa7 upstream. USB Type-C class driver now expects the muxes to be always assigned to the ports and not controllers, so the connections for the mux and fusb302 can be removed.
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heikki Krogerus authored
commit 23481121 upstream. It is not possible to use the parent of the port device when requesting mux handles, as the parent may be a multiport USB Type-C or PD controller. The muxes must be assigned to the ports, not the controllers. This also moves the requesting of the muxes to after the port device is initialized.
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heikki Krogerus authored
commit 495965a1 upstream. Assigning the mux to the USB Type-C port on top of fusb302. That will prepare this driver for the change in the USB Type-C class code, where the class driver will assume the muxes to be always assigned to the ports and not the controllers. Once the USB Type-C class driver has been updated, the connections between the mux and fusb302 can be dropped.
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heikki Krogerus authored
commit 78d2b54b upstream. Add a connection for the DisplayPort alternate mode. PI3USB30532 is used for muxing the port to DisplayPort on CHT platforms. The connection allows the alternate mode device to get a handle to the mux, and therefore makes it possible to use the USB Type-C connector as DisplayPort.
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heikki Krogerus authored
commit 140a4ec4 upstream. We can register all device connection descriptors with a single call to device_connections_add().
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Heikki Krogerus authored
commit cd7753d3 upstream. Introducing helpers for adding and removing multiple device connection descriptions at once.
Acked-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
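A minimal usage sketch, assuming the helpers take a zero-terminated array of struct device_connection as in the board driver converted later in this series; the endpoint and id strings below are made up, and the header locations are an assumption:

  #include <linux/device.h>               /* assumed home of the devcon API */
  #include <linux/platform_device.h>

  static struct device_connection board_connections[] = {
          {
                  .endpoint = { "port0", "i2c-pi3usb30532" },
                  .id = "typec-switch",
          },
          {
                  .endpoint = { "port0", "i2c-pi3usb30532" },
                  .id = "typec-mux",
          },
          { }     /* zero-terminated: one call covers the whole array */
  };

  static int board_probe(struct platform_device *pdev)
  {
          device_connections_add(board_connections);
          return 0;
  }

  static int board_remove(struct platform_device *pdev)
  {
          device_connections_remove(board_connections);
          return 0;
  }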
-
Xu Yu authored
commit 0803278b upstream. Syzkaller hit a 'KASAN: use-after-free Write in sanitize_ptr_alu' bug.
Call trace:
 dump_stack+0xbf/0x12e
 print_address_description+0x6a/0x280
 kasan_report+0x237/0x360
 sanitize_ptr_alu+0x85a/0x8d0
 adjust_ptr_min_max_vals+0x8f2/0x1ca0
 adjust_reg_min_max_vals+0x8ed/0x22e0
 do_check+0x1ca6/0x5d00
 bpf_check+0x9ca/0x2570
 bpf_prog_load+0xc91/0x1030
 __se_sys_bpf+0x61e/0x1f00
 do_syscall_64+0xc8/0x550
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fault injection trace:
 kfree+0xea/0x290
 free_func_state+0x4a/0x60
 free_verifier_state+0x61/0xe0
 push_stack+0x216/0x2f0   <- inject failslab
 sanitize_ptr_alu+0x2b1/0x8d0
 adjust_ptr_min_max_vals+0x8f2/0x1ca0
 adjust_reg_min_max_vals+0x8ed/0x22e0
 do_check+0x1ca6/0x5d00
 bpf_check+0x9ca/0x2570
 bpf_prog_load+0xc91/0x1030
 __se_sys_bpf+0x61e/0x1f00
 do_syscall_64+0xc8/0x550
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
When kzalloc() fails in push_stack(), free_verifier_state() will free the current verifier state. As push_stack() returns, dst_reg was restored if ptr_is_dst_reg is false. However, as a member of the cur_state, dst_reg is also freed, and an error occurs when dereferencing dst_reg. Simply fix it by testing the return value of push_stack() before restoring dst_reg.
Fixes: 979d63d5 ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
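The shape of the fix in sanitize_ptr_alu() is roughly the fragment below (a sketch; the exact identifiers and error code may differ in the stable backport): the speculative branch state returned by push_stack() is checked before dst_reg, which lives in the current verifier state, is written back.

          struct bpf_verifier_state *branch;

          branch = push_stack(env, env->insn_idx + 1, env->insn_idx, true);
          /* Only restore dst_reg if cur_state was not freed on failure. */
          if (!ptr_is_dst_reg && branch)
                  *dst_reg = tmp;
          return branch ? 0 : -EFAULT;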
-
Gao Xiang authored
commit 33bac912 upstream. After commit 419d6efc, the kernel cannot be crashed in the namei path. However, a corrupted nameoff can do harm in the process of readdir for scenarios without dm-verity as well. Fix it now.
Fixes: 3aa8ec71 ("staging: erofs: add directory operations")
Cc: <stable@vger.kernel.org> # 4.19+
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Gao Xiang authored
commit b6391ac7 upstream. Complete the read error handling paths for all three kinds of compressed pages:
1) For cache-managed pages, PG_uptodate will be checked since read_endio will unlock and SetPageUptodate for these pages;
2) For inplaced pages, read_endio cannot SetPageUptodate directly since it should be used to mark the final decompressed data; PG_error will be set with the page locked for IO error instead;
3) For staging pages, PG_error is used, which is similar to what we do for inplaced pages.
Fixes: 3883a79a ("staging: erofs: introduce VLE decompression support")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sean Christopherson authored
commit 0cf9135b upstream. The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES regardless of hardware support, under the pretense that KVM fully emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts). Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so that it's emulated on AMD hosts as well.
Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
Cc: stable@vger.kernel.org
Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Sean Christopherson authored
commit 45def77e upstream. Most (all?) x86 platforms provide a port IO based reset mechanism, e.g. OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM that it is doing a reset, e.g. Qemu jams vCPU state and resumes running. To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM: Skip pio instruction when it is emulated, not executed") changed the behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the instruction prior to exiting to userspace. Full emulation doesn't need such tricks because re-emulating the instruction will naturally handle %rip being changed to point at the reset vector. Updating %rip prior to exiting to userspace has several drawbacks:
- Userspace sees the wrong %rip on the exit, e.g. if PIO emulation fails it will likely yell about the wrong address.
- Single-step exits to userspace are effectively dropped as KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
- Behavior of PIO emulation is different depending on whether it goes down the fast path or the slow path.
Rather than skip the PIO instruction before exiting to userspace, snapshot the linear %rip and cancel PIO completion if the current value does not match the snapshot. For a 64-bit vCPU, i.e. the most common scenario, the snapshot and comparison has negligible overhead as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra VMREAD in this case. All other alternatives to snapshotting the linear %rip that don't rely on an explicit reset announcement suffer from one corner case or another. For example, canceling PIO completion on any write to %rip fails if userspace does a save/restore of %rip, and attempting to avoid that issue by canceling PIO only if %rip changed then fails if PIO collides with the reset %rip. Attempting to zero in on the exact reset vector won't work for APs, which means adding more hooks such as the vCPU's MP_STATE, and so on and so forth. Checking for a linear %rip match technically suffers from corner cases, e.g. userspace could theoretically rewrite the underlying code page and expect a different instruction to execute, or the guest hardcodes a PIO reset at 0xfffffff0, but those are far, far outside of what can be considered normal operation.
Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
Cc: <stable@vger.kernel.org>
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
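In rough terms the mechanism looks like the fragment below; kvm_get_linear_rip() and kvm_is_linear_rip() are existing KVM helpers, while the pio.linear_rip field name is an assumption for illustration:

  /* When the fast PIO exit to userspace is set up: */
          vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);

  /* When userspace completes the I/O and the vCPU resumes: */
          if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
                  return 1;       /* %rip moved (e.g. reset): drop the skip */
          return kvm_skip_emulated_instruction(vcpu);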
-
Sean Christopherson authored
commit ddba9180 upstream. KVM's API requires that ioctls be issued from the same process that created the VM. In other words, userspace can play games with a VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the creator can do anything useful. Explicitly reject device ioctls that are issued by a process other than the VM's creator, and update KVM's API documentation to extend its requirements to device ioctls.
Fixes: 852b6d57 ("kvm: add device control API")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
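The guard amounts to comparing the caller's mm with the mm that created the VM, early in the device ioctl path; a sketch of the check described above:

  static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
                               unsigned long arg)
  {
          struct kvm_device *dev = filp->private_data;

          /* Only the VM creator's process may issue device ioctls. */
          if (dev->kvm->mm != current->mm)
                  return -EIO;

          /* ... dispatch SET/GET/HAS_DEVICE_ATTR as before ... */
  }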
-
Thomas Gleixner authored
commit bebd024e upstream. The SMT disable 'nosmt' command line argument is not working properly when CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs, which are required to be brought up due to the MCE issues, cannot work, and the CPUs are then kept in a half-dead state. As the 'nosmt' functionality has become popular due to the speculative hardware vulnerabilities, the half torn down state is not a proper solution to the problem. Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is possible.
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Thomas Gleixner authored
commit 206b9235 upstream. Tianyu reported a crash in a CPU hotplug teardown callback when booting a kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot parameter. It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken forever when a bringup callback fails. Unfortunately this issue was not recognized when the CPU hotplug code was reworked, so the shortcoming just stayed in place. When a bringup callback fails, the CPU hotplug code rolls back the operation and takes the CPU offline. The 'nosmt' command line argument uses a bringup failure to abort the bringup of SMT sibling CPUs. This partial bringup is required due to the MCE misdesign on Intel CPUs. With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level teardown of a CPU, including the synchronizations in various facilities like RCU, NOHZ and others. As a consequence the teardown callbacks, which must be executed on the outgoing CPU within stop machine with interrupts disabled, are executed on the control CPU in an interrupt-enabled and preemptible context, causing the kernel to crash and burn. The pre state machine code has a different failure mode which is more subtle and results in a less obvious use-after-free crash, because the control side frees resources which are still in use by the undead CPU. But this is not an x86-only problem. Any architecture which supports the SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less likely to be triggered because in 99.99999% of the cases all bringup callbacks succeed. The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on all architectures, as the following architectures have either no hotplug support at all or not all subarchitectures support it: alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial). Crashing the kernel in such a situation is not an acceptable state either. Implement a minimal rollback variant by limiting the teardown to the point where all regular teardown callbacks have been invoked, and leave the CPU in the 'dead' idle state. This has the following consequences:
- The CPU is brought down to the point where the stop_machine takedown would happen.
- The CPU stays there forever and is idle.
- The CPU is cleared in the CPU active mask, but not in the CPU online mask, which is a legit state.
- Interrupts are not forced away from the CPU.
- All facilities which only look at the online mask would still see it, but that is the case during normal hotplug/unplug operations as well. It's just a (way) longer time frame.
This will expose issues which haven't been exposed before, or only seldom, because now the normally transient state of being non-active but online is a permanent state. In testing, this already exposed an issue vs. work queues where the vmstat code schedules work on the almost dead CPU, which ends up in an unbound workqueue and triggers 'preemptible context' warnings. This is not a problem of this change, it merely exposes an already existing issue. Still, this is better than crashing fully without a chance to debug it. This is mainly intended as a workaround for those architectures which do not support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
Fixes: 2e1a3483 ("cpu/hotplug: Split out the state walk into functions")
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Thomas Gleixner authored
commit 7dd47617 upstream. The rework of the watchdog core to use cpu_stop_work broke the watchdog cpumask on CPU hotplug. The watchdog_enable/disable() functions are now called unconditionally from the hotplug callback, i.e. even on CPUs which are not in the watchdog cpumask. As a consequence the watchdog can become unstoppable. Only invoke them when the plugged CPU is in the watchdog cpumask.
Fixes: 9cf57731 ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
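A sketch of the resulting hotplug callbacks in kernel/watchdog.c, assuming the existing watchdog_allowed_mask cpumask; only CPUs present in that mask get the enable/disable treatment:

  int lockup_detector_online_cpu(unsigned int cpu)
  {
          if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                  watchdog_enable(cpu);
          return 0;
  }

  int lockup_detector_offline_cpu(unsigned int cpu)
  {
          if (cpumask_test_cpu(cpu, &watchdog_allowed_mask))
                  watchdog_disable(cpu);
          return 0;
  }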
-
Michael Ellerman authored
commit d9470757 upstream. Chandan reported that fstests' generic/026 test hit a crash:
BUG: Unable to handle kernel data access at 0xc00000062ac40000
Faulting instruction address: 0xc000000000092240
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
CPU: 0 PID: 27828 Comm: chacl Not tainted 5.0.0-rc2-next-20190115-00001-g6de6dba64dda #1
NIP: c000000000092240 LR: c00000000066a55c CTR: 0000000000000000
REGS: c00000062c0c3430 TRAP: 0300 Not tainted (5.0.0-rc2-next-20190115-00001-g6de6dba64dda)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44000842 XER: 20000000
CFAR: 00007fff7f3108ac DAR: c00000062ac40000 DSISR: 40000000 IRQMASK: 0
GPR00: 0000000000000000 c00000062c0c36c0 c0000000017f4c00 c00000000121a660
GPR04: c00000062ac3fff9 0000000000000004 0000000000000020 00000000275b19c4
GPR08: 000000000000000c 46494c4500000000 5347495f41434c5f c0000000026073a0
GPR12: 0000000000000000 c0000000027a0000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: c00000062ea70020 c00000062c0c38d0 0000000000000002 0000000000000002
GPR24: c00000062ac3ffe8 00000000275b19c4 0000000000000001 c00000062ac30000
GPR28: c00000062c0c38d0 c00000062ac30050 c00000062ac30058 0000000000000000
NIP memcmp+0x120/0x690
LR xfs_attr3_leaf_lookup_int+0x53c/0x5b0
Call Trace:
 xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
 xfs_da3_node_lookup_int+0x32c/0x5a0
 xfs_attr_node_addname+0x170/0x6b0
 xfs_attr_set+0x2ac/0x340
 __xfs_set_acl+0xf0/0x230
 xfs_set_acl+0xd0/0x160
 set_posix_acl+0xc0/0x130
 posix_acl_xattr_set+0x68/0x110
 __vfs_setxattr+0xa4/0x110
 __vfs_setxattr_noperm+0xac/0x240
 vfs_setxattr+0x128/0x130
 setxattr+0x248/0x600
 path_setxattr+0x108/0x120
 sys_setxattr+0x28/0x40
 system_call+0x5c/0x70
Instruction dump:
7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c050000
4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 7c295040
The instruction dump decodes as:
 subfic r6,r5,8
 rlwinm r6,r6,3,0,28
 ldbrx r9,0,r3
 ldbrx r10,0,r4 <-
Which shows us doing an 8 byte load from c00000062ac3fff9, which crosses the page boundary at c00000062ac40000 and faults. It's not OK for memcmp to read past the end of the source or destination buffers if that would cross a page boundary, because we don't know that the next page is mapped. As pointed out by Segher, we can read past the end of the source or destination as long as we don't cross a 4K boundary, because that's our minimum page size on all platforms. The bug is in the code at the .Lcmp_rest_lt8bytes label. When we get there we know that s1 is 8-byte aligned and we have at least 1 byte to read, so a single 8-byte load won't read past the end of s1 and cross a page boundary. But we have to be more careful with s2. So check if it's within 8 bytes of a 4K boundary and if so go to the byte-by-byte loop.
Fixes: 2d9ee327 ("powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Tested-by: Chandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
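The actual fix lives in the powerpc assembly at .Lcmp_rest_lt8bytes; in C terms, the guard it adds amounts to the check sketched below: an unaligned 8-byte load from s2 is fine unless it would cross a 4K boundary, 4K being the smallest page size supported.

  #include <linux/sizes.h>

  /* True if an 8-byte load starting at p cannot cross a 4K boundary. */
  static inline bool can_load_8bytes(const void *p)
  {
          return ((unsigned long)p & (SZ_4K - 1)) <= SZ_4K - 8;
  }

When this check fails for s2, the tail comparison falls back to the byte-by-byte loop instead of issuing the wide load.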
-
Gautham R. Shenoy authored
commit ce9afe08 upstream. In cpu_to_drc_index(), in the case when FW_FEATURE_DRC_INFO is absent, we currently use of_read_property() to obtain the pointer to the array corresponding to the property "ibm,drc-indexes". The elements of this array are of type __be32, but are accessed without any conversion to the OS endianness, which is buggy on a Little Endian OS. Fix this by using the of_property_read_u32_index() accessor function to safely read the elements of the array.
Fixes: e83636ac ("pseries/drc-info: Search DRC properties for CPU indexes")
Cc: stable@vger.kernel.org # v4.16+
Reported-by: Pavithra R. Prakash <pavrampu@in.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[mpe: Make the WARN_ON a WARN_ON_ONCE so it's not retriggerable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
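A sketch of the endian-safe lookup using the accessor. This is a fragment: dn, thread_index, ret and the error label are assumed to come from the surrounding cpu_to_drc_index() code, and the '+ 1' assumes, as on pseries, that the first cell of "ibm,drc-indexes" holds the number of entries.

          u32 nr_drc_indexes, thread_drc_index;
          int rc;

          /* Cell 0 is the entry count. */
          rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
                                          0, &nr_drc_indexes);
          if (rc)
                  goto err_of_node_put;

          /*
           * Cells 1..n are the DRC indexes themselves, converted from
           * __be32 to CPU endianness by the accessor.
           */
          rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
                                          thread_index + 1,
                                          &thread_drc_index);
          if (rc)
                  goto err_of_node_put;

          ret = thread_drc_index;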
-
Rolf Eike Beer authored
commit 056d28d1 upstream. If libelf is not installed in the default location, compilation fails at several points; query pkg-config for the libelf location instead of assuming the default.
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/91a25e992566a7968fedc89ec80e7f4c83ad0548.1553622500.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Adrian Hunter authored
commit f3b4e06b upstream. A TSC packet can slip past MTC packets so that the timestamp appears to go backwards. One estimate is that this can be up to about 40 CPU cycles, which is certainly less than 0x1000 TSC ticks, but accept slippage of an order of magnitude more to be on the safe side.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 79b58424 ("perf tools: Add Intel PT support for decoding MTC packets")
Link: http://lkml.kernel.org/r/20190325135135.18348-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Kan Liang authored
commit e94d6b7f upstream. Perf fails to parse an uncore event alias, for example:
# perf stat -e unc_m_clockticks -a --no-merge sleep 1
event syntax error: 'unc_m_clockticks'
                     \___ parser error
The current code assumes that the event alias is from one specific PMU. To find the PMU, perf strcmps the PMU name of the event alias with the real PMU name on the system. However, the uncore event alias may be from multiple PMUs with a common prefix. The PMU name of the uncore event alias is the common prefix. For example, UNC_M_CLOCKTICKS is the clock event for the iMC, which includes 6 PMUs with the same prefix "uncore_imc" on a Skylake server. The real PMU names on the system for the iMC are uncore_imc_0 ... uncore_imc_5. strncmp() is used to only check the common prefix for an uncore event alias. With the patch:
# perf stat -e unc_m_clockticks -a --no-merge sleep 1
Performance counter stats for 'system wide':
  723,594,722  unc_m_clockticks [uncore_imc_5]
  724,001,954  unc_m_clockticks [uncore_imc_3]
  724,042,655  unc_m_clockticks [uncore_imc_1]
  724,161,001  unc_m_clockticks [uncore_imc_4]
  724,293,713  unc_m_clockticks [uncore_imc_2]
  724,340,901  unc_m_clockticks [uncore_imc_0]
  1.002090060 seconds time elapsed
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: stable@vger.kernel.org
Fixes: ea1fa48c ("perf stat: Handle different PMU names with common prefix")
Link: http://lkml.kernel.org/r/1552672814-156173-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
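The core of the change is the prefix comparison; a sketch with illustrative names (the actual patch adjusts the alias lookup in perf's pmu/parse-events code):

  #include <stdbool.h>
  #include <string.h>

  /*
   * pmu_name comes from the alias ("uncore_imc"), name is a real PMU
   * on the system ("uncore_imc_3"): match on the common prefix only.
   */
  static bool pmu_uncore_alias_match(const char *pmu_name, const char *name)
  {
          return strncmp(name, pmu_name, strlen(pmu_name)) == 0;
  }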
-