1. 24 Jul, 2008 2 commits
    • x86_64 syscall audit fast-path · 86a1c34a
      Roland McGrath authored
      
      
      This adds a fast path for 64-bit syscall entry and exit when
      TIF_SYSCALL_AUDIT is set, but no other kind of syscall tracing.
      This path does not need to save and restore all registers as
      the general case of tracing does.  Avoiding the iret return path
      when syscall audit is enabled helps performance a lot.
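
      As a sketch of the condition this fast path tests (flag values here are
      invented for illustration; the real test lives in entry_64.S):

          /* Illustrative model of the fast-path test: take the short
           * path only when TIF_SYSCALL_AUDIT is the sole tracing-related
           * work flag set.  Flag values are made up for this sketch. */
          #include <stdio.h>

          #define TIF_SYSCALL_TRACE  (1 << 0)
          #define TIF_SYSCALL_AUDIT  (1 << 1)
          #define TIF_SECCOMP        (1 << 2)
          #define TRACE_WORK_MASK    (TIF_SYSCALL_TRACE | TIF_SYSCALL_AUDIT | TIF_SECCOMP)

          static int audit_fast_path_ok(unsigned long flags)
          {
            /* Fast path: audit pending, but no other syscall tracing. */
            return (flags & TRACE_WORK_MASK) == TIF_SYSCALL_AUDIT;
          }

          int main(void)
          {
            printf("%d\n", audit_fast_path_ok(TIF_SYSCALL_AUDIT));                     /* 1 */
            printf("%d\n", audit_fast_path_ok(TIF_SYSCALL_AUDIT | TIF_SYSCALL_TRACE)); /* 0 */
            return 0;
          }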
      Signed-off-by: Roland McGrath <roland@redhat.com>
    • x86_64: remove bogus optimization in sysret_signal · 15e8f348
      Roland McGrath authored
      
      
      This short-circuit path in sysret_signal looks wrong to me.
      AFAICT, in practice the branch is never taken--and if it were,
      it would go wrong.  To wit, try loading a module whose init
      function does set_thread_flag(TIF_IRET), and see insmod crash
      (presumably with a wrong user stack pointer).
      
      This is because the FIXUP_TOP_OF_STACK work hasn't been done yet
      when we jump around the call to ptregscall_common and get to
      int_with_check--where it expects the user RSP,SS,CS and EFLAGS to
      have been stored by FIXUP_TOP_OF_STACK.
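
      A minimal module along the lines described above might look like this
      (a sketch for kernels of that era, which still defined TIF_IRET; not
      taken from the original report):

          /* Hypothetical test module: setting TIF_IRET from an init
           * function exercised the broken short-circuit path above. */
          #include <linux/init.h>
          #include <linux/module.h>
          #include <linux/thread_info.h>

          static int __init tif_iret_init(void)
          {
            set_thread_flag(TIF_IRET);  /* request the iret return path */
            return 0;
          }

          static void __exit tif_iret_exit(void)
          {
          }

          module_init(tif_iret_init);
          module_exit(tif_iret_exit);
          MODULE_LICENSE("GPL");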
      
      I don't think it's normally possible to get to sysret_signal with no
      _TIF_DO_NOTIFY_MASK bits set anyway, so these two instructions are
      already superfluous.  If it ever did happen, it is harmless to call
      do_notify_resume with nothing for it to do.
      Signed-off-by: Roland McGrath <roland@redhat.com>
  2. 16 Jul, 2008 4 commits
  3. 12 Jul, 2008 1 commit
    • x86_64: fix delayed signals · eca91e78
      Roland McGrath authored
      
      
      On three of the several paths in entry_64.S that call
      do_notify_resume() on the way back to user mode, we fail to properly
      check again for newly-arrived work that requires another call to
      do_notify_resume() before going to user mode.  These paths set the
      mask to check only _TIF_NEED_RESCHED, but this is wrong.  The other
      paths that lead to do_notify_resume() do this correctly already, and
      entry_32.S does it correctly in all cases.
      
      All paths back to user mode have to check all the _TIF_WORK_MASK
      flags at the last possible stage, with interrupts disabled.
      Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
      set any time after we entered do_notify_resume().  More work flags
      can be set (or left set) synchronously inside do_notify_resume(), as
      TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
      (which then send an asynchronous interrupt).
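
      In rough C terms, the corrected exit path behaves like the loop below
      (a simplified model with invented flag values, not the actual
      entry_64.S assembly):

          /* Simplified model of the correct return-to-user loop: re-check
           * the full work mask each time around, so a flag set during
           * do_notify_resume() forces another pass. */
          #include <stdio.h>

          #define TIF_SIGPENDING    (1 << 0)
          #define TIF_NEED_RESCHED  (1 << 1)
          #define TIF_WORK_MASK     (TIF_SIGPENDING | TIF_NEED_RESCHED)

          static unsigned long ti_flags = TIF_SIGPENDING;

          static void do_notify_resume(void)
          {
            static int second_signal = 1;
            ti_flags &= ~TIF_SIGPENDING;      /* deliver one signal... */
            if (second_signal--)
              ti_flags |= TIF_SIGPENDING;     /* ...which leaves another pending */
          }

          int main(void)
          {
            /* interrupts are disabled at each check in the real kernel */
            while (ti_flags & TIF_WORK_MASK) {  /* all work flags, not just NEED_RESCHED */
              do_notify_resume();
              printf("did notify_resume, flags now %#lx\n", ti_flags);
            }
            puts("safe to return to user mode");
            return 0;
          }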
      
      There are many different scenarios that could hit this bug, most of
      them races.  The simplest one to demonstrate does not require any
      race: when one signal has done handler setup at the check before
      returning from a syscall, and there is another signal pending that
      should be handled.  The second signal's handler should interrupt the
      first signal handler before it actually starts (so the interrupted PC
      is still at the handler's entry point).  Instead, it runs away until
      the next kernel entry (next syscall, tick, etc).
      
      This test behaves correctly on 32-bit kernels, and fails on 64-bit
      (either 32-bit or 64-bit test binary).  With this fix, it works.
      
          #define _GNU_SOURCE
          #include <stdio.h>
          #include <signal.h>
          #include <string.h>
          #include <sys/ucontext.h>
      
          #ifndef REG_RIP
          #define REG_RIP REG_EIP
          #endif
      
          static sig_atomic_t hit1, hit2;
      
          static void
          handler (int sig, siginfo_t *info, void *ctx)
          {
            ucontext_t *uc = ctx;
      
            if ((void *) uc->uc_mcontext.gregs[REG_RIP] == &handler)
              {
                if (sig == SIGUSR1)
                  hit1 = 1;
                else
                  hit2 = 1;
              }
      
            printf ("%s at %#lx\n", strsignal (sig),
                    uc->uc_mcontext.gregs[REG_RIP]);
          }
      
          int
          main (void)
          {
            struct sigaction sa;
            sigset_t set;
      
            sigemptyset (&sa.sa_mask);
            sa.sa_flags = SA_SIGINFO;
            sa.sa_sigaction = &handler;
      
            if (sigaction (SIGUSR1, &sa, NULL)
                || sigaction (SIGUSR2, &sa, NULL))
              return 2;
      
            sigemptyset (&set);
            sigaddset (&set, SIGUSR1);
            sigaddset (&set, SIGUSR2);
            if (sigprocmask (SIG_BLOCK, &set, NULL))
              return 3;
      
            printf ("main at %p, handler at %p\n", &main, &handler);
      
            raise (SIGUSR1);
            raise (SIGUSR2);
      
            if (sigprocmask (SIG_UNBLOCK, &set, NULL))
              return 4;
      
            if (hit1 + hit2 == 1)
              {
                puts ("PASS");
                return 0;
              }
      
            puts ("FAIL");
            return 1;
          }
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 09 Jul, 2008 1 commit
  5. 08 Jul, 2008 7 commits
  6. 27 Jun, 2008 1 commit
    • x86: don't destroy %rbp on kernel-mode faults · 9d8ad5d6
      Vegard Nossum authored
      
      
      From the code:
      
          "B stepping K8s sometimes report an truncated RIP for IRET exceptions
          returning to compat mode. Check for these here too."
      
      The code then proceeds to truncate the upper 32 bits of %rbp. This means
      that when do_page_fault() is finally called, its prologue,
      
          do_page_fault:
              push %rbp
              movl %rsp, %rbp
      
      will put the truncated base pointer on the stack. This means that the
      stack tracer will not be able to follow the base-pointer changes and
      will see all subsequent stack frames as unreliable.
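
      For context, a frame-pointer stack tracer follows a chain like the one
      sketched below; a single truncated saved %rbp cuts off everything past
      it. This is a user-space illustration of the idea, not the kernel's
      tracer:

          /* Each frame's saved %rbp points at the caller's frame; the
           * walk dies as soon as one saved pointer has been mangled. */
          #include <stdio.h>

          struct stack_frame {
            struct stack_frame *next_frame;  /* saved %rbp of the caller */
            unsigned long return_address;
          };

          static void walk_stack(const struct stack_frame *fp)
          {
            while (fp) {
              printf("  ret=%#lx\n", fp->return_address);
              fp = fp->next_frame;  /* garbage here ends the walk early */
            }
          }

          int main(void)
          {
            struct stack_frame outer = { NULL,   0x1200 };
            struct stack_frame inner = { &outer, 0x1100 };
            walk_stack(&inner);
            return 0;
          }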
      
      This patch changes the code to use a different register (%rcx) for the
      checking and leaves %rbp untouched.
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 26 Jun, 2008 1 commit
  8. 23 Jun, 2008 1 commit
  9. 19 Jun, 2008 1 commit
  10. 25 May, 2008 1 commit
  11. 23 May, 2008 2 commits
    • ftrace: use dynamic patching for updating mcount calls · d61f82d0
      Steven Rostedt authored
      
      
      This patch replaces the indirect call to the mcount function
      pointer with a direct call that will be patched by the
      dynamic ftrace routines.
      
      On boot up, the mcount function calls the ftrace_stub function.
      When the dynamic ftrace code is initialized, ftrace_stub
      is replaced with a call to ftrace_record_ip, which records
      the instruction pointers of the locations that call it.
      
      Later, the ftraced daemon will call kstop_machine and patch all
      the locations to nops.
      
      When ftrace is enabled, the original calls to mcount will now
      be set to call ftrace_caller, which will do a direct call
      to the registered ftrace function. This direct call is also patched
      when the function that should be called is updated.
      
      All patching is performed by a kstop_machine routine to prevent the
      race conditions associated with modifying code on the fly.
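
      A toy model of the dispatch being replaced (it models only the
      indirect call, not the code patching itself):

          /* Before this patch, mcount dispatched through a function
           * pointer; dynamic ftrace patches the call sites instead. */
          #include <stdio.h>

          static void ftrace_stub(unsigned long ip) { (void) ip; }
          static void ftrace_record_ip(unsigned long ip)
          {
            printf("recorded call site %#lx\n", ip);
          }

          static void (*ftrace_trace_function)(unsigned long) = ftrace_stub;

          static void mcount(unsigned long ip)
          {
            ftrace_trace_function(ip);  /* the indirect call this patch removes */
          }

          int main(void)
          {
            mcount(0x1000);                            /* boot: hits ftrace_stub */
            ftrace_trace_function = ftrace_record_ip;  /* dynamic ftrace initialized */
            mcount(0x1000);                            /* now records the caller */
            return 0;
          }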
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • ftrace: add basic support for gcc profiler instrumentation · 16444a8a
      Arnaldo Carvalho de Melo authored
      
      
      If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
      set to a non-zero value, the ftrace routine will be called every time
      we enter a kernel function that is not marked with the "notrace"
      attribute.

      The ftrace routine will then call any function that has been
      registered with it.
      
      [ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
        so don't blame Arnaldo for all of this ;-) ]
      
      Update:
        It is now possible to register more than one ftrace function.
        If only one ftrace function is registered, that will be the
        function that ftrace calls directly. If more than one function
        is registered, then ftrace will call a function that will loop
        through the functions to call.
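
      A sketch of that choice between a direct call and a list-walking stub
      (simplified from the real list handling):

          /* Sketch: one registered callback is called directly; with
           * several, dispatch goes through a loop function instead. */
          #include <stdio.h>

          typedef void (*ftrace_func_t)(unsigned long ip);

          static ftrace_func_t funcs[8];
          static int nr_funcs;
          static ftrace_func_t trace_function;  /* what the tracer ends up calling */

          static void ftrace_list_func(unsigned long ip)
          {
            for (int i = 0; i < nr_funcs; i++)
              funcs[i](ip);
          }

          static void register_ftrace_function(ftrace_func_t f)
          {
            funcs[nr_funcs++] = f;
            trace_function = (nr_funcs == 1) ? f : ftrace_list_func;
          }

          static void trace_a(unsigned long ip) { printf("a: %#lx\n", ip); }
          static void trace_b(unsigned long ip) { printf("b: %#lx\n", ip); }

          int main(void)
          {
            register_ftrace_function(trace_a);
            trace_function(0x1234);  /* direct call to trace_a */
            register_ftrace_function(trace_b);
            trace_function(0x1234);  /* loops over trace_a and trace_b */
            return 0;
          }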
      Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 17 Apr, 2008 1 commit
    • x86: ptrace vs -ENOSYS · a31f8dd7
      Roland McGrath authored
      
      
      When we're stopped at syscall entry tracing, ptrace can change the %rax
      value from -ENOSYS to something else.  If no system call is actually made
      because the syscall number (now in orig_rax) is bad, then we now always
      reset %rax to -ENOSYS again.
      
      This changes it to leave the return value alone after entry tracing.
      That way, the %rax value set by ptrace is there to be seen in user mode
      (or in syscall exit tracing).  This is consistent with what the 32-bit
      kernel does.
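
      The user-visible effect can be seen with a tracer along these lines
      (a sketch: x86-64 only, error checking omitted for brevity):

          /* Stop the child at syscall entry, overwrite %rax, and check
           * that the value survives a bad syscall number at exit. */
          #define _GNU_SOURCE
          #include <signal.h>
          #include <stdio.h>
          #include <sys/ptrace.h>
          #include <sys/user.h>
          #include <sys/wait.h>
          #include <unistd.h>

          int main(void)
          {
            pid_t pid = fork();
            if (pid == 0) {
              ptrace(PTRACE_TRACEME, 0, NULL, NULL);
              raise(SIGSTOP);
              syscall(-1);  /* bad syscall number: no syscall is made */
              _exit(0);
            }

            struct user_regs_struct regs;
            waitpid(pid, NULL, 0);                    /* child's SIGSTOP */
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* run to syscall entry */
            waitpid(pid, NULL, 0);

            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            regs.rax = 42;                            /* replace -ENOSYS */
            ptrace(PTRACE_SETREGS, pid, NULL, &regs);

            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* run to syscall exit */
            waitpid(pid, NULL, 0);
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            printf("rax at exit: %lld\n", (long long) regs.rax);  /* 42 with this change */

            ptrace(PTRACE_CONT, pid, NULL, NULL);
            waitpid(pid, NULL, 0);
            return 0;
          }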
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 26 Feb, 2008 1 commit
    • x86: fix execve with -fstack-protect · 5d119b2c
      Ingo Molnar authored
      
      
      pointed out by pageexec@freemail.hu:
      
      > what happens here is that gcc treats the argument area as owned by the
      > callee, not the caller and is allowed to do certain tricks. for ssp it
      > will make a copy of the struct passed by value into the local variable
      > area and pass *its* address down, and it won't copy it back into the
      > original instance stored in the argument area.
      >
      > so once sys_execve returns, the pt_regs passed by value hasn't at all
      > changed and its default content will cause a nice double fault (FWIW,
      > this part took me the longest to debug, being down with cold didn't
      > help it either ;).
      
      To fix this we pass in pt_regs by pointer.
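
      The C semantics at the heart of this are easy to show in isolation (a
      toy, not kernel code): a callee's writes to a by-value struct argument
      are never guaranteed to reach the caller's copy, which is why the fix
      passes a pointer.

          #include <stdio.h>

          struct regs { long ip; };

          /* Writes to a by-value parameter modify, at best, a copy. */
          static void set_by_value(struct regs r)    { r.ip = 0x42; }
          static void set_by_pointer(struct regs *r) { r->ip = 0x42; }

          int main(void)
          {
            struct regs r = { 0 };
            set_by_value(r);
            printf("by value:   %#lx\n", r.ip);  /* still 0 */
            set_by_pointer(&r);
            printf("by pointer: %#lx\n", r.ip);  /* 0x42 */
            return 0;
          }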
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 19 Feb, 2008 1 commit
  15. 09 Feb, 2008 1 commit
  16. 06 Feb, 2008 2 commits
  17. 30 Jan, 2008 1 commit
  18. 25 Jan, 2008 1 commit
    • sched: high-res preemption tick · 8f4d37ec
      Peter Zijlstra authored
      
      
      Use HR-timers (when available) to deliver an accurate preemption tick.
      
      The regular scheduler tick that runs at 1/HZ can be too coarse when nice
      levels are used. The fairness system will still keep the cpu utilisation 'fair'
      by then delaying the task that got an excessive amount of CPU time, but tries to
      minimize this by delivering preemption points spot-on.

      The average frequency of this extra interrupt is sched_latency / nr_latency,
      which need not be higher than 1/HZ; it's just that the distribution within the
      sched_latency period is important.
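
      For illustration (numbers invented): with sched_latency = 20 ms and
      nr_latency = 5, the extra interrupt fires on average every 4 ms, but
      each one lands exactly when a task's fair share runs out rather than
      at the next 1/HZ tick boundary.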
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  19. 17 Oct, 2007 1 commit
  20. 11 Oct, 2007 3 commits
  21. 31 Jul, 2007 1 commit
  22. 22 Jul, 2007 1 commit
    • x86_64: support poll() on /dev/mcelog · e02e68d3
      Tim Hockin authored
      
      
      Background:
       /dev/mcelog is typically polled manually.  This is less than optimal for
       situations where accurate accounting of MCEs is important.  Calling
       poll() on /dev/mcelog does not work.
      
      Description:
       This patch adds support for poll() to /dev/mcelog.  This results in
       immediate wakeup of user apps whenever the poller finds MCEs.  Because
       the exception handler can not take any locks, it can not call the wakeup
       itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
       caught at the next return from interrupt or exit from idle, calling the
       mce_user_notify() routine.  This patch also disables the "fake panic"
       path of the mce_panic(), because it results in printk()s in the exception
       handler and crashy systems.
      
       This patch also does some small cleanup for essentially unused variables,
       and moves the user notification into the body of the poller, so it is
       only called once per poll, rather than once per CPU.
      
      Result:
       Applications can now poll() on /dev/mcelog.  When an error is logged
       (whether through the poller or through an exception) the applications are
       woken up promptly.  This should not affect any previous behaviors.  If no
       MCEs are being logged, there is no overhead.
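
       A minimal consumer might look like this (a sketch; the record format
       read from the device is left opaque):

          /* Block in poll() until the kernel reports readable MCE
           * records on /dev/mcelog, then drain them. */
          #include <fcntl.h>
          #include <poll.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
            int fd = open("/dev/mcelog", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            for (;;) {
              if (poll(&pfd, 1, -1) < 0) { perror("poll"); return 1; }
              if (pfd.revents & POLLIN) {
                char buf[4096];  /* raw records; parse per kernel headers */
                ssize_t n = read(fd, buf, sizeof buf);
                printf("read %zd bytes of MCE records\n", n);
              }
            }
          }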
      
      Alternatives:
       I considered simply supporting poll() through the poller and not using
       TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
       happening and the user application being notified is *the* most critical
       window for us.  Many uncorrectable errors can be logged to the network if
       given a chance.
      
       I also considered doing the MCE poll directly from the idle notifier, but
       decided that was overkill.
      
      Testing:
       I used an error-injecting DIMM to create lots of correctable DRAM errors
       and verified that my user app is woken up in sync with the polling interval.
       I also used the northbridge to inject uncorrectable ECC errors, and
       verified (printk() to the rescue) that the notify routine is called and the
       user app does wake up.  I built with PREEMPT on and off, and verified
       that my machine survives MCEs.
      
      [wli@holomorphy.com: build fix]
      Signed-off-by: Tim Hockin <thockin@google.com>
      Signed-off-by: William Irwin <bill.irwin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 23 Jun, 2007 1 commit
  24. 02 May, 2007 1 commit
  25. 26 Feb, 2007 1 commit
    • [PATCH] x86_64 irq: Safely cleanup an irq after moving it. · 61014292
      Eric W. Biederman authored
      
      
      The problem:  After moving an interrupt when is it safe to teardown
      the data structures for receiving the interrupt at the old location?
      
      With a normal pci device it is possible to issue a read to a device
      to flush all posted writes.  This does not work for the oldest ioapics
      because they are on a 3-wire apic bus which is a completely different
      data path.  For some more modern ioapics when everything is using
      front side bus delivery you can flush interrupts by simply issuing a
       read to the ioapic.  For other modern ioapics empirical testing has
      shown that this does not work.
      
       So it appears the only reliable way to know that the last of the irqs
       from an ioapic have been received from before the ioapic was
       reprogrammed is to receive the first irq from the ioapic after it was
       reprogrammed.

       Once we know the last irq message has been received from an ioapic
       into a local apic, we then need to know that the irq message has been
       processed through the local apics.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 15 Dec, 2006 1 commit
    • Remove stack unwinder for now · d1526e2c
      Linus Torvalds authored
      
      
      It has caused more problems than it ever really solved, and is
      apparently not getting cleaned up and fixed.  We can put it back when
      it's stable and isn't likely to make warning or bug events worse.
      
      In the meantime, enable frame pointers for more readable stack traces.
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>