1. 14 Oct, 2008 14 commits
    • ftrace: use ftrace_release for all dynamic ftrace functions · c0719e5a
      Steven Rostedt authored
      
      
      ftrace_release is necessary for all uses of dynamic ftrace and not just
      the archs that have CONFIG_FTRACE_MCOUNT_RECORD defined.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: fix incorrect comment style of __ftrace_enabled_save() · 37002735
      Huang Ying authored
      
      
      This patch fixes the incorrect comment style of __ftrace_enabled_save().
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: ftrace_kill_atomic() build fix · c5131ad6
      Ingo Molnar authored
      
      
      fix:
      
       kernel/built-in.o: In function `ftrace_dump':
       (.text+0x2e2ea): undefined reference to `ftrace_kill_atomic'
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: build fix · 7b928c23
      Ingo Molnar authored
      
      
      fix:
      
       In file included from init/main.c:65:
       include/linux/ftrace.h:166: error: expected ‘,’ or ‘;’ before ‘{’ token
       make[1]: *** [init/main.o] Error 1
       make: *** [init/main.o] Error 2
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: dump out ftrace buffers to console on panic · 3f5a54e3
      Steven Rostedt authored
      
      
      At OLS, many people expressed interest in having the ftrace buffers
      dumped on panic. Usually one would use kexec and examine the buffers
      after a new kernel is loaded. But sometimes the resources do not
      permit kdump and kexec, so having an option to still see the sequence
      of events leading up to the crash is very advantageous.

      This patch adds the option to have the ftrace buffers dumped to the
      console in the latency_trace format on a panic. When the option is set,
      the default number of entries per CPU buffer is lowered to 16384, since
      writing to a serial port (if that is the console) may otherwise take an
      awfully long time.
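
      A userspace sketch of the idea, not the kernel's ftrace_dump(): it
      assumes small per-CPU arrays of timestamped entries (the names
      sketch_record, sketch_dump and the tiny SKETCH_ENTRIES limit are
      hypothetical) and prints them merged in global time order, which is
      roughly what a single pass over all CPU buffers has to do. The static
      dump_ran flag mirrors the changelog note about keeping the static
      variables inside the dump function, so a second call prints nothing.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_NR_CPUS 2
#define SKETCH_ENTRIES 8	/* stands in for the per-CPU entry limit */

struct sketch_entry { unsigned long long ts; const char *msg; };
struct sketch_cpu_buf { struct sketch_entry e[SKETCH_ENTRIES]; int len, pos; };

static struct sketch_cpu_buf cpu_bufs[SKETCH_NR_CPUS];

/* Append an entry to one CPU's buffer; entries are assumed to arrive in
 * time order per CPU. A full buffer simply drops the entry here. */
static void sketch_record(int cpu, unsigned long long ts, const char *msg)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	struct sketch_cpu_buf *b = &cpu_bufs[cpu];

	if (b->len == SKETCH_ENTRIES)
		return;
	b->e[b->len].ts = ts;
	b->e[b->len].msg = msg;
	b->len++;
}

/* Print every CPU's entries merged by timestamp, at most once.
 * Returns the number of lines printed. */
static int sketch_dump(FILE *out)
{
	static int dump_ran;	/* only the first caller dumps */
	unsigned long long last = 0;
	int printed = 0;

	if (dump_ran)
		return 0;
	dump_ran = 1;

	for (;;) {
		int best = -1;

		/* pick the CPU whose next entry is the oldest */
		for (int c = 0; c < SKETCH_NR_CPUS; c++) {
			struct sketch_cpu_buf *b = &cpu_bufs[c];

			if (b->pos == b->len)
				continue;
			if (best < 0 || b->e[b->pos].ts <
			    cpu_bufs[best].e[cpu_bufs[best].pos].ts)
				best = c;
		}
		if (best < 0)
			break;	/* every CPU buffer is drained */

		struct sketch_entry *e = &cpu_bufs[best].e[cpu_bufs[best].pos++];
		assert(e->ts >= last);	/* merged output is in time order */
		last = e->ts;
		fprintf(out, "[%02d] %llu: %s\n", best, e->ts, e->msg);
		printed++;
	}
	return printed;
}
```

      Writing to a slow console is why the entry count matters: the dump
      cost grows linearly with the number of entries printed.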
      
      [
       Changes since -v1:
        Got alpine to send correctly (as well as spell check working).
        Removed config option.
        Moved the static variables into ftrace_dump itself.
        Gave printk a log level.
      ]
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: ftrace_printk doc moved · 2f2c99db
      Steven Rostedt authored
      
      
      Based on Randy Dunlap's suggestion, the ftrace_printk kernel-doc belongs
      with the ftrace_printk macro, which is what callers should use, not with
      the __ftrace_printk internal function.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: printk formatting infrastructure · dd0e545f
      Steven Rostedt authored
      
      
      This patch adds a feature that can help kernel developers debug their
      code using ftrace.
      
        int ftrace_printk(const char *fmt, ...);
      
      This records into the ftrace buffer using printf formatting. The entry
      size in the buffers is still fixed. A new entry type has been added
      that allows more than one entry to be used for a single recording.

      The start of a print record is still the same as that of the other entries.
      
      It returns the number of characters written to the ftrace buffer.
      
      For example:
      
      Having a module with the following code:
      
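      #include <linux/module.h>
      #include <linux/jiffies.h>
      #include <linux/ftrace.h>
      static int __init ftrace_print_test(void)
      {
              ftrace_printk("jiffies are %lu\n", jiffies); /* jiffies is unsigned long */
              return 0;
      }
      module_init(ftrace_print_test);
      MODULE_LICENSE("GPL");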
      
      Gives me:
      
        insmod-5441  3...1 7569us : ftrace_print_test: jiffies are 4296626666
      
      for the latency_trace file and:
      
                insmod-5441  [03]  1959.370498: ftrace_print_test jiffies are 4296626666
      
      for the trace file.
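
      A userspace model of the multi-entry layout, assuming a 16-byte payload
      per entry and hypothetical names (sketch_printk, sketch_read,
      SKETCH_PRINT, SKETCH_CONT): the formatted string is spilled over as many
      fixed-size entries as needed, the first marked as a print entry and the
      rest as continuations, and the return value is the number of
      characters stored. It is not the kernel's entry format.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_PAYLOAD 16	/* assumed fixed payload bytes per entry */
#define SKETCH_MAX_ENTRIES 64

enum sketch_type { SKETCH_PRINT, SKETCH_CONT };

struct sketch_trace_entry {
	enum sketch_type type;	/* PRINT starts a record, CONT continues it */
	int len;		/* bytes used in buf (not NUL-terminated) */
	char buf[SKETCH_PAYLOAD];
};

static struct sketch_trace_entry ring[SKETCH_MAX_ENTRIES];
static int nr_entries;

/* Format into a scratch buffer, then copy the text into consecutive
 * fixed-size entries. Returns the number of characters stored. */
static int sketch_printk(const char *fmt, ...)
{
	char tmp[256];
	va_list ap;
	int len, written = 0;
	enum sketch_type type = SKETCH_PRINT;

	va_start(ap, fmt);
	len = vsnprintf(tmp, sizeof(tmp), fmt, ap);
	va_end(ap);
	if (len < 0)
		return 0;
	if (len >= (int)sizeof(tmp))
		len = (int)sizeof(tmp) - 1;	/* output was truncated */

	while (written < len && nr_entries < SKETCH_MAX_ENTRIES) {
		struct sketch_trace_entry *e = &ring[nr_entries++];
		int chunk = len - written;

		if (chunk > SKETCH_PAYLOAD)
			chunk = SKETCH_PAYLOAD;
		e->type = type;
		e->len = chunk;
		memcpy(e->buf, tmp + written, (size_t)chunk);
		written += chunk;
		type = SKETCH_CONT;	/* later entries continue the record */
	}
	return written;
}

/* Reassemble the record starting at index start into out.
 * Returns the number of entries the record used. */
static int sketch_read(int start, char *out, size_t outsz)
{
	size_t pos = 0;
	int i = start;

	assert(outsz > 0 && start >= 0 && start < nr_entries);
	assert(ring[start].type == SKETCH_PRINT);
	do {
		size_t n = (size_t)ring[i].len;

		if (pos + n > outsz - 1)
			n = outsz - 1 - pos;
		memcpy(out + pos, ring[i].buf, n);
		pos += n;
		i++;
	} while (i < nr_entries && ring[i].type == SKETCH_CONT);
	out[pos] = '\0';
	return i - start;
}
```

      With these assumptions, "jiffies are 4296626666\n" (23 characters)
      occupies two entries: 16 bytes in the print entry and 7 in one
      continuation.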
      
      Note: Only the infrastructure should go into the kernel. It is meant to
      help other kernel developers debug. Calls to ftrace_printk are not
      intended to be left in the kernel, and should be frowned upon just
      like scattering printks around in the code.

      But having this easily at your fingertips makes debugging faster and
      gets bugs solved quicker.
      
      Maybe later on, we can hook this up to markers and have their printf
      formats pulled into the ftrace output.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: remove old pointers to mcount · fed1939c
      Steven Rostedt authored
      
      
      When an mcount pointer is recorded into a table, it is used to add or
      remove calls to mcount (replacing them with nops). If the code is removed
      by unloading a module, the pointers still exist. When the code is
      modified, a check is always made to make sure the code being replaced
      is the code expected. In other words, the code being replaced is
      compared to what it is expected to be before it is replaced.
      
      There is a very small chance that the code being replaced just happens
      to look like code that calls mcount (very small since the call to mcount
      is relative). To remove this chance, this patch adds ftrace_release to
      allow module unloading to remove the pointers to mcount within the module.
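
      A minimal sketch of what releasing a module's address range means for
      the recorded table, assuming a flat array of records with hypothetical
      names (mcount_rec, record_site, release_range) rather than the kernel's
      own structures: every record whose address falls inside the unloaded
      range is marked freed, so it is never patched again.

```c
#include <assert.h>

#define MAX_RECS 16

/* Hypothetical record of one mcount call site. */
struct mcount_rec {
	unsigned long ip;	/* address of the call site */
	int freed;		/* no longer valid; never patch it */
};

static struct mcount_rec recs[MAX_RECS];
static int nr_recs;

static void record_site(unsigned long ip)
{
	assert(nr_recs < MAX_RECS);
	recs[nr_recs].ip = ip;
	recs[nr_recs].freed = 0;
	nr_recs++;
}

/* Forget every call site inside [start, start + size), as when a module's
 * code is freed. Returns how many records were released. */
static int release_range(unsigned long start, unsigned long size)
{
	int released = 0;

	for (int i = 0; i < nr_recs; i++) {
		/* ip - start < size avoids overflow in start + size */
		if (!recs[i].freed && recs[i].ip >= start &&
		    recs[i].ip - start < size) {
			recs[i].freed = 1;
			released++;
		}
	}
	return released;
}
```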
      
      Another change is made so that functions marked with __init are not
      traced. Tracing cannot be started until after init is done anyway.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: move notrace to compiler.h · 28614889
      Steven Rostedt authored
      
      
      The notrace define belongs in compiler.h so that it can be used in
      init.h.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: rebuild everything on change to FTRACE_MCOUNT_RECORD · 29e71abf
      Steven Rostedt authored
      
      
      When enabling or disabling CONFIG_FTRACE_MCOUNT_RECORD, we want a full
      kernel compile to handle adding the __mcount_loc sections.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: enable mcount recording for modules · 90d595fe
      Steven Rostedt authored
      
      
      This patch enables loading the __mcount_loc section of modules and
      converting all the callers of mcount into nops.

      The modification is done before the init_module function is called, so
      again, we do not need to use kstop_machine to make these changes.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: mcount call site on boot nops core · 68bf21aa
      Steven Rostedt authored
      
      
      This is the infrastructure for converting the mcount call sites
      recorded by the __mcount_loc section into nops on boot. It also allows
      these sites to be used to enable tracing as normal. When the __mcount_loc
      section is used, the "ftraced" kernel thread is disabled.

      This uses the current infrastructure to record the mcount call sites
      as well as convert them to nops. The mcount function is kept as a stub
      on boot up and not converted to the ftrace_record_ip function. We use
      ftrace_record_ip only to record the sites from the table.
      
      This patch does not handle modules. That comes with a later patch.
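
      A userspace model of the boot-time pass, assuming x86-style 5-byte
      "call rel32" sites and the 5-byte NOP 0f 1f 44 00 00; the array of
      offsets stands in for __mcount_loc, and sketch_make_call and
      sketch_convert_sites are hypothetical names. It also performs the
      compare-before-replace check: a site is only patched if it still holds
      exactly the call that is expected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_INSN_SIZE 5

/* Assumed 5-byte NOP (nopl 0x0(%rax,%rax,1)); the kernel chooses its own. */
static const unsigned char sketch_nop[SKETCH_INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Encode "call target" located at address ip: opcode 0xe8 followed by a
 * little-endian rel32 displacement from the next instruction. */
static void sketch_make_call(unsigned char *out, uint64_t ip, uint64_t target)
{
	uint32_t rel = (uint32_t)(target - (ip + SKETCH_INSN_SIZE));

	out[0] = 0xe8;
	for (int i = 0; i < 4; i++)
		out[1 + i] = (unsigned char)(rel >> (8 * i));
}

/* Turn each recorded call site into a NOP, but only if it still contains
 * the expected call. Returns the number of sites converted. */
static int sketch_convert_sites(unsigned char *text, size_t text_len,
				uint64_t text_base, const uint32_t *loc,
				int nr_loc, uint64_t mcount_addr)
{
	int converted = 0;

	for (int i = 0; i < nr_loc; i++) {
		unsigned char expected[SKETCH_INSN_SIZE];

		assert((size_t)loc[i] + SKETCH_INSN_SIZE <= text_len);
		sketch_make_call(expected, text_base + loc[i], mcount_addr);
		if (memcmp(text + loc[i], expected, SKETCH_INSN_SIZE) != 0)
			continue;	/* unexpected code: leave it untouched */
		memcpy(text + loc[i], sketch_nop, SKETCH_INSN_SIZE);
		converted++;
	}
	return converted;
}
```

      Because the call is relative, the expected bytes differ per site, which
      is why they are recomputed from each site's address before comparing.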
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: ignore functions that cannot be kprobe-ed · 36dcd67a
      Ingo Molnar authored
      
      
      kprobes already has an extensive list of annotations for functions
      that should not be instrumented. Add notrace annotations to these
      functions as well.
      
      This is particularly useful for functions called by the NMI path.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • tracing: Kernel Tracepoints · 97e1c18e
      Mathieu Desnoyers authored
      Implementation of kernel tracepoints. Inspired by the Linux Kernel
      Markers. Allows complete type checking by declaring both the tracing
      statement inline functions and the probe registration/unregistration
      static inline functions within the same macro, "DEFINE_TRACE". No format
      string is required. See the tracepoint Documentation and Samples patches
      for usage examples.
      
      Taken from the documentation patch:
      
      "A tracepoint placed in code provides a hook to call a function (probe)
      that you can provide at runtime. A tracepoint can be "on" (a probe is
      connected to it) or "off" (no probe is attached). When a tracepoint is
      "off" it has no effect, except for adding a tiny time penalty (checking
      a condition for a branch) and space penalty (adding a few bytes for the
      function call at the end of the instrumented function and adding a data
      structure in a separate section).  When a tracepoint is "on", the
      function you provide is called each time the tracepoint is executed, in
      the execution context of the caller. When the function provided ends its
      execution, it returns to the caller (continuing from the tracepoint
      site).
      
      You can put tracepoints at important locations in the code. They are
      lightweight hooks that can pass an arbitrary number of parameters, whose
      prototypes are described in a tracepoint declaration placed in a header
      file."
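
      A userspace sketch of the typing idea, using hypothetical names
      (DEFINE_TRACE_SKETCH, SKETCH_PROTO, SKETCH_ARGS): one macro generates both
      a typed trace_<name>() call and a typed register_trace_<name>(), so
      registering a probe with the wrong prototype fails to compile. Unlike
      the real implementation it uses a plain fixed-size array with no RCU,
      so it is not safe against concurrent registration.

```c
#include <assert.h>
#include <stdio.h>

#define TP_SKETCH_MAX_PROBES 4

/* Wrappers that keep commas inside a single macro argument. */
#define SKETCH_PROTO(...) __VA_ARGS__
#define SKETCH_ARGS(...) __VA_ARGS__

/* Emits the typed probe array, the trace call and the registration
 * function for one tracepoint. */
#define DEFINE_TRACE_SKETCH(name, proto, args)                          \
	typedef void (*name##_probe_t)(proto);                          \
	static name##_probe_t name##_probes[TP_SKETCH_MAX_PROBES];      \
	static int name##_nr_probes;                                    \
	static inline void trace_##name(proto)                          \
	{                                                               \
		/* "off": only a load, a test and a branch */           \
		if (name##_nr_probes == 0)                              \
			return;                                         \
		for (int i = 0; i < name##_nr_probes; i++)              \
			name##_probes[i](args);                         \
	}                                                               \
	static inline int register_trace_##name(name##_probe_t probe)   \
	{                                                               \
		assert(probe != NULL);                                  \
		if (name##_nr_probes == TP_SKETCH_MAX_PROBES)           \
			return -1;                                      \
		name##_probes[name##_nr_probes++] = probe;              \
		return 0;                                               \
	}

/* Example instantiation: a hypothetical wakeup tracepoint. */
DEFINE_TRACE_SKETCH(sched_wakeup_sketch, SKETCH_PROTO(int pid, int cpu),
		    SKETCH_ARGS(pid, cpu))

static int last_pid = -1, last_cpu = -1;

/* A probe must match void (int, int); anything else is a compile error. */
static void probe_wakeup(int pid, int cpu)
{
	last_pid = pid;
	last_cpu = cpu;
}
```

      At an instrumentation site the call would be written
      trace_sched_wakeup_sketch(pid, cpu); with no probe registered it
      returns immediately.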
      
      Addition and removal of tracepoints is synchronized by RCU using the
      scheduler (and preempt_disable) as guarantees to find a quiescent state
      (this is really RCU "classic"). The update side uses rcu_barrier_sched()
      with call_rcu_sched() and the read/execute side uses
      "preempt_disable()/preempt_enable()".
      
      We make sure the previous array containing probes, which has been
      scheduled for deletion by the rcu callback, is indeed freed before we
      proceed to the next update. It therefore limits the rate of modification
      of a single tracepoint to one update per RCU period. The objective here
      is to permit fast batch add/removal of probes on _different_
      tracepoints.
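
      A single-threaded userspace model of the update rule above, using
      hypothetical names (sketch_tp, sketch_tp_add, sketch_tp_call): each
      update publishes a new NULL-terminated probe array and retires the old
      one, and the retired array is freed before the next update proceeds.
      sketch_wait_grace_period() is an empty placeholder; real code needs an
      actual grace period (call_rcu_sched() and rcu_barrier_sched() in the
      text), and the sketch has no locking between updaters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef void (*sketch_probe)(void *data, int arg);

struct sketch_tp {
	_Atomic(sketch_probe *) funcs;	/* NULL-terminated array; NULL = "off" */
	sketch_probe *old;		/* array retired by the last update */
};

/* Placeholder only: stands in for waiting until no reader can still
 * hold the retired array. */
static void sketch_wait_grace_period(void)
{
}

static int sketch_tp_add(struct sketch_tp *tp, sketch_probe probe)
{
	sketch_probe *cur, *next;
	size_t n = 0;

	assert(tp != NULL && probe != NULL);
	/* Free the previously retired array first: this is what limits a
	 * tracepoint to one update per grace period. */
	if (tp->old) {
		sketch_wait_grace_period();
		free(tp->old);
		tp->old = NULL;
	}
	cur = atomic_load(&tp->funcs);
	while (cur && cur[n])
		n++;
	next = malloc((n + 2) * sizeof(*next));
	if (!next)
		return -1;
	if (n)
		memcpy(next, cur, n * sizeof(*next));
	next[n] = probe;
	next[n + 1] = NULL;
	atomic_store(&tp->funcs, next);	/* readers see old or new array */
	tp->old = cur;			/* may still be in use by readers */
	return 0;
}

/* Reader side; in the kernel this runs with preemption disabled. */
static void sketch_tp_call(struct sketch_tp *tp, void *data, int arg)
{
	sketch_probe *funcs = atomic_load(&tp->funcs);

	if (!funcs)
		return;		/* "off": load, test, branch */
	for (; *funcs; funcs++)
		(*funcs)(data, arg);
}

/* Example probe: adds arg to an int counter passed as data. */
static void sketch_sum_probe(void *data, int arg)
{
	*(int *)data += arg;
}
```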
      
      Changelog:
      - Use #name ":" #proto as the string identifying the tracepoint in the
        tracepoint table. This ensures that no type mismatch happens due to
        the connection of a probe with the wrong type to a tracepoint declared
        with the same name in a different header.
      - Add tracepoint_entry_free_old.
      - Change __TO_TRACE to get rid of the 'i' iterator.
      
      Masami Hiramatsu <mhiramat@redhat.com>:
      Tested on x86-64.
      
      Performance impact of a tracepoint: same as markers, except that it
      adds about 70 bytes of instructions in an unlikely branch of each
      instrumented function (the for loop, the stack setup and the function
      call). It currently adds a memory read, a test and a conditional branch
      at the instrumentation site (in the hot path). Immediate values will
      eventually change this into a load immediate, test and branch, which
      removes the memory read and makes the i-cache impact smaller
      (replacing the memory read with a load immediate removes 3-4 bytes per
      site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64; it
      also saves the d-cache hit).
      
      Regarding the performance impact of tracepoints (which is comparable to
      markers): even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case used hackbench
      on a kernel where scheduler instrumentation (about 5 events in the
      scheduler code) was added.
      
      Quoting Hideo Aoki about Markers:
      
      I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
      tree, which includes several markers for LTTng, using an ia64 server.
      
      Although the immediate trace mark feature isn't implemented on ia64, there
      is no major performance regression. So I think that, from the viewpoint
      of performance impact, there is no issue with proposing to merge the
      marker point patches into Linus's tree.
      
      I prepared two kernels to evaluate. The first one was compiled without
      CONFIG_MARKERS. The second one was compiled with CONFIG_MARKERS enabled.
      
      I downloaded the original hackbench from the following URL:
      http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
      
      
      
      I ran hackbench 5 times in each condition and calculated the average and
      difference between the kernels.
      
          The parameter of hackbench: every 50 from 50 to 800
          The number of CPUs of the server: 2, 4, and 8
      
      Below are the results. As you can see, no major performance regression
      was found in any case. Even as the number of processes increases, the
      difference between the marker-enabled and marker-disabled kernels
      doesn't increase. Moreover, as the number of CPUs increases, the
      difference doesn't increase either.

      Curiously, the marker-enabled kernel is better than the marker-disabled
      kernel in more than half of the cases, although I guess this comes from
      differences in memory access patterns.
      
      * 2 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
            100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
            150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
            200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
            250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
            300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
            350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
            400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
            450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
            500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
            550 |     58.401   |      57.700  |  -0.701  |  -1.2   |
            600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
            650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
            700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
            750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
            800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
      --------------------------------------------------------------
      
      * 4 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
            100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
            150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
            200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
            250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
            300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
            350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
            400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
            450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
            500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
            550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
            600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
            650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
            700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
            750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
            800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
      --------------------------------------------------------------
      
      * 8 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
            100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
            150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
            200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
            250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
            300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
            350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
            400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
            450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
            500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
            550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
            600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
            650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
            700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
            750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
            800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
      --------------------------------------------------------------
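
      The diff columns appear to be computed as (with - without) and
      100 * (with - without) / without; this is inferred from the rows, not
      stated in the text (the first 2-CPU row gives +0.061 s and about
      +1.27 %). The helper below and its names are illustrative only. Some
      rows differ in the last digit when recomputed from the printed
      averages (100 processes on 2 CPUs gives 0.455 rather than 0.454),
      presumably because the averages were rounded before printing.

```c
#include <assert.h>

/* One table row: difference of the marker-enabled run against the
 * marker-disabled baseline, in seconds and in percent of the baseline. */
struct bench_diff {
	double sec;
	double pct;
};

static struct bench_diff compute_diff(double without_sec, double with_sec)
{
	struct bench_diff d;

	assert(without_sec > 0.0);	/* baseline must be a positive time */
	d.sec = with_sec - without_sec;	/* positive: marker-enabled was slower */
	d.pct = 100.0 * d.sec / without_sec;
	return d;
}
```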
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
      Acked-by: 'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 13 Oct, 2008 17 commits
  3. 12 Oct, 2008 4 commits
  4. 11 Oct, 2008 2 commits
    • ext4: add an option to control error handling on file data · 5bf5683a
      Hidehiro Kawai authored
      If the journal doesn't abort when it gets an IO error in file data
      blocks, the file data corruption will spread silently. Because most
      applications and commands do buffered writes without fsync(), they
      don't notice the IO error. This is dangerous for mission-critical
      systems. On the other hand, if the journal aborts whenever it gets
      an IO error in file data blocks, the system will easily become
      inoperable. So this patch introduces a filesystem option to
      determine whether to abort the journal or just call printk() when
      it gets an IO error in file data.
      
      If you mount an ext4 fs with the data_err=abort option, it aborts on a
      file data write error. If you mount it with data_err=ignore, it doesn't
      abort and just calls printk(). data_err=ignore is the default.
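
      The kernel parses mount options with its own token-matching helpers;
      the sketch below is a userspace illustration only, with hypothetical
      names (parse_data_err, handle_data_io_error), showing the option's two
      modes and the ignore default described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum data_err_mode { DATA_ERR_IGNORE, DATA_ERR_ABORT };

/* Scan a comma-separated option string for data_err=. Other options are
 * skipped; an unknown data_err value is rejected with -1. */
static int parse_data_err(const char *opts, enum data_err_mode *mode)
{
	char buf[128];

	assert(opts != NULL && mode != NULL);
	*mode = DATA_ERR_IGNORE;	/* the default */
	if (strlen(opts) >= sizeof(buf))
		return -1;
	strcpy(buf, opts);
	for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		if (strcmp(tok, "data_err=abort") == 0)
			*mode = DATA_ERR_ABORT;
		else if (strcmp(tok, "data_err=ignore") == 0)
			*mode = DATA_ERR_IGNORE;
		else if (strncmp(tok, "data_err=", 9) == 0)
			return -1;
	}
	return 0;
}

/* Decide what a file-data write error does. Returns 1 if the journal
 * would be aborted, 0 if the error is only logged. */
static int handle_data_io_error(enum data_err_mode mode)
{
	if (mode == DATA_ERR_ABORT)
		return 1;
	/* stand-in for the printk() in the ignore case */
	fprintf(stderr, "sketch: I/O error on file data, continuing\n");
	return 0;
}
```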
      
      Here is the corresponding patch of the ext3 version:
      http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374
      
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: fix error handling for checkpoint io · 44519faf
      Hidehiro Kawai authored
      
      
      When a checkpointing IO fails, the current JBD2 code doesn't check the
      error and continues journaling. This means the latest metadata can be
      lost from both the journal and the filesystem.
      
      This patch leaves the failed metadata blocks in the journal space
      and aborts journaling in the case of jbd2_log_do_checkpoint().
      To achieve this, we need to do:
      
      1. in the case of __try_to_free_cp_buf(), don't remove the failed
         buffer from the checkpoint list, because it may be released or
         overwritten by a later transaction
      2. jbd2_log_do_checkpoint() is the last chance: there, remove the
         failed buffer from the checkpoint list and abort the journal
      3. when checkpointing fails, don't update the journal super block, to
         prevent the journaled contents from being cleaned. For safety,
         don't update j_tail and j_tail_sequence either
      4. when checkpointing fails, notify the ext4 layer of the error so
         that ext4 doesn't clear the needs_recovery flag; otherwise the
         journaled contents are ignored and cleaned in the recovery phase
      5. if the recovery fails, keep the needs_recovery flag
      6. prevent jbd2_cleanup_journal_tail() from being called between
         __jbd2_journal_drop_transaction() and jbd2_journal_abort()
         (a possible race issue between jbd2_log_do_checkpoint()s called by
         jbd2_journal_flush() and __jbd2_log_wait_for_space())
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. 10 Oct, 2008 3 commits