1. 21 Jun, 2013 1 commit
    • tracing: Add DEFINE_EVENT_FN() macro · f5abaa1b
      Steven Rostedt authored
      
      
      Each TRACE_EVENT() adds several helper functions. If two or more trace events
      share the same structure and print format, they can also share most of these
      helper functions and save a lot of space by avoiding duplicated code. This
      is why DECLARE_EVENT_CLASS() and DEFINE_EVENT() were created.
      
      Some events require a trigger to be called when the event is registered and
      unregistered, and to do so they use TRACE_EVENT_FN().
      
      If multiple events require a trigger, they currently have no choice but to
      use TRACE_EVENT_FN(), as there's no DEFINE_EVENT_FN() available. This
      unfortunately results in a lot of wasted, duplicated code.
      
      By adding a DEFINE_EVENT_FN(), these events can still use a
      DECLARE_EVENT_CLASS() and then define their own triggers.
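      
      A minimal hedged sketch of the idea (event and callback names here are
      hypothetical, not taken from this commit): two events share one
      DECLARE_EVENT_CLASS() while each still supplies its own
      register/unregister callbacks through DEFINE_EVENT_FN():
      
          int sample_reg_fn(void);        /* hypothetical: called on register */
          void sample_unreg_fn(void);     /* hypothetical: called on unregister */
      
          DECLARE_EVENT_CLASS(sample_class,
                  TP_PROTO(unsigned long val),
                  TP_ARGS(val),
                  TP_STRUCT__entry(__field(unsigned long, val)),
                  TP_fast_assign(__entry->val = val;),
                  TP_printk("val=%lu", __entry->val)
          );
      
          /* Each event reuses the shared class but keeps its own trigger
           * callbacks, instead of duplicating the class via TRACE_EVENT_FN(). */
          DEFINE_EVENT_FN(sample_class, sample_event_a,
                  TP_PROTO(unsigned long val), TP_ARGS(val),
                  sample_reg_fn, sample_unreg_fn);
          DEFINE_EVENT_FN(sample_class, sample_event_b,
                  TP_PROTO(unsigned long val), TP_ARGS(val),
                  sample_reg_fn, sample_unreg_fn);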
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/51C3236C.8030508@hds.com
      Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      f5abaa1b
  2. 03 May, 2013 1 commit
    • ext4: fix fio regression · e30b5dca
      Yan, Zheng authored
      We (the Linux Kernel Performance project) found a regression introduced
      by commit f7fec032 ("ext4: track all extent status in extent status
      tree").
      
      The commit causes about a 20% performance decrease in the fio random
      write test.  Profiling shows that rb_next() uses a lot of CPU time.
      The call stack is:
      
        rb_next
        ext4_es_find_delayed_extent
        ext4_map_blocks
        _ext4_get_block
        ext4_get_block_write
        __blockdev_direct_IO
        ext4_direct_IO
        generic_file_direct_write
        __generic_file_aio_write
        ext4_file_write
        aio_rw_vect_retry
        aio_run_iocb
        do_io_submit
        sys_io_submit
        system_call_fastpath
        io_submit
        td_io_getevents
        io_u_queued_complete
        thread_main
        main
        __libc_start_main
      
      The cause is that ext4_es_find_delayed_extent() doesn't have an
      upper bound; it keeps searching until a delayed extent is found.
      When there are a lot of non-delayed entries in the extent status
      tree, ext4_es_find_delayed_extent() may use a lot of CPU time.
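      
      A hedged sketch of the kind of bounded walk that avoids this (variable
      and bound names here are illustrative, not the exact patch): stop the
      rb-tree scan once it passes an upper block bound instead of visiting
      every non-delayed entry:
      
          while (node) {
                  es = rb_entry(node, struct extent_status, rb_node);
                  if (es->es_lblk > end_blk)
                          return NULL;            /* past the upper bound: stop */
                  if (ext4_es_is_delayed(es))
                          return es;              /* found a delayed extent */
                  node = rb_next(node);
          }
          return NULL;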
      Reported-by: LKP project <lkp@linux.intel.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      e30b5dca
  3. 30 Apr, 2013 1 commit
  4. 29 Apr, 2013 2 commits
  5. 26 Apr, 2013 1 commit
  6. 24 Apr, 2013 1 commit
    • nohz: Fix unavailable tick_stop tracepoint in dynticks idle · 2c82d1be
      Frederic Weisbecker authored
      
      
      The trace_tick_stop() tracepoint is only available in full
      dynticks. But it's also used by dynticks-idle so let's build
      it for the latter config as well.
      
      This fixes:
      
           kernel/time/tick-sched.c: In function tick_nohz_stop_sched_tick:
           kernel/time/tick-sched.c:644: error: implicit declaration of function trace_tick_stop
           make[2]: *** [kernel/time/tick-sched.o] Erreur 1
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      2c82d1be
  7. 23 Apr, 2013 7 commits
  8. 22 Apr, 2013 3 commits
  9. 21 Apr, 2013 1 commit
    • jbd2: trace when lock_buffer in do_get_write_access takes a long time · f783f091
      Theodore Ts'o authored
      
      
      While investigating interactivity problems it was clear that processes
      sometimes stall for long periods of time if an attempt is made to
      lock a buffer which is undergoing writeback.  The stall shows up in
      a stack trace looking something like:
      
      [<ffffffff811a39de>] __lock_buffer+0x2e/0x30
      [<ffffffff8123a60f>] do_get_write_access+0x43f/0x4b0
      [<ffffffff8123a7cb>] jbd2_journal_get_write_access+0x2b/0x50
      [<ffffffff81220f79>] __ext4_journal_get_write_access+0x39/0x80
      [<ffffffff811f3198>] ext4_reserve_inode_write+0x78/0xa0
      [<ffffffff811f3209>] ext4_mark_inode_dirty+0x49/0x220
      [<ffffffff811f57d1>] ext4_dirty_inode+0x41/0x60
      [<ffffffff8119ac3e>] __mark_inode_dirty+0x4e/0x2d0
      [<ffffffff8118b9b9>] update_time+0x79/0xc0
      [<ffffffff8118ba98>] file_update_time+0x98/0x100
      [<ffffffff81110ffc>] __generic_file_aio_write+0x17c/0x3b0
      [<ffffffff811112aa>] generic_file_aio_write+0x7a/0xf0
      [<ffffffff811ea853>] ext4_file_write+0x83/0xd0
      [<ffffffff81172b23>] do_sync_write+0xa3/0xe0
      [<ffffffff811731ae>] vfs_write+0xae/0x180
      [<ffffffff8117361d>] sys_write+0x4d/0x90
      [<ffffffff8159d62d>] system_call_fastpath+0x1a/0x1f
      [<ffffffffffffffff>] 0xffffffffffffffff
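      
      A hedged sketch of the instrumentation (threshold and variable names are
      illustrative) for what the patch adds in do_get_write_access(): time how
      long lock_buffer() blocks and fire a tracepoint when the stall is long:
      
          start_time = jiffies;
          lock_buffer(bh);
          stall_ms = jiffies_to_msecs(jiffies - start_time);
          if (stall_ms > 100)                     /* threshold illustrative */
                  trace_jbd2_lock_buffer_stall(bh->b_bdev->bd_dev, stall_ms);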
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      f783f091
  10. 18 Apr, 2013 1 commit
  11. 13 Apr, 2013 1 commit
  12. 12 Apr, 2013 1 commit
    • kthread: Prevent unpark race which puts threads on the wrong cpu · f2530dc7
      Thomas Gleixner authored
      
      
      The smpboot threads rely on the park/unpark mechanism which binds
      per-cpu threads to a particular core.  However, this functionality is racy:
      
      CPU0	       	 	CPU1  	     	    CPU2
      unpark(T)				    wake_up_process(T)
        clear(SHOULD_PARK)	T runs
      			leave parkme() due to !SHOULD_PARK  
        bind_to(CPU2)		BUG_ON(wrong CPU)						    
      
      We cannot let the tasks move themselves to the target CPU as one of
      those tasks is actually the migration thread itself, which requires
      that it starts running on the target cpu right away.
      
      The solution to this problem is to prevent wakeups in park mode which
      are not from unpark(). That way we can guarantee that the association
      of the task to the target cpu is working correctly.
      
      Add a new task state (TASK_PARKED) which prevents other wakeups and
      use this state explicitly for the unpark wakeup.
      
      Peter noticed: Also, since the task state is visible to userspace and
      all the parked tasks are still in the PID space, it's a good hint in ps
      and friends that these tasks aren't really there for the moment.
      
      The migration thread has another related issue.
      
      CPU0	      	     	 CPU1
      Bring up CPU2
      create_thread(T)
      park(T)
       wait_for_completion()
      			 parkme()
      			 complete()
      sched_set_stop_task()
      			 schedule(TASK_PARKED)
      
      The sched_set_stop_task() call is issued while the task is on the
      runqueue of CPU1 and that confuses the hell out of the stop_task class
      on that cpu. So we need the same synchronization before
      sched_set_stop_task().
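      
      A hedged sketch (simplified, not the exact diff) of the mechanism: the
      parked thread sleeps in the new TASK_PARKED state, so ordinary
      wake_up_process() calls cannot wake it; only the unpark path, which
      rebinds the task to the target cpu first, wakes that state explicitly:
      
          /* park side: sleep in TASK_PARKED while SHOULD_PARK is set */
          __set_current_state(TASK_PARKED);
          while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
                  complete(&self->parked);
                  schedule();
                  __set_current_state(TASK_PARKED);
          }
          __set_current_state(TASK_RUNNING);
      
          /* unpark side: rebind while the task still cannot run, then wake it */
          clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
          __kthread_bind(task, cpu);
          wake_up_state(task, TASK_PARKED);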
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Dave Hansen <dave@sr71.net>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: dhillf@gmail.com
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f2530dc7
  13. 10 Apr, 2013 1 commit
  14. 03 Apr, 2013 1 commit
  15. 02 Apr, 2013 1 commit
    • writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Tejun Heo authored
      
      
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using an unbound workqueue instead is much simpler and more efficient.
      This patch replaces the custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated, but the following points are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles the low-memory condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
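      
      A hedged sketch of the resulting model (flag choices and helper names
      are simplified approximations, not the exact patch): a single unbound
      workqueue plus one delayed_work per bdi replaces the per-bdi flusher
      threads and the forker:
      
          /* created once at init; WQ_UNBOUND so work items aren't pinned to a cpu */
          bdi_wq = alloc_workqueue("writeback",
                                   WQ_UNBOUND | WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
      
          /* "wake up the flusher" now just means arming the delayed_work */
          static void wb_wakeup_sketch(struct backing_dev_info *bdi)
          {
                  mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
          }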
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      839a8e86
  16. 26 Mar, 2013 2 commits
  17. 23 Mar, 2013 2 commits
  18. 18 Mar, 2013 1 commit
  19. 15 Mar, 2013 6 commits
    • tracing: Add a way to soft disable trace events · 417944c4
      Steven Rostedt (Red Hat) authored
      
      
      In order to let triggers enable or disable events, we need a 'soft'
      method for doing so. For example, if a function probe is added that
      lets a user enable or disable events when a function is called, that
      change must be done without taking locks or a mutex, and it definitely
      can't sleep. But the full enabling of a tracepoint is expensive.
      
      By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
      without the protection of a mutex (using set/clear_bit()), this soft
      disable flag can be used to allow critical sections to enable or disable
      events from being traced (after the event has been placed into "SOFT_MODE").
      
      Some caveats though: The comm recorder (to map pids to a comm) cannot
      be soft disabled (yet). If you disable an event with a "soft"
      disable and wait a while before reading the trace, the comm cache may be
      replaced and you'll get a bunch of <...> for comms in the trace.
      
      Reading the "enable" file for an event that is disabled will now give
      you "0*" where the '*' denotes that the tracepoint is still active but
      the event itself is "disabled".
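      
      A hedged sketch of why atomic bit ops matter here (the flag name follows
      the commit text; the exact bit definitions may differ): a trigger firing
      from a function probe can flip the soft state without a mutex and
      without sleeping:
      
          /* no locks, no sleeping: safe from a function-probe trigger */
          static inline void sample_soft_disable(struct ftrace_event_file *file)
          {
                  set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
          }
      
          static inline void sample_soft_enable(struct ftrace_event_file *file)
          {
                  clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
          }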
      
      [ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      417944c4
    • tracing: Fix some section mismatch warnings · 523c8113
      Li Zefan authored
      As we've added __init annotation to field-defining functions, we should
      add __refdata annotation to event_call variables, which reference those
      functions.
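      
      A generic hedged sketch of the pattern being fixed (names illustrative,
      not from the patch): non-init data holding a pointer to an __init
      function must be marked __refdata, otherwise modpost warns about a
      section mismatch:
      
          static int __init sample_define_fields(void)
          {
                  return 0;       /* field definitions elided */
          }
      
          struct sample_event_call {
                  int (*define_fields)(void);
          };
      
          /* references an __init function, so the variable needs __refdata */
          static struct sample_event_call __refdata sample_event = {
                  .define_fields = sample_define_fields,
          };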
      
      Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com
      
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      523c8113
    • tracing: Annotate event field-defining functions with __init · 7e4f44b1
      Li Zefan authored
      Those functions are called either during kernel boot or module init.
      
      Before:
      
      $ dmesg | grep 'Freeing unused kernel memory'
      Freeing unused kernel memory: 1208k freed
      Freeing unused kernel memory: 1360k freed
      Freeing unused kernel memory: 1960k freed
      
      After:
      
      $ dmesg | grep 'Freeing unused kernel memory'
      Freeing unused kernel memory: 1236k freed
      Freeing unused kernel memory: 1388k freed
      Freeing unused kernel memory: 1960k freed
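      
      A small hedged sketch of the annotation itself (function name is
      illustrative):
      
          /* Called only during boot or module init; __init lets the kernel
           * discard the function's text afterwards, which is what the larger
           * "Freeing unused kernel memory" numbers above reflect. */
          static int __init sample_event_define_fields(void)
          {
                  return 0;       /* field definitions elided for brevity */
          }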
      
      Link: http://lkml.kernel.org/r/5125877D.5000201@huawei.com
      
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      7e4f44b1
    • tracing: Add a helper function for event print functions · f71130de
      Li Zefan authored
      Move duplicate code in event print functions to a helper function.
      
      This shrinks the size of the kernel by ~13K.
      
         text    data     bss     dec     hex filename
      6596137 1743966 10138672        18478775        119f6b7 vmlinux.o.old
      6583002 1743849 10138672        18465523        119c2f3 vmlinux.o.new
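      
      A hedged sketch of the idea (helper name and return handling are
      approximations, not the exact patch): each generated print function
      calls one shared prologue helper instead of repeating the same code:
      
          static enum print_line_t
          print_sample_event(struct trace_iterator *iter, int flags,
                             struct trace_event *event)
          {
                  int ret = ftrace_raw_output_prep(iter, event); /* shared prologue */
      
                  if (ret)
                          return ret;
      
                  /* only the event-specific field printing remains here */
                  return TRACE_TYPE_HANDLED;
          }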
      
      Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com
      
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      f71130de
    • tracing: Pass the ftrace_file to the buffer lock reserve code · ccb469a1
      Steven Rostedt authored
      
      
      Pass the struct ftrace_event_file *ftrace_file to
      trace_event_buffer_lock_reserve() (a new function that replaces
      trace_current_buffer_lock_reserve()).
      
      The ftrace_file holds a pointer to the trace_array that is in use.
      In the case of multiple buffers with different trace_arrays, this
      allows different events to be recorded into different buffers.
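      
      A hedged sketch of the call as described (the argument list is
      approximated from the text, not copied from the patch):
      
          /* the ftrace_file carries its trace_array, so the reserve helper can
           * pick that array's ring buffer rather than the one global buffer */
          event = trace_event_buffer_lock_reserve(&buffer, ftrace_file,
                                                  event_type, len, flags, pc);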
      
      Also fixed some of the stale comments in include/trace/ftrace.h
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ccb469a1
    • tracing: Separate out trace events from global variables · ae63b31e
      Steven Rostedt authored
      
      
      The trace events for ftrace are all defined via global variables.
      The arrays of events and event systems are linked to a global list.
      This prevents multiple users of the event system from choosing what to
      enable and what not to.
      
      Adding descriptors to represent the event/file relation, as well as
      which trace_array descriptor they are associated with, allows
      for more than one set of events to be defined. Once the trace events
      files have a link between the trace event and the trace_array they
      are associated with, we can create multiple trace_arrays that can
      record separate events in separate buffers.
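      
      A hedged sketch of the relation being introduced (member names are
      approximate): a per-instance descriptor ties one event definition to one
      trace_array, so the same event can be recorded into different buffers:
      
          struct ftrace_event_file {
                  struct list_head          list;        /* per trace_array list  */
                  struct ftrace_event_call *event_call;  /* the event definition  */
                  struct trace_array       *tr;          /* buffer set it records to */
          };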
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ae63b31e
  20. 04 Mar, 2013 1 commit
  21. 01 Mar, 2013 1 commit
    • ext4: optimize ext4_es_shrink() · 24630774
      Theodore Ts'o authored
      
      
      When the system is under memory pressure, ext4_es_shrink() will get
      called very often.  So optimize returning the number of items in the
      file system's extent status cache by keeping a per-filesystem count,
      instead of calculating it each time by scanning all of the inodes in
      the extent status cache.
      
      Also rename the slab used for the extent status cache to
      "ext4_extent_status" so it's obvious that the slab in question is
      created by ext4.
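      
      A hedged sketch of the optimized count path (field names are
      approximate): when the shrinker only asks how many objects exist, answer
      from the per-filesystem counter instead of walking every inode:
      
          if (!sc->nr_to_scan)
                  return percpu_counter_read_positive(&sbi->s_extent_cache_cnt);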
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      24630774
  22. 18 Feb, 2013 3 commits
    • ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu authored
      
      
      Although extent status is loaded on-demand, we also need to reclaim
      extents from the tree when we are under heavy memory pressure, because
      in some cases a fragmented extent tree makes the status tree cost too
      much memory.
      
      Here we maintain an lru list in the super_block.  When the extent status
      of an inode is accessed and changed, the inode is moved to the tail
      of the list.  The inode is dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in the extent status tree.  Only written/unwritten/hole
      extents are counted, because delayed extents cannot be reclaimed, as
      fiemap, bigalloc and seek_data/hole need them.  The counter is
      increased as a new extent is allocated, and decreased as an extent
      is freed.
      
      In this commit we use the normal shrinker framework to reclaim memory
      from the status tree.  ext4_es_reclaim_extents_count() traverses the lru
      list to count the number of reclaimable extents.  ext4_es_shrink() tries
      to reclaim written/unwritten/hole extents from the extent status tree.
      The inode that has been shrunk is moved to the tail of the lru list.
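      
      A hedged sketch of the lru bookkeeping described above (lock and list
      names are approximate): touching an inode's extent status moves it to
      the tail of the per-superblock list, so the shrinker reclaims from the
      least recently used inodes first:
      
          spin_lock(&sbi->s_es_lru_lock);
          if (list_empty(&ei->i_es_lru))
                  list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
          else
                  list_move_tail(&ei->i_es_lru, &sbi->s_es_lru);
          spin_unlock(&sbi->s_es_lru_lock);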
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      74cd15cd
    • ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu authored
      
      
      After tracking all extent status, we already have an extent cache in
      memory.  Every time we want to look up a block mapping, we can first
      try to look it up in the extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to do this
      work.  When we try to look up a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to look up a block mapping in the extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
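      
      A hedged sketch of the fast path described above (the mapping arithmetic
      is simplified): consult the extent status tree first and only fall back
      to the on-disk extent lookup on a miss:
      
          if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
                  /* cache hit: derive the physical block without disk I/O */
                  map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
                  map->m_len = min(es.es_len - (map->m_lblk - es.es_lblk),
                                   map->m_len);
          } else {
                  /* cache miss: fall back to the regular block-mapping path */
          }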
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      d100eef2
    • ext4: rename and improve ext4_es_find_extent() · be401363
      Zheng Liu authored
      
      
      This commit renames ext4_es_find_extent to ext4_es_find_delayed_extent
      and improves this function.  First, we split the input and output
      parameters.  Second, this function never returns the first block of the
      next delayed extent after 'es'.
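      
      A hedged sketch of the split-parameter form described above (the
      signature is approximated from the text, not copied from the patch):
      
          /* input: inode + start block; output: *es is filled in separately */
          void ext4_es_find_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
                                           struct extent_status *es);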
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      be401363