      This is the basic code enabling alternate control of tasks between the
      regular kernel and an embedded co-kernel. The changes cover the
      following aspects:
      - extend the per-thread information block with a private area usable
        by the co-kernel for storing additional state information
      - provide the API enabling a scheduler exchange mechanism, so that
        tasks can run under the control of either kernel alternately. This
        includes a service to move the current task to the head domain under
        the control of the co-kernel, and the converse service to re-enter
        the root domain once the co-kernel has released the task.
      - ensure the generic context switching code can be used from any
        domain, serializing execution as required.
      These changes have to be paired with arch-specific code further
      enabling context switching from the head domain.
      Some architectures (such as Alpha) rely on include/linux/sched.h definitions
      in their mmu_context.h files.
      So include sched.h before mmu_context.h.
      Bunch of performance improvements and cleanups Zach Brown and I have
      been working on.  The code should be pretty solid at this point, though
      it could of course use more review and testing.
      The results in my testing are pretty impressive, particularly when an
      ioctx is being shared between multiple threads.  In my crappy synthetic
      benchmark, with 4 threads submitting and one thread reaping completions,
      I saw overhead in the aio code go from ~50% (mostly ioctx lock
      contention) to low single digits.  Performance with ioctx per thread
      improved too, but I'd have to rerun those benchmarks.
      The reason I've been focused on performance when the ioctx is shared is
      that for a fair number of real world completions, userspace needs the
      completions aggregated somehow - in practice people just end up
      implementing this aggregation in userspace today, but if it's done right
      we can do it much more efficiently in the kernel.
      Performance wise, the end result of this patch series is that submitting
      a kiocb writes to _no_ shared cachelines - the penalty for sharing an
      ioctx is gone there.  There's still going to be some cacheline
      contention when we deliver the completions to the aio ringbuffer (at
      least if you have interrupts being delivered on multiple cores, which
      for high end stuff you do) but I have a couple more patches not in this
      series that implement coalescing for that (by taking advantage of
      interrupt coalescing).  With that, there are basically no bottlenecks or
      performance issues to speak of in the aio code.
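      To illustrate what "writes to no shared cachelines" means on the
      submission side, here is a minimal userspace sketch.  It is not the code
      from this series: the names (ioctx_model, submit_slot, submit_model), the
      64-byte cache line size and the fixed number of submitters are all
      assumptions made for the example.  Each submitter gets a counter on its
      own cache line, and the slots are only summed on a slow path.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE      64   /* assumed cache line size */
#define MAX_SUBMITTERS 4    /* hypothetical fixed number of submitting threads */

/* One slot per submitter, aligned so that no two slots share a cache line. */
struct submit_slot {
    _Alignas(CACHELINE) unsigned long submitted;
};

/* Hypothetical stand-in for a shared ioctx. */
struct ioctx_model {
    struct submit_slot slot[MAX_SUBMITTERS];
};

/* Fast path: a submitter only writes to its own slot (its own cache line). */
static void submit_model(struct ioctx_model *ctx, unsigned int submitter)
{
    assert(submitter < MAX_SUBMITTERS);
    ctx->slot[submitter].submitted++;
}

/* Slow path: read every slot to get the aggregate count. */
static unsigned long total_submitted(const struct ioctx_model *ctx)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < MAX_SUBMITTERS; i++)
        sum += ctx->slot[i].submitted;
    return sum;
}
```

      In this model each thread would call submit_model() with its own index,
      and only an occasional reader pays the cost of touching every slot in
      total_submitted().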
      This patch:
      use_mm() is used in more places than just aio.  There's no need to mention
      callers when describing the function.
      In 2.6.34-rc1, removing the vhost_net module causes an oops in sync_mm_rss
      (called from do_exit) when the workqueue is destroyed.  This does not
      happen on net-next, or with vhost on top of 2.6.33.
      The issue seems to be introduced by
      34e55232 ("mm: avoid false sharing of
      mm_counter"), which added sync_mm_rss().  That function is passed
      task->mm and dereferences it without checking.  If the task is a kernel
      thread, mm might be NULL.  I think this might also happen e.g. with aio.
      This patch fixes the oops by calling sync_mm_rss when task->mm is set to
      NULL.  I also added BUG_ON to detect any other cases where counters get
      incremented while mm is NULL.
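      The idea behind the fix can be sketched with a small userspace model.
      This is not the kernel code: the *_model names and the simplified types
      are assumptions made for illustration, and assert() stands in for
      BUG_ON.  The task caches counter deltas locally; they are folded into the
      mm at the point where task->mm is cleared, so the later sync on the exit
      path finds nothing cached and never touches the NULL mm.

```c
#include <assert.h>
#include <stddef.h>

enum { NR_COUNTERS = 3 };            /* hypothetical number of rss counters */

struct mm_model {
    long rss[NR_COUNTERS];           /* shared counters living in the mm */
};

struct task_model {
    struct mm_model *mm;             /* NULL for a plain kernel thread */
    int cached[NR_COUNTERS];         /* per-task deltas not yet folded in */
};

/* Fold cached deltas into mm; mm is only dereferenced for non-zero deltas. */
static void sync_rss_model(struct task_model *t, struct mm_model *mm)
{
    for (int i = 0; i < NR_COUNTERS; i++) {
        if (t->cached[i]) {
            mm->rss[i] += t->cached[i];
            t->cached[i] = 0;
        }
    }
}

/* Account pages to the task; the assert plays the role of the added BUG_ON. */
static void add_rss_model(struct task_model *t, int member, int val)
{
    assert(t->mm != NULL);
    t->cached[member] += val;
}

/* A kernel thread borrows a user mm. */
static void use_mm_model(struct task_model *t, struct mm_model *mm)
{
    t->mm = mm;
}

/* The fix: flush while the mm pointer is still valid, then clear it. */
static void unuse_mm_model(struct task_model *t)
{
    sync_rss_model(t, t->mm);
    t->mm = NULL;
}

/* Exit path: syncs unconditionally, as in the report; safe once flushed. */
static void exit_model(struct task_model *t)
{
    sync_rss_model(t, t->mm);
}
```

      In this model, a thread that calls use_mm_model(), accounts some pages
      and then calls unuse_mm_model() reaches exit_model() with all cached
      deltas at zero.  Without the flush in unuse_mm_model(), exit_model()
      would dereference a NULL mm, which corresponds to the oops below.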
      The oops I observed looks like this:
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
      IP: [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
      PGD 0
      Oops: 0002 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
      CPU 2
      Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
      Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]-
      RIP: 0010:[<ffffffff810b436d>]  [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
      RSP: 0018:ffff8802379b7e60  EFLAGS: 00010202
      RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000
      RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0
      RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000
      R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0)
       ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d
      <0> 0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98
      <0> ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78
      Call Trace:
       [<ffffffff81040687>] do_exit+0x147/0x6c4
       [<ffffffffa00a3e2d>] ? handle_rx_net+0x0/0x17 [vhost_net]
       [<ffffffff81055817>] ? autoremove_wake_function+0x0/0x39
       [<ffffffff81051a6c>] ? worker_thread+0x0/0x229
       [<ffffffff810553c9>] kthreadd+0x0/0xf2
       [<ffffffff810038d4>] kernel_thread_helper+0x4/0x10
       [<ffffffff81055342>] ? kthread+0x0/0x87
       [<ffffffff810038d0>] ? kernel_thread_helper+0x0/0x10
      Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 <f0> 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74
      RIP  [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f
       RSP <ffff8802379b7e60>
      CR2: 00000000000002a8
      ---[ end trace 41603ba922beddd2 ]---
      Fixing recursive fault but reboot is needed!
      (note: handle_rx_net is a work item using the workqueue in question).
      sync_mm_rss+0x33/0x6f gave me a hint. I also tried reverting that
      commit, and the oops goes away.
      The module in question calls use_mm and later unuse_mm from a kernel
      thread.  It is when this kernel thread is destroyed that the crash
