Skip to content
  • Hidetoshi Seto's avatar
    perf_event, x86, mce: Use TRACE_EVENT() for MCE logging · 8968f9d3
    Hidetoshi Seto authored
    
    
    This approach is the first baby step towards solving many of the
    structural problems the x86 MCE logging code is having today:
    
     - It has a private ring-buffer implementation that has a number
       of limitations and has been historically fragile and buggy.
    
     - It is using a quirky /dev/mcelog ioctl driven ABI that is MCE
       specific. /dev/mcelog is not part of any larger logging
       framework and hence has remained on the fringes for many years.
    
     - The MCE logging code is still very unclean partly due to its ABI
       limitations. Fields are being reused for multiple purposes, and
       the whole message structure is limited and x86 specific to begin
       with.
    
    All in one, the x86 tree would like to move away from this private
    implementation of an event logging facility to a broader framework.
    
    By using perf events we gain the following advantages:
    
     - Multiple user-space agents can access MCE events. We can have an
       mcelog daemon running but also a system-wide tracer capturing
       important events in flight-recorder mode.
    
     - Sampling support: the kernel and the user-space call-chain of MCE
       events can be stored and analyzed as well. This way actual patterns
       of bad behavior can be matched to precisely what kind of activity
       happened in the kernel (and/or in the app) around that moment in
       time.
    
     - Coupling with other hardware and software events: the PMU can track a
       number of other anomalies - monitoring software might chose to
       monitor those plus the MCE events as well - in one coherent stream of
       events.
    
     - Discovery of MCE sources - tracepoints are enumerated and tools can
       act upon the existence (or non-existence) of various channels of MCE
       information.
    
     - Filtering support: we just subscribe to and act upon the events we
       are interested in. Then even on a per event source basis there's
       in-kernel filter expressions available that can restrict the amount
       of data that hits the event channel.
    
     - Arbitrary deep per cpu buffering of events - we can buffer 32
       entries or we can buffer as much as we want, as long as we have
       the RAM.
    
     - An NMI-safe ring-buffer implementation - mappable to user-space.
    
     - Built-in support for timestamping of events, PID markers, CPU
       markers, etc.
    
     - A rich ABI accessible over system call interface. Per cpu, per task
       and per workload monitoring of MCE events can be done this way. The
       ABI itself has a nice, meaningful structure.
    
     - Extensible ABI: new fields can be added without breaking tooling.
       New tracepoints can be added as the hardware side evolves. There's
       various parsers that can be used.
    
     - Lots of scheduling/buffering/batching modes of operandi for MCE
       events. poll() support. mmap() support. read() support. You name it.
    
     - Rich tooling support: even without any MCE specific extensions added
       the 'perf' tool today offers various views of MCE data: perf report,
       perf stat, perf trace can all be used to view logged MCE events and
       perhaps correlate them to certain user-space usage patterns. But it
       can be used directly as well, for user-space agents and policy action
       in mcelog, etc.
    
    With this we hope to achieve significant code cleanup and feature
    improvements in the MCE code, and we hope to be able to drop the
    /dev/mcelog facility in the end.
    
    This patch is just a plain dumb dump of mce_log() records to
    the tracepoints / perf events framework - a first proof of
    concept step.
    
    Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    LKML-Reference: <4AD42A0D.7050104@jp.fujitsu.com>
    Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
    8968f9d3