              The perfmon2 hardware monitoring interface
              ------------------------------------------
			     Stephane Eranian
			<stephane.eranian@hp.com>

I/ Introduction

   The perfmon2 interface provides access to the hardware performance counters of
   major processors. Nowadays, all processors implement some flavors of performance
   counters which capture micro-architectural level information such as the number
   of elapsed cycles, number of cache misses, and so on.

   The interface is implemented as a set of new system calls and a set of config files
   in /sys.

   It is possible to monitoring a single thread or a CPU. In either mode, applications
   can count or collect samples. System-wide monitoring is supported by running a
   monitoring session on each CPU. The interface support event-based sampling where the
   sampling period is expressed as the number of occurrences of event, instead of just a
   timeout.  This approach provides a much better granularity and flexibility.

   For performance reason, it is possible to use a kernel-level sampling buffer to minimize
   the overhead incurred by sampling. The format of the buffer, i.e., what is recorded, how
   it is recorded, and how it is exported to user-land is controlled by a kernel module called
   a custom sampling format. The current implementation comes with a default format but
   it is possible to create additional formats. There is an in-kernel registration
   interface for formats. Each format is identified by a simple string which a tool
   can pass when a monitoring session is created.

   The interface also provides support for event set and multiplexing to work around
   hardware limitations in the number of available counters or in how events can be 
   combined. Each set defines as many counters as the hardware can support. The kernel
   then multiplexes the sets. The interface supports time-base switching but also
   overflow based switching, i.e., after n overflows of designated counters.

   Applications never manipulates the actual performance counter registers. Instead they see
   a logical Performance Monitoring Unit (PMU) composed of a set of config register (PMC)
   and a set of data registers (PMD). Note that PMD are not necessarily counters, they
   can be buffers. The logical PMU is then mapped onto the actual PMU using a mapping
   table which is implemented as a kernel module. The mapping is chosen once for each
   new processor. It is visible in /sys/kernel/perfmon/pmu_desc. The kernel module
   is automatically loaded on first use.

   A monitoring session, or context, is uniquely identified by a file descriptor
   obtained when the context is created. File sharing semantics apply to access
   the context inside a process. A context is never inherited across fork. The file
   descriptor can be used to received counter overflow notifications or when the
   sampling buffer is full. It is possible to use poll/select on the descriptor 
   to wait for notifications from multiplex contexts. Similarly, the descriptor
   supports asynchronous notification via SIGIO.

   Counters are always exported as being 64-bit wide regardless of what the underlying
   hardware implements.

II/ Kernel compilation

    To enable perfmon2, you need to enable CONFIG_PERFMON

III/ OProfile interactions

    The set of features offered by perfmon2 is rich enough to support migrating
    Oprofile on top of it. That means that PMU programming and low-level interrupt
    handling could be done by perfmon2. The Oprofile sampling buffer management code
    in the kernel as well as how samples are exported to users could remain through
    the use of a custom sampling buffer format. This is how Oprofile work on Itanium.

    The current interactions with Oprofile are:
	- on X86: Both subsystems can be compiled into the same kernel. There is enforced
	          mutual exclusion between the two subsystems. When there is an Oprofile
		  session, no perfmon2 session can exist and vice-versa. Perfmon2 session
		  encapsulates both per-thread and system-wide sessions here.

	- On IA-64: Oprofile works on top of perfmon2. Oprofile being a system-wide monitoring
		    tool, the regular per-thread vs. system-wide session restrictions apply.

	- on PPC: no integration yet. You need to enable/disble one of the two subsystems
	- on MIPS: no integration yet. You need to enable/disble one of the two subsystems

IV/ User tools

    We have released a simple monitoring tool to demonstrate the feature of the
    interface. The tool is called pfmon and it comes with a simple helper library
    called libpfm. The library comes with a set of examples to show how to use the
    kernel perfmon2 interface. Visit http://perfmon2.sf.net for details.

    There maybe other tools available for perfmon2.

V/ How to program?

   The best way to learn how to program perfmon2, is to take a look at the source
   code for the examples in libpfm. The source code is available from:
		http://perfmon2.sf.net

VI/ System calls overview

   The interface is implemented by the following system calls:

   * int pfm_create_context(pfarg_ctx_t *ctx, char *fmt, void *arg, size_t arg_size)

      This function create a perfmon2 context. The type of context is per-thread by
      default unless PFM_FL_SYSTEM_WIDE is passed in ctx. The sampling format name
      is passed in fmt. Arguments to the format are passed in arg which is of size
      arg_size. Upon successful return, the file descriptor identifying the context
      is returned.

   * int pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n)
   
      This function is used to program the PMD registers. It is possible to pass
      vectors of PMDs.

   * int pfm_write_pmcs(int fd, pfarg_pmc_t *pmds, int n)
   
      This function is used to program the PMC registers. It is possible to pass
      vectors of PMDs.

   * int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
   
      This function is used to read the PMD registers. It is possible to pass
      vectors of PMDs.

   * int pfm_load_context(int fd, pfarg_load_t *load)
   
      This function is used to attach the context to a thread or CPU.
      Thread means kernel-visible thread (NPTL). The thread identification
      as obtained by gettid must be passed to load->load_target.

      To operate on another thread (not self), it is mandatory that the thread
      be stopped via ptrace().

      To attach to a CPU, the CPU number must be specified in load->load_target
      AND the call must be issued on that CPU. To monitor a CPU, a thread MUST
      be pinned on that CPU.

      Until the context is attached, the actual counters are not accessed.

   * int pfm_unload_context(int fd)

     The context is detached for the thread or CPU is was attached to.
     As a consequence monitoring is stopped.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.

   * int pfm_start(int fd, pfarg_start_t *st)

     Start monitoring. The context must be attached for this function to succeed.
     Optionally, it is possible to specify the event set on which to start using the
     st argument, otherwise just pass NULL.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.

   * int pfm_stop(int fd)

     Stop monitoring. The context must be attached for this function to succeed.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.


   * int pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n)

     This function is used to create or change event sets. By default set 0 exists.
     It is possible to create/change multiple sets in one call.

     The context must be detached for this call to succeed.

     Sets are identified by a 16-bit integer. They are sorted based on this
     set and switching occurs in a round-robin fashion.

   * int pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n)
   
     Delete event sets. The context must be detached for this call to succeed.


   * int pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *sets, int n)

     Retrieve information about event sets. In particular it is possible
     to get the number of activation of a set. It is possible to retrieve
     information about multiple sets in one call.


   * int pfm_restart(int fd)

     Indicate to the kernel that the application is done processing an overflow
     notification. A consequence of this call could be that monitoring resumes.

   * int read(fd, pfm_msg_t *msg, sizeof(pfm_msg_t))

   the regular read() system call can be used with the context file descriptor to
   receive overflow notification messages. Non-blocking read() is supported.

   Each message carry information about the overflow such as which counter overflowed
   and where the program was (interrupted instruction pointer).

   * int close(int fd)

   To destroy a context, the regular close() system call is used.


VII/ /sys interface overview

   Refer to Documentation/ABI/testing/sysfs-perfmon-* for a detailed description
   of the sysfs interface of perfmon2.

VIII/ Documentation

   Visit http://perfmon2.sf.net
