6.11. Process & Resource Statistics (Monitoring)
This document describes how PMIx answers a request for
resource-utilization statistics — per-process CPU/memory usage, node
load and memory, disk counters, and network counters — from the
PMIx_Process_monitor family of APIs down through the MCA pstat
(Process Statistics) framework and its components. It also covers
the periodic monitoring mode, in which the server keeps sampling on a
timer and streams each sample back to the requestor as a PMIx event.
For a code-level orientation aimed at contributors working inside
the framework, each pstat directory also carries an AGENTS.md
(with a CLAUDE.md symlink): src/mca/pstat/AGENTS.md for the
framework as a whole, plus one each in plinux/, pmacos/, and
test/. This document explains how the pieces fit into the wider
library; those files explain each piece’s internals in detail.
6.11.1. The Problem
A resource manager, a tool such as an activity monitor, or an application
runtime often needs to know how much CPU, memory, disk, or network a set
of processes — or a whole node — is consuming. In an HPC system those
processes are spread across many nodes, and only the PMIx server on each
node can actually read that node’s operating system. A client or tool
cannot read /proc for a process on some other node.
PMIx therefore exposes a single generic monitoring API and routes each request to wherever the data lives:
a local portion — processes or a node the receiving server owns — is satisfied by reading the local OS; and
a remote portion — anything on other nodes — is handed up to the host resource manager, which relays it to the relevant servers.
The pstat framework owns the local half: reading the local operating
system and packaging the numbers as PMIx attributes. The routing decision
itself lives just above it, in src/common/pmix_monitor.c.
6.11.2. Public API
Three functions are exported from include/pmix.h:
pmix_status_t PMIx_Process_monitor(const pmix_info_t *monitor,
                                   pmix_status_t error,
                                   const pmix_info_t directives[],
                                   size_t ndirs,
                                   pmix_info_t **results,
                                   size_t *nresults);

pmix_status_t PMIx_Process_monitor_nb(const pmix_info_t *monitor,
                                      pmix_status_t error,
                                      const pmix_info_t directives[],
                                      size_t ndirs,
                                      pmix_info_cbfunc_t cbfunc,
                                      void *cbdata);

void PMIx_Heartbeat(void);
PMIx_Process_monitor is a thin blocking wrapper: it constructs a
pmix_cb_t, calls PMIx_Process_monitor_nb, and sleeps on the
condition variable inside the callback object until the non-blocking path
completes. Both are implemented in src/common/pmix_monitor.c.
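The blocking-over-non-blocking pattern described above can be sketched in isolation. This is a minimal model, not the PMIx source: `demo_cb_t`, `demo_monitor_nb`, `demo_wakeup`, and `demo_monitor` are hypothetical names, and the stand-in non-blocking call completes inline so the sketch stays single-threaded.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pmix_cb_t: a lock, a condition variable,
 * a completion flag, and the status delivered by the callback. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
    int             status;
} demo_cb_t;

typedef void (*demo_cbfunc_t)(int status, void *cbdata);

/* Hypothetical non-blocking call. It completes inline here; a real
 * implementation would complete later, from a progress thread. */
int demo_monitor_nb(int request, demo_cbfunc_t cbfunc, void *cbdata)
{
    cbfunc(request >= 0 ? 0 : -1, cbdata);
    return 0;
}

/* Completion callback: record the status and wake the waiter. */
void demo_wakeup(int status, void *cbdata)
{
    demo_cb_t *cb = (demo_cb_t *) cbdata;

    pthread_mutex_lock(&cb->lock);
    cb->status = status;
    cb->done = true;
    pthread_cond_signal(&cb->cond);
    pthread_mutex_unlock(&cb->lock);
}

/* Blocking wrapper: start the non-blocking call, then sleep on the
 * condition variable until the callback marks the request done. */
int demo_monitor(int request)
{
    demo_cb_t cb;
    int rc;

    pthread_mutex_init(&cb.lock, NULL);
    pthread_cond_init(&cb.cond, NULL);
    cb.done = false;
    cb.status = 0;

    rc = demo_monitor_nb(request, demo_wakeup, &cb);
    if (0 == rc) {
        pthread_mutex_lock(&cb.lock);
        while (!cb.done) {              /* loop guards against spurious wakeups */
            pthread_cond_wait(&cb.cond, &cb.lock);
        }
        pthread_mutex_unlock(&cb.lock);
        rc = cb.status;
    }
    pthread_cond_destroy(&cb.cond);
    pthread_mutex_destroy(&cb.lock);
    return rc;
}
```

Checking the `done` flag under the lock means the wrapper behaves correctly whether the callback fires before or after the caller starts waiting.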
The single monitor argument is the request. Its key names what
kind of statistics are wanted; its value is a PMIX_DATA_ARRAY of
pmix_info_t naming the specific fields. The directives array
qualifies the request — which processes/nodes to target, whether to sample
once or periodically, an ID handle, and so on. As with every PMIx API, new
capability is added by defining new attributes rather than new entry
points.
PMIx_Heartbeat and the PMIX_SEND_HEARTBEAT monitor key are a
distinct, lighter-weight use of the same API, for liveness detection;
they are handled entirely in pmix_monitor.c and never reach the
pstat framework, so they are not discussed further here.
6.11.3. Request Attributes
The monitor keys (the monitor->key) that pstat understands:
| Monitor key | Meaning |
|---|---|
| | per-process statistics |
| | node load / memory (may nest network + disk) |
| | disk read/write/io counters |
| | network byte/packet/error counters |
| PMIX_MONITOR_CANCEL | cancel a previously-started periodic monitor (value = its ID) |
The directives that shape a request:
| Directive | Effect |
|---|---|
| PMIX_MONITOR_ID | caller-supplied string handle; required to later cancel a periodic monitor |
| | sample periodically every N seconds (uint32) instead of once |
| | array of |
| | array of |
| | restrict to named nodes / node IDs (used for scope resolution) |
| | do not forward any part of the request to the host RM |
The results come back under one attribute per category, each a
PMIX_DATA_ARRAY of the individual measurements:
PMIX_PROC_RESOURCE_USAGE, PMIX_NODE_RESOURCE_USAGE,
PMIX_DISK_RESOURCE_USAGE, and PMIX_NETWORK_RESOURCE_USAGE. Each
inner array is tagged with a sample time (PMIX_PROC_SAMPLE_TIME,
PMIX_NODE_SAMPLE_TIME, and so on). Because every component emits the
same attributes, a caller parses the answer identically regardless of
which component (or node) produced it.
6.11.4. From API to Framework
The path from the public call to the framework is:
1. Client / tool side. A non-server process cannot read the OS, so PMIx_Process_monitor_nb packs the monitor, error code, and directives into a PMIX_MONITOR_CMD message and sends it to its server. The reply is unpacked and handed to the caller’s callback.
2. Server side, entry. A server calling the API locally, or receiving the command from a client (pmix_server_monitor in src/server/pmix_server_ops.c), builds a pmix_cb_t carrying the monitor, directives, and requestor identity, and thread-shifts it onto the progress thread via PMIX_THREADSHIFT(cb, pmix_monitor_processing).
3. Scope resolution. pmix_monitor_processing walks the target directives to classify the request as local, remote, or both, by comparing the requested procs/nodes/pids against pmix_server_globals.clients and the local hostname/nodeid:
   - purely local: call pmix_pstat.query(...) and return its results directly;
   - purely remote: call pmix_host_server.monitor(...) (the host RM); if the host provides no monitor entry point, return PMIX_ERR_NOT_SUPPORTED;
   - both: call pmix_pstat.query for the local contribution, then pass the request up to the host and combine the local results with whatever the host returns (see hostcb / hostprocess).
4. Into pstat. pmix_pstat.query is the selected component’s query function, the boundary into the framework. It is synchronous and already on the progress thread, so it may read shared server state directly.
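The scope-resolution step can be illustrated with a small self-contained model. It is a sketch under stated assumptions rather than the logic in pmix_monitor_processing: it compares hostnames only (the real code also considers procs, pids, and node IDs), the names `demo_scope_t` and `demo_classify` are hypothetical, and treating an empty target list as "this node" is an assumption made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of a request's targets. */
typedef enum {
    DEMO_SCOPE_LOCAL,
    DEMO_SCOPE_REMOTE,
    DEMO_SCOPE_BOTH
} demo_scope_t;

/* Classify a list of target hostnames against the local hostname.
 * Assumption for this sketch: an empty target list means "this node". */
demo_scope_t demo_classify(const char *local_host,
                           const char *const targets[], size_t ntargets)
{
    int local = 0;
    int remote = 0;

    for (size_t i = 0; i < ntargets; i++) {
        if (0 == strcmp(targets[i], local_host)) {
            local = 1;      /* at least one target lives here */
        } else {
            remote = 1;     /* at least one target lives elsewhere */
        }
    }
    if (0 == ntargets || (local && !remote)) {
        return DEMO_SCOPE_LOCAL;
    }
    return (local && remote) ? DEMO_SCOPE_BOTH : DEMO_SCOPE_REMOTE;
}
```

A caller would then dispatch on the result: the local case maps to the pstat query, the remote case to the host RM, and the mixed case to both with the results combined.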
The framework itself is opened and a component selected during server
startup, in pmix_server_init (src/server/pmix_server.c), right
after the pgpu framework and before the listener starts.
6.11.5. Inside the Framework
pstat is a single-select framework: exactly one component runs.
| Component | Priority | Role |
|---|---|---|
| plinux | 80 | Reads real statistics from the Linux /proc filesystem. |
| pmacos | 80 | Reads real statistics on macOS (Darwin) through the native mach, libproc, sysctl, and IOKit interfaces. Built only on Apple hosts; never coexists with plinux. |
| test | 20 | Returns fixed, canned values with no OS access. Always built; the fallback where no native reader is present and the vehicle for CI. |
If no component can run, the base leaves an unsupported module in place
whose query returns PMIX_ERR_NOT_SUPPORTED, so the monitor API
degrades cleanly and reports the capability as absent rather than
crashing.
Selection (pstat_base_select.c) is a textbook
pmix_mca_base_select: pick the highest-priority runnable component,
cache its module in the global pmix_pstat, and call its init.
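The selection rule, including the fallback to an unsupported stub, can be modeled as follows. This is an illustrative sketch and not pmix_mca_base_select: the `demo_*` types and functions are hypothetical, `DEMO_ERR_NOT_SUPPORTED` stands in for PMIX_ERR_NOT_SUPPORTED, and the `runnable` flag abstracts whatever test the real components use to decide whether they can run.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ERR_NOT_SUPPORTED (-1)  /* stand-in for PMIX_ERR_NOT_SUPPORTED */

/* Hypothetical component descriptor. */
typedef struct {
    const char *name;
    int         priority;
    bool        runnable;            /* can this component run on this host? */
    int       (*query)(void);
} demo_component_t;

/* Query used by the working components in this sketch. */
int demo_query_ok(void)
{
    return 0;
}

static int demo_query_unsupported(void)
{
    return DEMO_ERR_NOT_SUPPORTED;
}

/* Stub left in place when no component can run. */
static const demo_component_t demo_unsupported = {
    "unsupported", -1, false, demo_query_unsupported
};

/* Return the highest-priority runnable component, or the stub. */
const demo_component_t *demo_select(const demo_component_t comps[], size_t n)
{
    const demo_component_t *best = &demo_unsupported;
    int best_pri = -1;

    for (size_t i = 0; i < n; i++) {
        if (comps[i].runnable && comps[i].priority > best_pri) {
            best = &comps[i];
            best_pri = comps[i].priority;
        }
    }
    return best;
}
```

Because the stub is a real module with a real `query`, callers never need a NULL check; an absent capability surfaces as an ordinary error code.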
6.11.5.1. The request object and its two collection modes
Each request becomes a pmix_pstat_op_t (defined in
src/mca/pstat/base/base.h) that records the target peers, the
selected fields (four all-bool “which fields?” structs —
pmix_procstats_t, pmix_ndstats_t, pmix_netstats_t,
pmix_dkstats_t), the optional device-ID filters, and — for a periodic
monitor — a libevent timer and interval.
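One way to picture the all-bool "which fields?" structs is as a flag set filled from the array of requested field names. The sketch below is an assumption-laden model: `demo_procstats_t`, its members, and the field-name strings are invented for illustration and are not the members of pmix_procstats_t or real PMIx attribute keys.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "which fields?" struct; member names are invented. */
typedef struct {
    bool cpu_time;
    bool rss;
    bool threads;
} demo_procstats_t;

/* Translate an array of requested field names into flags.
 * Returns the number of names that were not recognized. */
size_t demo_parse_procstats(const char *const fields[], size_t nfields,
                            demo_procstats_t *out)
{
    size_t unknown = 0;

    memset(out, 0, sizeof(*out));    /* nothing selected by default */
    for (size_t i = 0; i < nfields; i++) {
        if (0 == strcmp(fields[i], "cpu_time")) {
            out->cpu_time = true;
        } else if (0 == strcmp(fields[i], "rss")) {
            out->rss = true;
        } else if (0 == strcmp(fields[i], "threads")) {
            out->threads = true;
        } else {
            unknown++;
        }
    }
    return unknown;
}
```

Resolving the request into booleans once, up front, lets the collection routine test a flag per field on every sample instead of re-scanning the request.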
A component’s query fills the op (using the base parse helpers
pmix_pstat_parse_procstats / _ndstats / _netstats /
_dkstats to translate the requested-field array into the bool
structs), selects the target peers with
PMIX_PSTAT_APPEND_PEER_UNIQUE, and then drives a single collection
function, update(), which runs in one of two modes distinguished by
whether op->cb is set:
- Synchronous (op->cb != NULL). query points op->cb at a stack callback object holding an info-list builder, calls update() directly, and converts the accumulated list into the *results array returned to the caller. This produces the immediate answer for a one-shot request, and the first sample of a periodic one.
- Periodic (op->cb == NULL). When a rate was given, query appends the op to pmix_pstat_base.ops and arms the timer (PMIX_PSTAT_OP_START). Each time the timer fires, the same update() runs with op->cb == NULL: it builds a fresh result, and instead of returning it, delivers it asynchronously with PMIx_Notify_event(op->eventcode, ...) targeted at the requestor, then re-arms the timer.
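The two modes can be reduced to a single routine that branches on whether the op carries a collector. The following is a simplified model, not the component code: `demo_op_t`, `demo_collector_t`, `demo_update`, and `demo_query` are hypothetical, the sample is a canned constant, and the periodic branch only counts a simulated notification where the real code would call PMIx_Notify_event and re-arm the timer.

```c
#include <stddef.h>

#define DEMO_MAX_RESULTS 8

/* Hypothetical stand-in for the stack callback object that
 * accumulates results during a synchronous pass. */
typedef struct {
    long   values[DEMO_MAX_RESULTS];
    size_t count;
} demo_collector_t;

/* Hypothetical stand-in for pmix_pstat_op_t. */
typedef struct {
    demo_collector_t *cb;          /* non-NULL only during a synchronous pass */
    int               eventcode;
    size_t            events_sent; /* simulated notifications delivered */
    long              last_event_value;
} demo_op_t;

static long demo_read_sample(void)
{
    return 42;                     /* canned value in place of an OS read */
}

/* Single collection routine shared by both modes. */
void demo_update(demo_op_t *op)
{
    long sample = demo_read_sample();

    if (NULL != op->cb) {
        /* Synchronous: append to the caller's collector. */
        if (op->cb->count < DEMO_MAX_RESULTS) {
            op->cb->values[op->cb->count++] = sample;
        }
        return;
    }
    /* Periodic: record a simulated event delivery. */
    op->last_event_value = sample;
    op->events_sent++;
}

/* One-shot query: attach a stack collector, run update(), copy out. */
size_t demo_query(demo_op_t *op, long *results, size_t max)
{
    demo_collector_t cb = { {0}, 0 };
    size_t n;

    op->cb = &cb;
    demo_update(op);
    op->cb = NULL;                 /* later timer-driven passes see NULL */

    n = (cb.count < max) ? cb.count : max;
    for (size_t i = 0; i < n; i++) {
        results[i] = cb.values[i];
    }
    return n;
}
```

Clearing `op->cb` before returning matters in this model: the collector lives on the stack, so any later pass must not see a pointer to it.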
A one-shot op is released immediately after the synchronous pass; a
periodic op lives on pmix_pstat_base.ops until a
PMIX_MONITOR_CANCEL naming its PMIX_MONITOR_ID removes and
releases it (which also deletes the timer).
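The lifetime of a periodic op, from registration to cancellation by ID, can be sketched with a plain linked list. This is a hypothetical model: `demo_periodic_t`, `demo_ops_add`, and `demo_ops_cancel` are invented names, the list stands in for pmix_pstat_base.ops, and no timer is involved.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical periodic op, keyed by the caller-supplied monitor ID. */
typedef struct demo_periodic {
    struct demo_periodic *next;
    char                 *id;
    unsigned              interval;  /* seconds between samples */
} demo_periodic_t;

/* Register a periodic op at the head of the list; 0 on success. */
int demo_ops_add(demo_periodic_t **ops, const char *id, unsigned interval)
{
    demo_periodic_t *op = malloc(sizeof(*op));

    if (NULL == op) {
        return -1;
    }
    op->id = malloc(strlen(id) + 1);
    if (NULL == op->id) {
        free(op);
        return -1;
    }
    strcpy(op->id, id);
    op->interval = interval;
    op->next = *ops;
    *ops = op;
    return 0;
}

/* Cancel: unlink and release the op whose ID matches.
 * Returns 0 if an op was removed, -1 if the ID is unknown. */
int demo_ops_cancel(demo_periodic_t **ops, const char *id)
{
    for (demo_periodic_t **pp = ops; NULL != *pp; pp = &(*pp)->next) {
        if (0 == strcmp((*pp)->id, id)) {
            demo_periodic_t *dead = *pp;
            *pp = dead->next;
            /* a real release would also delete the timer here */
            free(dead->id);
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

The model also shows why the ID directive is needed for periodic requests: without it, a later cancel has nothing to match against.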
By default the periodic timer runs on the library’s main progress thread.
Setting the framework MCA parameter pstat_base_use_separate_thread
moves sampling onto a dedicated "PSTAT" progress thread so it cannot
perturb the main thread.
6.11.5.2. Where the numbers come from
The plinux component reads the kernel’s /proc filesystem:
| Category | Sources |
|---|---|
| per-process | |
| node | |
| disk | |
| network | |
Only the fields the caller requested are read and emitted. Values that
/proc reports in kB are normalized to MB. A PMIX_MONITOR_NODE_...
request can nest network and disk sub-arrays inside the returned node data
array, so one call can retrieve a full node picture.
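The kB-to-MB normalization can be shown on a single line in the format used by /proc/meminfo. This is a sketch and not the plinux parser: `demo_meminfo_mb` is a hypothetical helper, and dividing by 1024 (rather than 1000) is an assumption about the conversion factor.

```c
#include <stdio.h>
#include <string.h>

/* Parse one "Key:   <value> kB" line and return the value in MB.
 * Returns -1.0 if the line does not start with "key:" or has no number.
 * Assumption for this sketch: 1 MB = 1024 kB. */
double demo_meminfo_mb(const char *line, const char *key)
{
    size_t klen = strlen(key);
    unsigned long kb;

    if (0 != strncmp(line, key, klen) || ':' != line[klen]) {
        return -1.0;
    }
    /* %lu skips the padding spaces that follow the colon */
    if (1 != sscanf(line + klen + 1, "%lu", &kb)) {
        return -1.0;
    }
    return (double) kb / 1024.0;
}
```

For example, the line `MemTotal:       16384000 kB` would be reported as 16000 MB under this convention.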
The test component mirrors plinux’s control flow exactly but
substitutes fixed constants for the OS reads (two fabricated disks, three
fabricated interfaces, canned per-process and node figures), which makes
it both a portable fallback and a deterministic test double. To force it
during development:
export PMIX_MCA_pstat=test
6.11.6. Threading
Everything in pstat runs on a PMIx progress thread. query is
reached only through the thread-shift performed by the monitor API, so a
component may touch pmix_server_globals.clients and other shared
server state directly. Periodic update() calls fire from the same
event base (the main progress thread, or the dedicated "PSTAT" thread
if pstat_base_use_separate_thread is set). Result arrays are assembled
with the PMIx_Info_list_* builder helpers, and periodic samples are
delivered with the standard PMIx_Notify_event mechanism.
6.11.7. Summary
- PMIx_Process_monitor / _nb are the public entry points; PMIx_Heartbeat is a related liveness call handled outside the framework.
- src/common/pmix_monitor.c splits each request into a local part (handled by pstat) and a remote part (handed to the host RM).
- pstat is single-select: plinux reads /proc on Linux; pmacos reads the native interfaces on macOS; test returns canned values where no native reader is present; an unsupported stub covers “no component.”
- A pmix_pstat_op_t drives both one-shot collection (synchronous, results returned) and periodic monitoring (timer-driven, results pushed as events), distinguished by the op->cb field.
- Results are returned as PMIX_*_RESOURCE_USAGE data arrays, uniform across components and nodes.