Commit 9e906a9

Merge tag 'perf-tools-for-v6.19-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools updates from Namhyung Kim:
 "Perf event/metric description:

  Unify all event and metric descriptions in JSON format. Now event
  parsing and handling is greatly simplified by that.

  From users point of view, perf list will provide richer information
  about hardware events like the following.

    $ perf list hw

    List of pre-defined events (to be used in -e or -M):

    legacy hardware:
      branch-instructions
           [Retired branch instructions [This event is an alias of branches]. Unit: cpu]
      branch-misses
           [Mispredicted branch instructions. Unit: cpu]
      branches
           [Retired branch instructions [This event is an alias of branch-instructions]. Unit: cpu]
      bus-cycles
           [Bus cycles,which can be different from total cycles. Unit: cpu]
      cache-misses
           [Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the PERF_COUNT_HW_CACHE_REFERENCES event to calculate cache miss rates. Unit: cpu]
      cache-references
           [Cache accesses. Usually this indicates Last Level Cache accesses but this may vary depending on your CPU. This may include prefetches and coherency messages; again this depends on the design of your CPU. Unit: cpu]
      cpu-cycles
           [Total cycles. Be wary of what happens during CPU frequency scaling [This event is an alias of cycles]. Unit: cpu]
      cycles
           [Total cycles. Be wary of what happens during CPU frequency scaling [This event is an alias of cpu-cycles]. Unit: cpu]
      instructions
           [Retired instructions. Be careful,these can be affected by various issues,most notably hardware interrupt counts. Unit: cpu]
      ref-cycles
           [Total cycles; not affected by CPU frequency scaling. Unit: cpu]

  But most notable changes would be in the perf stat. On the right side,
  the default metrics are better named and aligned. :)

    $ perf stat -- perf test -w noploop

     Performance counter stats for 'perf test -w noploop':

                    11      context-switches   #     10.8 cs/sec          cs_per_second
                     0      cpu-migrations     #      0.0 migrations/sec  migrations_per_second
                 3,612      page-faults        #   3532.5 faults/sec      page_faults_per_second
              1,022.51 msec task-clock         #      1.0 CPUs            CPUs_utilized
               110,466      branch-misses      #      0.0 %               branch_miss_rate       (88.66%)
         6,934,452,104      branches           #   6781.8 M/sec           branch_frequency       (88.66%)
         4,657,032,590      cpu-cycles         #      4.6 GHz             cycles_frequency       (88.65%)
        27,755,874,218      instructions       #      6.0 instructions    insn_per_cycle         (89.03%)
                            TopdownL1          #      0.3 %               tma_backend_bound
                                               #      9.3 %               tma_bad_speculation    (89.05%)
                                               #      9.7 %               tma_frontend_bound     (77.86%)
                                               #     80.7 %               tma_retiring           (88.81%)

           1.025318171 seconds time elapsed

           1.013248000 seconds user
           0.012014000 seconds sys

  Deferred unwinding support:

  With the kernel support (commit c69993e: "perf: Support deferred
  user unwind"), perf can use deferred callchains for userspace stack
  trace with frame pointers like below:

    $ perf record --call-graph fp,defer ...

  This will be transparent to users when it comes to other commands like
  perf report and perf script. They will merge the deferred callchains
  to the previous samples as if they were collected together.

  ARM SPE updates

   - Extensive enhancements to support various kinds of memory
     operations including GCS, MTE allocation tags, memcpy/memset,
     register access, and SIMD operations.

   - Add inverted data source filter (inv_data_src_filter) support to
     exclude certain data sources.

   - Improve documentation.

  Vendor event updates:

   - Intel: Updated event files for Sierra Forest, Panther Lake, Meteor
     Lake, Lunar Lake, Granite Rapids, and others.

   - Arm64: Added metrics for i.MX94 DDR PMU and Cortex-A720AE
     definitions.

   - RISC-V: Added JSON support for T-HEAD C920V2.

  Misc:

   - Improve pointer tracking in data type profiling. It'd give better
     output when the variable is using container_of() to convert type.

   - Annotation support for perf c2c report in TUI. Press 'a' key to
     enter annotation view from cacheline browser window. This will show
     which instruction is causing the cacheline contention.

   - Lots of fixes and test coverage improvements!"

* tag 'perf-tools-for-v6.19-2025-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (214 commits)
  libperf: Use 'extern' in LIBPERF_API visibility macro
  perf stat: Improve handling of termination by signal
  perf tests stat: Add test for error for an offline CPU
  perf stat: When no events, don't report an error if there is none
  perf tests stat: Add "--null" coverage
  perf cpumap: Add "any" CPU handling to cpu_map__snprint_mask
  libperf cpumap: Fix perf_cpu_map__max for an empty/NULL map
  perf stat: Allow no events to open if this is a "--null" run
  perf test kvm: Add some basic perf kvm test coverage
  perf tests evlist: Add basic evlist test
  perf tests script dlfilter: Add a dlfilter test
  perf tests kallsyms: Add basic kallsyms test
  perf tests timechart: Add a perf timechart test
  perf tests top: Add basic perf top coverage test
  perf tests buildid: Add purge and remove testing
  perf tests c2c: Add a basic c2c
  perf c2c: Clean up some defensive gets and make asan clean
  perf jitdump: Fix missed dso__put
  perf mem-events: Don't leak online CPU map
  perf hist: In init, ensure mem_info is put on error paths
  ...
2 parents 37bb2e7 + 2eeb09f commit 9e906a9

246 files changed, +11130 -3670 lines changed

tools/arch/arm64/include/asm/cputype.h

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,7 @@
 #define ARM_CPU_PART_NEOVERSE_V3 0xD84
 #define ARM_CPU_PART_CORTEX_X925 0xD85
 #define ARM_CPU_PART_CORTEX_A725 0xD87
+#define ARM_CPU_PART_CORTEX_A720AE 0xD89
 #define ARM_CPU_PART_NEOVERSE_N3 0xD8E

 #define APM_CPU_PART_XGENE 0x000
@@ -185,6 +186,7 @@
 #define MIDR_NEOVERSE_V3 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V3)
 #define MIDR_CORTEX_X925 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X925)
 #define MIDR_CORTEX_A725 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A725)
+#define MIDR_CORTEX_A720AE MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A720AE)
 #define MIDR_NEOVERSE_N3 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N3)
 #define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
 #define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)

tools/build/Makefile.feature

Lines changed: 0 additions & 2 deletions
@@ -90,7 +90,6 @@ FEATURE_TESTS_BASIC := \
         timerfd \
         zlib \
         lzma \
-        get_cpuid \
         bpf \
         scandirat \
         sched_getcpu \
@@ -146,7 +145,6 @@ FEATURE_DISPLAY ?= \
         llvm-perf \
         zlib \
         lzma \
-        get_cpuid \
         bpf \
         libaio \
         libzstd

tools/build/feature/Makefile

Lines changed: 0 additions & 4 deletions
@@ -56,7 +56,6 @@ FILES= \
          test-lzma.bin \
          test-bpf.bin \
          test-libbpf.bin \
-         test-get_cpuid.bin \
          test-sdt.bin \
          test-cxx.bin \
          test-gettid.bin \
@@ -318,9 +317,6 @@ $(OUTPUT)test-zlib.bin:
 $(OUTPUT)test-lzma.bin:
 	$(BUILD) -llzma

-$(OUTPUT)test-get_cpuid.bin:
-	$(BUILD)
-
 $(OUTPUT)test-bpf.bin:
 	$(BUILD)

tools/build/feature/test-all.c

Lines changed: 0 additions & 5 deletions
@@ -114,10 +114,6 @@
 # include "test-lzma.c"
 #undef main

-#define main main_test_get_cpuid
-# include "test-get_cpuid.c"
-#undef main
-
 #define main main_test_bpf
 # include "test-bpf.c"
 #undef main
@@ -168,7 +164,6 @@ int main(int argc, char *argv[])
 	main_test_pthread_attr_setaffinity_np();
 	main_test_pthread_barrier();
 	main_test_lzma();
-	main_test_get_cpuid();
 	main_test_bpf();
 	main_test_scandirat();
 	main_test_sched_getcpu();

tools/build/feature/test-get_cpuid.c

Lines changed: 0 additions & 8 deletions
This file was deleted.

tools/include/uapi/linux/perf_event.h

Lines changed: 2 additions & 0 deletions
@@ -382,6 +382,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6 120 /* Add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7 128 /* Add: sig_data */
 #define PERF_ATTR_SIZE_VER8 136 /* Add: config3 */
+#define PERF_ATTR_SIZE_VER9 144 /* add: config4 */

 /*
  * 'struct perf_event_attr' contains various attributes that define
@@ -545,6 +546,7 @@ struct perf_event_attr {
 	__u64 sig_data;

 	__u64 config3; /* extension of config2 */
+	__u64 config4; /* extension of config3 */
 };

 /*

tools/lib/perf/cpumap.c

Lines changed: 25 additions & 14 deletions
@@ -368,10 +368,12 @@ struct perf_cpu perf_cpu_map__max(const struct perf_cpu_map *map)
 		.cpu = -1
 	};

-	// cpu_map__trim_new() qsort()s it, cpu_map__default_new() sorts it as well.
-	return __perf_cpu_map__nr(map) > 0
-		? __perf_cpu_map__cpu(map, __perf_cpu_map__nr(map) - 1)
-		: result;
+	if (!map)
+		return result;
+
+	// The CPUs are always sorted and nr is always > 0 as 0 length map is
+	// encoded as NULL.
+	return __perf_cpu_map__cpu(map, __perf_cpu_map__nr(map) - 1);
 }

 /** Is 'b' a subset of 'a'. */
@@ -453,21 +455,33 @@ int perf_cpu_map__merge(struct perf_cpu_map **orig, struct perf_cpu_map *other)
 struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
 					     struct perf_cpu_map *other)
 {
-	struct perf_cpu *tmp_cpus;
-	int tmp_len;
 	int i, j, k;
-	struct perf_cpu_map *merged = NULL;
+	struct perf_cpu_map *merged;

 	if (perf_cpu_map__is_subset(other, orig))
 		return perf_cpu_map__get(orig);
 	if (perf_cpu_map__is_subset(orig, other))
 		return perf_cpu_map__get(other);

-	tmp_len = max(__perf_cpu_map__nr(orig), __perf_cpu_map__nr(other));
-	tmp_cpus = malloc(tmp_len * sizeof(struct perf_cpu));
-	if (!tmp_cpus)
+	i = j = k = 0;
+	while (i < __perf_cpu_map__nr(orig) && j < __perf_cpu_map__nr(other)) {
+		if (__perf_cpu_map__cpu(orig, i).cpu < __perf_cpu_map__cpu(other, j).cpu)
+			i++;
+		else if (__perf_cpu_map__cpu(orig, i).cpu > __perf_cpu_map__cpu(other, j).cpu)
+			j++;
+		else { /* CPUs match. */
+			i++;
+			j++;
+			k++;
+		}
+	}
+	if (k == 0) /* Maps are completely disjoint. */
 		return NULL;

+	merged = perf_cpu_map__alloc(k);
+	if (!merged)
+		return NULL;
+	/* Entries are added to merged in sorted order, so no need to sort again. */
 	i = j = k = 0;
 	while (i < __perf_cpu_map__nr(orig) && j < __perf_cpu_map__nr(other)) {
 		if (__perf_cpu_map__cpu(orig, i).cpu < __perf_cpu_map__cpu(other, j).cpu)
@@ -476,11 +490,8 @@ struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
 			j++;
 		else {
 			j++;
-			tmp_cpus[k++] = __perf_cpu_map__cpu(orig, i++);
+			RC_CHK_ACCESS(merged)->map[k++] = __perf_cpu_map__cpu(orig, i++);
 		}
 	}
-	if (k)
-		merged = cpu_map__trim_new(k, tmp_cpus);
-	free(tmp_cpus);
 	return merged;
 }
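
The reworked perf_cpu_map__intersect() above walks both sorted maps twice: once to count the matching CPUs, and once to fill a result allocated at exactly that size. The following is a minimal standalone sketch of that two-pass idea on plain int arrays. The name intersect_sorted() and its signature are hypothetical and are not part of libperf; the sketch also assumes both inputs are ascending and free of duplicates, as the diff's comments say the CPU maps are.

```c
#include <stdlib.h>

/*
 * Illustrative only: intersect two ascending, duplicate-free int arrays.
 * intersect_sorted() is a hypothetical helper, not a libperf function.
 * Returns a malloc()ed array of *nout elements, or NULL when the inputs
 * are disjoint or the allocation fails (in both cases *nout is 0).
 */
static int *intersect_sorted(const int *a, size_t na,
			     const int *b, size_t nb, size_t *nout)
{
	size_t i = 0, j = 0, k = 0;
	int *out;

	*nout = 0;

	/* Pass 1: count the common elements so the result is sized exactly. */
	while (i < na && j < nb) {
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else {
			i++;
			j++;
			k++;
		}
	}
	if (k == 0)	/* Completely disjoint inputs. */
		return NULL;

	out = malloc(k * sizeof(*out));
	if (!out)
		return NULL;

	/* Pass 2: same walk, now storing matches; output stays sorted. */
	i = j = k = 0;
	while (i < na && j < nb) {
		if (a[i] < b[j])
			i++;
		else if (a[i] > b[j])
			j++;
		else {
			out[k++] = a[i++];
			j++;
		}
	}
	*nout = k;
	return out;
}
```

Counting first trades a second linear walk for the removal of the temporary buffer and the trim/sort step that the old code needed; the caller owns the returned array and frees it with free().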

tools/lib/perf/include/perf/core.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
#include <stdarg.h>
66

77
#ifndef LIBPERF_API
8-
#define LIBPERF_API __attribute__((visibility("default")))
8+
#define LIBPERF_API extern __attribute__((visibility("default")))
99
#endif
1010

1111
enum libperf_print_level {

tools/lib/perf/include/perf/event.h

Lines changed: 13 additions & 0 deletions
@@ -151,6 +151,18 @@ struct perf_record_switch {
 	__u32 next_prev_tid;
 };

+struct perf_record_callchain_deferred {
+	struct perf_event_header header;
+	/*
+	 * This is to match kernel and (deferred) user stacks together.
+	 * The kernel part will be in the sample callchain array after
+	 * the PERF_CONTEXT_USER_DEFERRED entry.
+	 */
+	__u64 cookie;
+	__u64 nr;
+	__u64 ips[];
+};
+
 struct perf_record_header_attr {
 	struct perf_event_header header;
 	struct perf_event_attr attr;
@@ -523,6 +535,7 @@ union perf_event {
 	struct perf_record_read read;
 	struct perf_record_throttle throttle;
 	struct perf_record_sample sample;
+	struct perf_record_callchain_deferred callchain_deferred;
 	struct perf_record_bpf_event bpf;
 	struct perf_record_ksymbol ksymbol;
 	struct perf_record_text_poke_event text_poke;
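
The new perf_record_callchain_deferred record ends in a flexible array of instruction pointers whose length is given by nr. As a rough illustration of what that layout implies for record size, the sketch below mirrors the struct with fixed-width types. The mirror type names and deferred_record_size() are hypothetical; the header layout (32-bit type, 16-bit misc, 16-bit size) is assumed from the UAPI struct perf_event_header, and real code should use the definitions from the perf headers instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for struct perf_event_header (assumed layout). */
struct hdr_mirror {
	uint32_t type;
	uint16_t misc;
	uint16_t size;
};

/* Local stand-in for struct perf_record_callchain_deferred. */
struct deferred_mirror {
	struct hdr_mirror header;
	uint64_t cookie;	/* matches kernel and deferred user stacks */
	uint64_t nr;		/* number of entries in ips[] */
	uint64_t ips[];		/* flexible array member, not counted by sizeof */
};

/* Hypothetical helper: bytes occupied by a record carrying nr entries. */
static size_t deferred_record_size(uint64_t nr)
{
	return sizeof(struct deferred_mirror) + (size_t)nr * sizeof(uint64_t);
}
```

Under these assumptions the fixed part is 24 bytes and every callchain entry adds 8 more.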

tools/perf/Documentation/perf-arm-spe.txt

Lines changed: 95 additions & 9 deletions
@@ -141,27 +141,65 @@ Config parameters
 These are placed between the // in the event and comma separated. For example '-e
 arm_spe/load_filter=1,min_latency=10/'

-  branch_filter=1 - collect branches only (PMSFCR.B)
-  event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
+  event_filter=<mask> - logical AND filter on specific events (PMSEVFR) - see bitfield description below
+  inv_event_filter=<mask> - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below
   jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND)
-  load_filter=1 - collect loads only (PMSFCR.LD)
   min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR)
   pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
   pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
-  store_filter=1 - collect stores only (PMSFCR.ST)
   ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS)
   discard=1 - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD)
+  inv_data_src_filter=<mask> - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering'

 +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
 than only the execution latency.

-Only some events can be filtered on; these include:
-
-  bit 1 - instruction retired (i.e. omit speculative instructions)
+Only some events can be filtered on using 'event_filter' bits. The overall
+filter is the logical AND of these bits, for example if bits 3 and 5 are set
+only samples that have both 'L1D cache refill' AND 'TLB walk' are recorded. When
+FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude
+events that have any (OR) of the filter's bits set. For example setting bits 3
+and 5 in 'inv_event_filter' will exclude any events that are either L1D cache
+refill OR TLB walk. If the same bit is set in both filters it's UNPREDICTABLE
+whether the sample is included or excluded. Filter bits for both event_filter
+and inv_event_filter are:
+
+  bit 1 - Instruction retired (i.e. omit speculative instructions)
+  bit 2 - L1D access (FEAT_SPEv1p4)
   bit 3 - L1D refill
+  bit 4 - TLB access (FEAT_SPEv1p4)
   bit 5 - TLB refill
-  bit 7 - mispredict
-  bit 11 - misaligned access
+  bit 6 - Not taken event (FEAT_SPEv1p2)
+  bit 7 - Mispredict
+  bit 8 - Last level cache access (FEAT_SPEv1p4)
+  bit 9 - Last level cache miss (FEAT_SPEv1p4)
+  bit 10 - Remote access (FEAT_SPEv1p4)
+  bit 11 - Misaligned access (FEAT_SPEv1p1)
+  bit 12-15 - IMPLEMENTATION DEFINED events (when implemented)
+  bit 16 - Transaction (FEAT_TME)
+  bit 17 - Partial or empty SME or SVE predicate (FEAT_SPEv1p1)
+  bit 18 - Empty SME or SVE predicate (FEAT_SPEv1p1)
+  bit 19 - L2D access (FEAT_SPEv1p4)
+  bit 20 - L2D miss (FEAT_SPEv1p4)
+  bit 21 - Cache data modified (FEAT_SPEv1p4)
+  bit 22 - Recently fetched (FEAT_SPEv1p4)
+  bit 23 - Data snooped (FEAT_SPEv1p4)
+  bit 24 - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or
+           IMPLEMENTATION DEFINED event 24 (when implemented, only versions
+           less than FEAT_SPEv1p4)
+  bit 25 - SMCU or external coprocessor operation event when FEAT_SPE_SME is
+           implemented, or IMPLEMENTATION DEFINED event 25 (when implemented,
+           only versions less than FEAT_SPEv1p4)
+  bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4)
+  bit 48-63 - IMPLEMENTATION DEFINED events (when implemented)
+
+For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are
+implemented.
+
+The driver will reject events if requested filter bits require unimplemented SPE
+versions, but will not reject filter bits for unimplemented IMPDEF bits or when
+their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is
+not implemented, filtering on "Not taken event" (bit 6) will be rejected.

 So to sample just retired instructions:

@@ -171,6 +209,31 @@ or just mispredicted branches:

   perf record -e arm_spe/event_filter=0x80/ -- ./mybench

+When set, the following filters can be used to select samples that match any of
+the operation types (OR filtering). If only one is set then only samples of that
+type are collected:
+
+  branch_filter=1 - Collect branches (PMSFCR.B)
+  load_filter=1 - Collect loads (PMSFCR.LD)
+  store_filter=1 - Collect stores (PMSFCR.ST)
+
+When extended filtering is supported (FEAT_SPE_EFT), SIMD and float
+pointer operations can also be selected:
+
+  simd_filter=1 - Collect SIMD loads, stores and operations (PMSFCR.SIMD)
+  float_filter=1 - Collect floating point loads, stores and operations (PMSFCR.FP)
+
+When extended filtering is supported (FEAT_SPE_EFT), operation type filters can
+be changed to AND using _mask fields. For example samples could be selected if
+they are store AND SIMD by setting 'store_filter=1,simd_filter=1,
+store_filter_mask=1,simd_filter_mask=1'. The new masks are as follows:
+
+  branch_filter_mask=1 - Change branch filter behavior from OR to AND (PMSFCR.Bm)
+  load_filter_mask=1 - Change load filter behavior from OR to AND (PMSFCR.LDm)
+  store_filter_mask=1 - Change store filter behavior from OR to AND (PMSFCR.STm)
+  simd_filter_mask=1 - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm)
+  float_filter_mask=1 - Change floating point filter behavior from OR to AND (PMSFCR.FPm)
+
 Viewing the data
 ~~~~~~~~~~~~~~~~~
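
The 'event_filter' and 'inv_event_filter' values described in the documentation hunks above are plain bitmasks built from the listed bit numbers. The sketch below shows one way such a mask could be computed; the enum names and spe_filter_mask() are hypothetical helpers for illustration, and only the bit positions are taken from the documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions as listed in the event_filter table (names are made up). */
enum {
	SPE_EVT_RETIRED    = 1,
	SPE_EVT_L1D_REFILL = 3,
	SPE_EVT_TLB_REFILL = 5,
	SPE_EVT_MISPREDICT = 7,
};

/* Hypothetical helper: OR together the given bit numbers into a mask. */
static uint64_t spe_filter_mask(const int *bits, size_t n)
{
	uint64_t mask = 0;

	for (size_t i = 0; i < n; i++) {
		/* Ignore anything outside the 64-bit filter register. */
		if (bits[i] >= 0 && bits[i] < 64)
			mask |= UINT64_C(1) << bits[i];
	}
	return mask;
}
```

With these definitions, a mask built from bit 7 alone is 0x80, matching the mispredicted-branch example in the documentation, and bits 3 and 5 together give 0x28.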

@@ -210,6 +273,10 @@ Memory access details are also stored on the samples and this can be viewed with

   perf report --mem-mode

+The latency value from the SPE sample is stored in the 'weight' field of the
+Perf samples and can be displayed in Perf script and report outputs by enabling
+its display from the command line.
+
 Common errors
 ~~~~~~~~~~~~~

@@ -253,6 +320,25 @@ to minimize output. Then run perf stat:
   perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null &
   perf stat -e SAMPLE_FEED_LD

+Data source filtering
+~~~~~~~~~~~~~~~~~~~~~
+
+When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to
+filter on a subset (0 - 63) of possible data source IDs. The full range of data
+sources is 0 - 65535 although these are unlikely to be used in practice. Data
+sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the
+filter maps to data source N. The filter is an OR of all the bits, and the value
+provided inv_data_src_filter is inverted before writing to PMSDSFR_EL1 so that
+set bits exclude that data source and cleared bits include that data source.
+Therefore the default value of 0 is equivalent to no filtering (all data sources
+included).
+
+For example, to include only data sources 0 and 3, clear bits 0 and 3
+(0xFFFFFFFFFFFFFFF6)
+
+When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any
+data source set are excluded.
+
 SEE ALSO
 --------
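
Because 'inv_data_src_filter' is an exclusion mask, the value for "include only these data sources" is the bitwise complement of the set of wanted bits. A small sketch of that arithmetic follows; spe_inv_data_src_filter() is a hypothetical helper written for this illustration, not a perf or driver function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the data source IDs (0-63) to keep, return
 * the inv_data_src_filter value that excludes every other source.
 */
static uint64_t spe_inv_data_src_filter(const unsigned int *keep, size_t n)
{
	uint64_t wanted = 0;

	for (size_t i = 0; i < n; i++) {
		if (keep[i] < 64)	/* only sources 0-63 are filterable */
			wanted |= UINT64_C(1) << keep[i];
	}
	/* Set bits exclude a source, so invert the set of wanted sources. */
	return ~wanted;
}
```

Keeping sources 0 and 3 yields 0xFFFFFFFFFFFFFFF6, the value quoted in the documentation. Note that an empty keep list produces 0xFFFFFFFFFFFFFFFF (everything excluded), which is different from the default of 0 (nothing excluded).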
