perf和火焰图使用方法

perf是linux上的性能分析工具，perf可以对event进行统计得到event的发生次数，或者对event进行采样，得到每次event发生时的相关数据(cpu、进程id、运行栈等)，利用这些数据来对程序性能进行分析。

perf可以统计或采样的event有很多，如果我们要分析cpu，那么我们可以使用cpu-cycles、cpu-clock来衡量占用cpu的程序的分布情况，还可以通过cache-misses、page-faults、branch-misses等event来分析造成cpu占用高的底层原因，确定原因后方便优化。

如果我们要分析内存、io、网络等，也可以通过其他event来进行分析，perf可以使用的event非常多，如果要使用perf来分析问题，就需要了解问题相关的event有哪些，作用是什么，这是使用perf的一个门槛。

perf工作大致可以分成三种模式：

counter

sampling

采样频率

perf可以使用的event非常多，上图是Brendan Gregg的文章中找到的一张图，画出了perf可以使用的event的结构图，大致可以分为以下几类：

CPU

软件

安装

x86安装

sudo apt install linux-tools-common
sudo apt install linux-tools-generic
sudo apt install linux-tools-5.4.0-137-generic

交叉编译

由于我们经常是在自己编译的内核上进行开发工作，进入linux内核源码目录linux/tools/perf。

➜  tools git:(firefly) ✗ make CROSS_COMPILE=/home/zhongyi/code/rk3399_linux_release_v2.5.1_20210301/prebuilts/gcc/linux-x86/aarch64/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu- ARCH=arm WERROR=0 perf V=1

可能在编译的时候，有报错大概是由于平台问题，数据类型不匹配，导致所有的warning都被当作error对待：出现这问题的原因是-Werror这个gcc编译选项。只要在makefile中找到包含这个-Werror选项的句子，将-Werror删除，或是注释掉就行了

编译完成后将会在当前目录下生成perf可执行文件，拷贝到设备上即可运行。

root@firefly:~/mnt# ./perf --help

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   inject          Filter to augment the events stream with additional information
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

使用方法

总览

上图整理了perf的子命令之间的关系，常用的有：

perf record —— 采样，生成perf.data二进制文件perf annotate/perf report/perf script —— 分析perf.data文件，annotate可以查看代码，report可以统计分析，script是直接转化成文本格式perf stat —— counter，统计event的出现次数perf top —— 整个系统的分析，类似于top命令，但可以具体到函数，可以指定event

下面我们介绍一些常用的使用方法。

help

perf --help之后可以看到perf的一级命令。

root@firefly:~/mnt# ./perf --help

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   inject          Filter to augment the events stream with additional information
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

perf command --help 可以看到二级命令的帮助命令。

root@firefly:~/mnt# ./perf stat -h

 Usage: perf stat [<options>] [<command>]

    -a, --all-cpus        system-wide collection from all CPUs
    -A, --no-aggr         disable CPU count aggregation
    -B, --big-num         print large numbers with thousands' separators
    -C, --cpu <cpu>       list of cpus to monitor in system-wide
    -c, --scale           scale/normalize counters
    -D, --delay <n>       ms to wait before starting measurement after program s
    -d, --detailed        detailed run - start a lot of events
    -e, --event <event>   event selector. use 'perf list' to list available even
    -G, --cgroup <name>   monitor event in cgroup name only
    -g, --group           put the counters into a counter group
    -I, --interval-print <n>
                          print counts at regular interval in ms (>= 10)
    -i, --no-inherit      child tasks do not inherit counters
    -n, --null            null run - dont start any counters
    -o, --output <file>   output file name
    -p, --pid <pid>       stat events on existing process id
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, f
    -S, --sync            call sync() before starting a run
    -t, --tid <tid>       stat events on existing thread id
    -T, --transaction     hardware transaction statistics

下面对一级命令作一个解释

序号	命令	作用
1	annotate	解析perf record生成的perf.data文件，显示被注释的代码。
2	archive	根据数据文件记录的build-id，将所有被采样到的elf文件打包。利用此压缩包，可以再任何机器上分析数据文件中记录的采样数据。
3	bench	perf中内置的benchmark，目前包括两套针对调度器和内存管理子系统的benchmark。
4	buildid-cache	管理perf的buildid缓存，每个elf文件都有一个独一无二的buildid。buildid被perf用来关联性能数据与elf文件。
5	buildid-list	列出数据文件中记录的所有buildid。
6	diff	对比两个数据文件的差异。能够给出每个符号（函数）在热点分析上的具体差异。
7	evlist	列出数据文件perf.data中所有性能事件。
8	inject	该工具读取perf record工具记录的事件流，并将其定向到标准输出。在被分析代码中的任何一点，都可以向事件流中注入其它事件。
9	kmem	针对内核内存（slab）子系统进行追踪测量的工具
10	kvm	用来追踪测试运行在KVM虚拟机上的Guest OS。
11	list	列出当前系统支持的所有性能事件。包括硬件性能事件、软件性能事件以及检查点。
12	lock	分析内核中的锁信息，包括锁的争用情况，等待延迟等。
13	mem	内存存取情况
14	record	收集采样信息，并将其记录在数据文件中。随后可通过其它工具对数据文件进行分析。
15	report	读取perf record创建的数据文件，并给出热点分析结果。
16	sched	针对调度器子系统的分析工具。
17	script	执行perl或python写的功能扩展脚本、生成脚本框架、读取数据文件中的数据信息等。
18	stat	执行某个命令，收集特定进程的性能概况，包括CPI、Cache丢失率等。
19	test	perf对当前软硬件平台进行健全性测试，可用此工具测试当前的软硬件平台是否能支持perf的所有功能。
20	timechart	针对测试期间系统行为进行可视化的工具
21	top	类似于linux的top命令，对系统性能进行实时分析。
22	trace	关于syscall的工具。
23	probe	用于定义动态检查点。

全局性概况：

perf list查看当前系统支持的性能事件；

perf bench对系统性能进行摸底；

perf test对系统进行健全性测试；

perf stat对全局性能进行统计；

全局细节：

perf top可以实时查看当前系统进程函数占用率情况；

perf probe可以自定义动态事件；

特定功能分析：

perf kmem针对slab子系统性能分析；

perf kvm针对kvm虚拟化分析；

perf lock分析锁性能；

perf mem分析内存slab性能；

perf sched分析内核调度器性能；

perf trace记录系统调用轨迹；

最常用功能perf record，可以系统全局，也可以具体到某个进程，更甚具体到某一进程某一事件；可宏观，也可以很微观。

pref record记录信息到perf.data；

perf report生成报告；

perf diff对两个记录进行diff；

perf evlist列出记录的性能事件；

perf annotate显示perf.data函数代码；

perf archive将相关符号打包，方便在其它机器进行分析；

perf script将perf.data输出可读性文本；

可视化工具perf timechart

perf timechart record记录事件；

perf timechart生成output.svg文档；

list

使用perf之前肯定要知道perf能监控哪些性能指标吧？那么就要使用perf list进行查看，通常使用的指标是cpu-clock/task-clock等，具体要根据需要来判断。

root@firefly:~/mnt# perf list

List of pre-defined events (to be used in -e):

  rNNN                                               [Raw hardware event descrip
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descrip
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]

  android_fs:android_fs_dataread_end                 [Tracepoint event]
  android_fs:android_fs_dataread_start               [Tracepoint event]
  android_fs:android_fs_datawrite_end                [Tracepoint event]
  android_fs:android_fs_datawrite_start              [Tracepoint event]
  asoc:snd_soc_bias_level_done                       [Tracepoint event]
  asoc:snd_soc_bias_level_start                      [Tracepoint event]
  asoc:snd_soc_dapm_connected                        [Tracepoint event]
  asoc:snd_soc_dapm_done                             [Tracepoint event]
  asoc:snd_soc_dapm_path                             [Tracepoint event]
  asoc:snd_soc_dapm_start                            [Tracepoint event]
  asoc:snd_soc_dapm_walk_done                        [Tracepoint event]
  asoc:snd_soc_dapm_widget_event_done                [Tracepoint event]
  asoc:snd_soc_dapm_widget_event_start               [Tracepoint event]
  asoc:snd_soc_dapm_widget_power                     [Tracepoint event]
  asoc:snd_soc_jack_irq                              [Tracepoint event]
  asoc:snd_soc_jack_notify                           [Tracepoint event]
  asoc:snd_soc_jack_report                           [Tracepoint event]
  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
  block:block_bio_queue                              [Tracepoint event]
  block:block_bio_remap                              [Tracepoint event]
  block:block_dirty_buffer                           [Tracepoint event]
  block:block_getrq                                  [Tracepoint event]
  ......

具体监控哪个变量的话，譬如使用后面的perf report工具，则加**-e 监控指标**，如监控运行ls命令时的cpu时钟占用：

perf report -e cpu-clock ls

event

不同内核版本列出的结果不一样多。不过基本是够用的，但是无论多少，我们可以基本将其分为三类

一些事件只是单纯的内核计数器，这种情况下，他们被称为software events。例如，上下文切换。

事件的另一个来源是处理器本身及其性能监控单元(Performance Monitoring Unit，PMU)。它提供了一个事件列表来衡量微架构事件，如周期数、指令异常、L1缓存未命中等。这些事件被称为PMU硬件事件（ PMU hardware events）或简称为硬件事件（hardware events）。这些事件因每种处理器类型和型号而异。

perf_events接口还提供了一小组常见的硬件事件名字对象。在每个处理器上，这些事件被映射到CPU提供的实际事件上，只有映射成立即实际事件存在时，这些事件才能被使用。这些事件也被称为硬件事件（hardware events）和硬件缓存事件（ hardware cache events）。

还有一些 tracepoint events 是依赖于ftrace架构实现的，这些只有在2.6.3x以上的内核才可以使用。

一个事件可以有子事件(或 unit masks)。在某些处理器上，对于某些事件，可以将 unit masks组合使用并测量任一子事件发生的时间。

/sys/kernel/debug/tracing/available_events，可查看当前系统的所有tracepoint分成了几大类：

ext4：文件系统的tracepoint events，如果是其它文件系统，比如XFS，也有对应的tracepoint event;
jbd2 ：文件日志的tracepoint events;
skb： 内存的tracepoint events;
net,napi,sock,udp：网络的tracepoint events;
scsi, block, writeback：磁盘IO
kmem：内存
sched： 调度
syscalls： 系统调用

属性

用户如果想要使用高精度采样，需要在指定性能事件时，在事件名后添加后缀:p或:pp。Perf在采样精度上定义了4个级别，如下所示。

0 ：无精度保证
1 ：采样指令与触发性能事件的指令之间的偏差为常数(:p)
2 ：需要尽量保证采样指令与触发性能事件的指令之间的偏差为0(:pp)
3 ：保证采样指令与触发性能事件的指令之间的偏差必须为0(:ppp)

目前的X86处理器，包括Intel处理器与AMD处理器均仅能实现前 3 个精度级别。

除了精度级别以外，性能事件还具有其它几个属性，均可以通过event:X的方式予以指定。

u 仅统计用户空间程序触发的性能事件
k 仅统计内核触发的性能事件
h 仅统计Hypervisor触发的性能事件
G 在KVM虚拟机中，仅统计Guest系统触发的性能事件
H 仅统计 Host 系统触发的性能事件
p 精度级别

stat

perf stat 分析系统/进程的整体性能概况。

命令解析

-a, --all-cpus  采集所有CPU的信息
-A, --no-aggr   不要在system-wide(-a)模式下汇集所有CPU的计数信息
-B, --big-num   保留三位小数
-C, --cpu <cpu>  指定某个cpu
-D, --delay <n>  在测试程序开始后，在测量前等等 n ms
-d, --detailed   打印更详细的统计数据，最多可以指定3次
     -d:          detailed events, L1 and LLC data cache
        -d -d:     more detailed events, dTLB and iTLB events
     -d -d -d:     very detailed events, adding prefetch events
     
-e, --event <event>  事件选择。可以参考perf list。
-G, --cgroup <name>  仅在name为cgroup时有效。
-g, --group        将计数器放到一个计数组中   
-I, --interval-print <n>  每n毫秒打印计数增量(最小值:10ms).在某些情况下，开销可能很高，例如小于100毫秒的间隔。
-i, --no-inherit  禁止子任务继承父任务的性能计数器。
-M, --metrics <metric/metric group list>  监视指定的 metrics 或   metric groups，以逗号分隔。
-n, --null   仅输出目标程序的执行时间，而不开启任何性能计数器。
-o, --output <file>   输出文件的名字
-p, --pid <pid>      指定待分析的进程id
-r, --repeat <n>  重复执行 n 次目标程序，并给出性能指标在n 次执行中的变化范围。
-S, --sync        在开始前调用sync() 
-t, --tid <tid>   指定待分析的线程id
-T, --transaction    如果支持，打印事务执行的统计数据。
-v, --verbose      显示详细信息
-x, --field-separator <separator>   使用CSV样式的输出打印计数，以便直接导入表格。列由SEP中指定的字符串分隔。

举例

ubuntu# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.868718 s, 589 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

            869.31 msec task-clock                #    0.999 CPUs utilized          
                 2      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                71      page-faults               #    0.082 K/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

       0.870022180 seconds time elapsed

       0.450870000 seconds user
       0.418950000 seconds sys

如果没有指定那个事件，perf stat将收集上面列出的常见事件。比如，上下文切换，CPU迁移次数，缺页故障等。

task‐clock：事件表示目标任务真正占用处理器的时间，单位是毫秒。也称任务执行时间。CPUs utilized = task-clock / time elapsed，CPU的占用率。

context-switches：程序在运行过程中上下文的切换次数。

CPU-migrations：程序在运行过程中发生的处理器迁移次数。Linux为了维持多个处理器的负载均衡，在特定条件下会将某个任务从一个CPU迁移到另一个CPU。

CPU迁移和上下文切换：发生上下文切换不一定会发生CPU迁移，而发生CPU迁移时肯定会发生上下文切换。发生上下文切换有可能只是把上下文从当前CPU中换出，下一次调度器还是将进程安排在这个CPU上执行。

page-faults：缺页异常的次数。当应用程序请求的页面尚未建立、请求的页面不在内存中，或者请求的页面虽然在内存中，但物理地址和虚拟地址的映射关系尚未建立时，都会触发一次缺页异常。另外TLB不命中，页面访问权限不匹配等情况也会触发缺页异常。

cycles：消耗的处理器周期数。如果把被ls使用的cpu cycles看成是一个处理器的，那么它的主频为2.486GHz。可以用cycles / task-clock算出。

stalled-cycles-frontend：指令读取或解码的质量步骤，未能按理想状态发挥并行左右，发生停滞的时钟周期。

stalled-cycles-backend：指令执行步骤，发生停滞的时钟周期。

instructions：执行了多少条指令。IPC为平均每个cpu cycle执行了多少条指令。

branches：遇到的分支指令数。branch-misses是预测错误的分支指令数。

branch‐misses：是预测错误的分支指令数。

XXX seconds time elapsed：系程序持续时间

每次运行性能工具时，可以测量一个或多个事件。事件使用其符号名称，后跟可选的单元掩码和修饰符来指定。事件名称、单元掩码和修饰符不区分大小写。

perf stat -e cpu-clock dd if=/dev/zero of=/dev/null count=100000

默认情况下，会在用户和内核级别测量事件。如果仅在用户级别进行测量，需要传递一个修饰符:

perf stat -e cpu-clock:u dd if=/dev/zero of=/dev/null count=100000

如果即在用户态测量，又在内核态测量，则可以同时传递uk参数

perf stat -e cpu-clock:uk dd if=/dev/zero of=/dev/null count=100000

ls命令执行了多少次系统调用

perf stat -e syscalls:sys_enter_exit ls

只显示任务执行时间，不显示性能计数器

 perf stat -n ls > /dev/null

record

记录一段时间内系统/进程的性能时间。

命令解析

-a, --all-cpus        system-wide collection from all CPUs
-b, --branch-any      sample any taken branches
-B, --no-buildid      do not collect buildids in perf.data
-c, --count <n>     事件的采样周期
-C, --cpu <cpu>      只采集指定CPU数据
-d, --data           记录采样地址
-D, --delay <n>      在测试程序开始后，在测量前等等 n ms
-F      指定采样频率
-e, --event <event>  选择性能事件
-F, --freq <freq or 'max'> 指定频率
-g        记录函数间的调用关系
-G, --cgroup <name>   仅仅监视指定的cgroup name
-I, --intr-regs[=<any register>]  每n毫秒打印计数增量(最小值:10ms).在某些情况下，开销可能很高，例如小于100毫秒的间隔。
-i, --no-inherit     禁止子任务继承父任务的性能计数器。
-j, --branch-filter <branch filter mask>  启用分支堆栈采样。每个样本捕获一系列连续的采样分支。
-k, --clockid <clockid> 设置用于perf_event_type中各种时间字段的时钟id记录。请参见clock_gettime()。
-m, --mmap-pages <pages[,pages]> mmap数据页面和辅助区域跟踪mmap页面的数量
-N, --no-buildid-cache 不要更新buildid缓存
-n, --no-samples     不要采样
-o, --output <file>  指定输出文件，默认为perf.data
-P, --period         采样周期
-p, --pid <pid>     指定进程id
-q, --quiet         不打印任何信息
-R, --raw-samples   从所有打开的计数器收集原始样本记录
-r, --realtime <n>   以 SCHED_FIFO 优先级实时收集数据
-S, --snapshot[=<opts>] 快照模式
-s, --stat         记录每个线程的事件计数，使用perf report -T 查看可选值
-t, --tid <tid>    在现有线程ID上记录事件(逗号分隔列表)
-T, --timestamp     记录采样时间戳。使用 perf report -D查看更详细信息
-u, --uid <user>   指定用户id
-W, --weight       启用加权采样

举例

记录执行ls时的性能数据

perf record ls -g

记录执行ls时的系统调用，可以知道哪些系统调用最频繁

perf record -e syscalls:sys_enter ls

report

读取perf record生成的数据文件，并显示分析数据。

-p<regex>：用指定正则表达式过滤调用函数
-e <event>：指定性能事件（可以是多个，用,分隔列表）
-p <pid>：指定待分析进程的 pid（可以是多个，用,分隔列表）
-t <tid>：指定待分析线程的 tid（可以是多个，用,分隔列表）
-u <uid>：指定收集的用户数据，uid为名称或数字
-a：从所有 CPU 收集系统数据
-C <cpu-list>：只统计指定 CPU 列表的数据，如：0,1,3或1-2
-r <RT priority>：perf 程序以SCHED_FIFO实时优先级RT priority运行这里填入的数值越大，进程优先级越高（即 nice 值越小）
-c <count>： 事件每发生 count 次采一次样
-F <n>：每秒采样 n 次
-o <output.data>：指定输出文件output.data，默认输出到perf.data
-i：输入的数据文件
-v：显示每个符号的地址
-d <dos>：只显示指定dos的符号
-S：只考虑指定符号
-U：只显示已解析的符号
-g[type,min,order]：显示调用关系，具体等同于perf top命令中的-g
-c：只显示指定cpu采样信息
-M：以指定汇编指令风格显示
–source：以汇编和source的形式进行显示

举例

记录执行ls时的性能数据

perf record ls -g

显示

perf report -i perf.data

overhead：cpu-clock占用百分比

command：当前执行的命令

shared object ：依赖的共享库

symbol：当前占用比下对应的符号

[.]代表该调用属于用户态，若自己监控的进程为用户态进程，那么这些即主要为用户态的cpu-clock占用的数值，[k]代表属于内核态的调用。

也许有的人会奇怪为什么自己完全是一个用户态的程序为什么还会统计到内核态的指标？

一是用户态程序运行时会受到内核态的影响，若内核态对用户态影响较大，统计内核态信息可以了解到是内核中的哪些行为导致对用户态产生影响；二则是有些用户态程序也需要依赖内核的某些操作，譬如I/O操作

annotate

perf annotate提供指令级别的record文件定位。使用调试信息-g编译的文件能够显示汇编和本身源码信息。

但要注意， annotate命令并不能够解析内核image中的符号，必须要传递未压缩的内核image给annotate才能正常的解析内核符号，比如：

perf annotate -k /tmp/vmlinux -d symbol

命令解析

-i:输入文件名
-d:只考虑这些DSO中的符号
-f:强制读取
-D：转储ASCII中的原始跟踪
-k： vmlinux路径名
-m：加载模块符号表.仅与-k和一起使用
-l:打印匹配到的源代码行
-P：显示完整路径名
-M 指定反汇编程序样式
-stdio：使用stdio接口
-gtk：使用GTK接口

举例

main.c内容如下：

#include <stdio.h>
#include <time.h>
void func_a() {
   unsigned int num = 1;
   int i;
   for (i = 0;i < 10000000; i++) {
      num *= 2;
      num = 1;
   }
}
void func_b() {
   unsigned int num = 1;
   int i;
   for (i = 0;i < 10000000; i++) {
      num <<= 1;
      num = 1;
   }
}
int main() {
   func_a();
   func_b();
   return 0;
}

编译命令：

gcc -g -O0 main.c #-g是debug信息，保留符号表等；-O0表示不进行优化处理

统计命令：

perf record -a -g ./a.out

perf report查看结果：

Samples: 73  of event 'cpu-clock', Event count (approx.): 18250000       
  Children      Self  Command  Shared Object      Symbol    
+   97.26%     0.00%  a.out    a.out              [.] main 
+   97.26%     0.00%  a.out    libc-2.19.so       [.] __libc_start_main 
+   49.32%    49.32%  a.out    a.out              [.] func_a 
+   47.95%    47.95%  a.out    a.out              [.] func_b 
+    1.37%     1.37%  perf     [kernel.kallsyms]  [k] finish_task_switch  
+    1.37%     0.00%  a.out    ld-2.19.so         [.] dl_main

perf annotate查看结果：

func_a  /home/goodboy/tmp/a.out           
       │    void func_a() {
       │      push   %rbp
       │      mov    %rsp,%rbp
       │       unsigned int num = 1;
       │      movl   $0x1,-0x8(%rbp)
       │       int i;
       │       for (i = 0;i < 10000000; i++) {
       │      movl   $0x0,-0x4(%rbp)
       │    ↓ jmp    22
       │          num *= 2;
 11.11 │14:┌─→shll   -0x8(%rbp)
       │   │      num = 1;
       │   │  movl   $0x1,-0x8(%rbp)
       │   │#include <stdio.h>
       │   │#include <time.h>
       │   │void func_a() {
       │   │   unsigned int num = 1;
       │   │   int i;
       │   │   for (i = 0;i < 10000000; i++) {
  5.56 │   │  addl   $0x1,-0x4(%rbp)
 33.33 │22:│  cmpl   $0x98967f,-0x4(%rbp)
 50.00 │   └──jle    14
       │          num *= 2;
       │          num = 1;
       │       }
       │    }
       │      pop    %rbp
       │    ← retq

top

实时显示系统/进程的性能统计信息

命令解析

-e：指定性能事件
-a：显示在所有CPU上的性能统计信息
-d：界面的刷新周期，默认为2s。
-C：显示在指定CPU上的性能统计信息
-p：指定进程PID
-t：指定线程TID
-K：隐藏内核统计信息
-k：带符号表的内核映像所在的路径。
-U：隐藏用户空间的统计信息
-s：指定待解析的符号信息
-g：得到函数的调用关系图。
‘‐G’ or‘‐‐call‐graph’ <output_type,min_percent,call_order>
graph: 使用调用树，将每条调用路径进一步折叠。这种显示方式更加直观。
每条调用路径的采样率为绝对值。也就是该条路径占整个采样域的比率。
fractal
默认选项。类似与 graph，但是每条路径前的采样率为相对值。
flat
不折叠各条调用
选项 call_order 用以设定调用图谱的显示顺序，该选项有 2个取值，分别是
callee 与caller。
将该选项设为callee 时，perf按照被调用的顺序显示调用图谱，上层函数被下层函数所调用。
该选项被设为caller 时，按照调用顺序显示调用图谱，即上层函数调用了下层函数路径，也不显示每条调用路径的采样率

举例

显示分配高速缓存最多的函数

perf top -e kmem:kmem_cache_alloc

显示内核和模块中，消耗最多CPU周期的函数

perf top -e cycles:k

第一列：符号引发的性能事件的比例，默认指占用的cpu周期比例。

第二列：符号所在的DSO(Dynamic Shared Object)，可以是应用程序、内核、动态链接库、模块。

第三列：DSO的类型。[.]表示此符号属于用户态的ELF文件，包括可执行文件与动态链接库)。[k]表述此符号属于内核或模块。

第四列：符号名。有些符号不能解析为函数名，只能用地址表示。

bench

perf bench作为benchmark工具的通用框架，包含sched/mem/numa/futex等子系统，all可以指定所有。

perf bench可用于评估系统sched/mem等特定性能。

命令解析

-f, --format <default|simple> 选择输出格式，simple模式下只显示测量时间
-r, --repeat <n>      指定重复运行的次数

子系统包括

sched：调度器和IPC机制。包含messaging和pipe两个功能。

mem：内存存取性能。包含memcpy和memset两个功能。

numa：NUMA架构的调度和内存处理性能。包含mem功能。

futex：futex压力测试。包含hash/wake/wake-parallel/requeue/lock-pi功能。

all：所有bench测试的集合

举例

sched messaging评估进程调度和核间通信，sched message 是从经典的测试程序 hackbench 移植而来，用来衡量调度器的性能，overhead 以及可扩展性。

该 benchmark 启动 N 个 reader/sender 进程或线程对，通过 IPC(socket 或者 pipe) 进行并发的读写。一般人们将 N 不断加大来衡量调度器的可扩展性。

sched message 的用法及用途和 hackbench 一样，可以通过修改参数进行不同目的测试：

-g, --group <n> Specify number of groups
-l, --nr_loops <n> Specify the number of loops to run (default: 100)
-p, --pipe Use pipe() instead of socketpair()
-t, --thread Be multi thread instead of multi process

ubuntu# perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.077 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 27.550 [sec]

      27.550083 usecs/op
          36297 ops/sec

使用pipe()和socketpair()对测试影响：

ubuntu# perf bench sched messaging
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.071 [sec]
ubuntu# perf bench sched messaging -p
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.069 [sec]
ubuntu#

可见socketpair()性能要明显低于pipe()。

使用perf分析完整例子

下面我们举一个具体的例子来看下perf的使用方法。

//t1.c 
void longa() 
{
   
  int i,j; 
  for(i = 0; i < 1000000; i++) 
  j=i; //am I silly or crazy? I feel boring and desperate. 
}  
 
void foo2() 
{
   
  int i; 
  for(i=0 ; i < 10; i++) 
       longa(); 
} 
 
void foo1() 
{
   
  int i; 
  for(i = 0; i< 100; i++) 
     longa(); 
} 
 
int main(void) 
{
while(1)
{
  foo1(); 
  foo2();
} 
}

总揽全局

先用 perf stat 查看下程序整体性能情况，该工具主要是从全局上监控，可以看到程序导致性能瓶颈主要是什么原因。perf stat通过概括精简的方式提供被调试程序运行的整体情况和汇总数据。

ubuntu# perf stat ./perf_test
^C./perf_test: Interrupt

 Performance counter stats for './perf_test':

          8,659.24 msec task-clock                #    1.000 CPUs utilized          
                21      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                43      page-faults               #    0.005 K/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

       8.660065455 seconds time elapsed

       8.659661000 seconds user
       0.000000000 seconds sys

task-clock ：指程序运行期间占用了8,659.24 msec的任务时钟周期，该值高，说明程序的多数时间花费在 CPU 计算上而非 IO。

context-switches ：表示程序运行期间进行了21次上下文切换。记录了程序运行过程中发生了多少次进程切换。

page-faults ：是指程序发生了 43次缺页错误。

通过perf stat获得了程序性能瓶颈类型后，已经知道哪个进程需要优化，若不知道则需要使用perf top进行进一步监控。

精准导航

下一步就是对该进程进行细粒度的分析，分析在长长的程序代码中究竟是哪几段代码、哪几个函数需要修改呢?

perf record -e cpu-clock -g  ./perf_test

-g选项是告诉perf record额外记录函数的调用关系，-e cpu-clock 指perf record监控的指标为cpu周期，程序运行完之后，perf record会生成一个名为perf.data的文件。

可视化分析

前面通过perf record工具获得了某一进程的指标监控数据perf.data，下面就需要使用perf report工具查看该文件。

perf report -i perf.data

Self：是最后一列的符号（可以理解为函数）本身所占比例。

Children ：是这个符号调用的其他符号（可以理解为子函数，包括直接和间接调用）占用的比例之和。

[.]：代表该调用属于用户态，若自己监控的进程为用户态进程，那么这些即主要为用户态的cpu-clock占用的数值，[k]代表属于内核态的调用。

我们可以看到longa符号占用了perf_test程序的99%的CPU资源。

通过方向键和回车，可以看到函数的调用关系，同时以汇编代码的形式展示资源的消耗情况。

addl $0x1,-0x8(%rbp)
cmpl $0xf423f,-0x8(%rbp)

这两句汇编代码，先将0x8(%rbp)加一，然后和一个常数进行比较，占据了63.5%的资源。

查看源代码可以发现做了一次1000000次的for循环。接着以同样的方法，可以发现foo1() 也是一个潜在的调优对象，为什么要调用 100 次那个无聊的 longa() 函数呢。

火焰图

on-cpu火焰图可以用于分析cpu是被哪些线程、哪些函数占用的，可以方便的找到热点代码便于后续分析优化。下面我们介绍下火焰图的生成和使用方法。

使用方法

准备FlameGraph工具。

git clone https://github.com/brendangregg/FlameGraph.git

用perf record采集CPU信息。

perf record -e cpu-clock -g  ./perf_test

Ctrl+c结束执行后，在当前目录下会生成采样数据perf.data。

用perf script工具对perf.data进行解析。

perf script -i perf.data &> perf.unfold

将perf.unfold中的符号进行折叠。

./stackcollapse-perf.pl perf.unfold &> perf.folded

最后生成svg图。

./flamegraph.pl perf.folded > perf.svg

perf.svg 用浏览器就可以打开

火焰图解读

y 轴表示调用栈，每一层都是一个函数。调用栈越深，火焰就越高，顶部就是正在执行的函数，下方都是它的父函数。

x 轴表示抽样数，如果一个函数在 x 轴占据的宽度越宽，就表示它被抽到的次数多，即执行的时间长。注意，x 轴不代表时间，而是所有的调用栈合并后，按字母顺序排列的。

火焰图就是看顶层的哪个函数占据的宽度最大。只要有"平顶"（plateaus），就表示该函数可能存在性能问题。比如图中的longa()函数正是问题所在点。

颜色没有特殊含义，因为火焰图表示的是 CPU 的繁忙程度，所以一般选择暖色调。

互动

火焰图是SVG 图片，可以与用户互动。

火焰的每一层都会标注函数名，鼠标悬浮时会显示完整的函数名、抽样抽中的次数、占据总抽样次数的百分比。下面是一个例子。

在某一层点击，火焰图会水平放大，该层会占据所有宽度，显示详细信息。

按下 Ctrl + F 会显示一个搜索框，用户可以输入关键词或正则表达式，所有符合条件的函数名会高亮显示。

其他

还有几个火焰图，就不介绍了，可以去看brendang regg的网站，简单说一下：

off-cpu相关：

off-cpu flame graphs —— 与on-cpu互补，on-cpu只能看到运行情况，但如果某个请求运行慢，可能是被阻塞导致，那么就需要分析阻塞点在代码的哪个位置，off-cpu就是画出每个阻塞点的阻塞时间，用于分析这个问题。Wakeup flame graphs —— off-cpu的进一步，off-cpu画出了阻塞点，但不知道阻塞是被谁唤醒的，wakeup通过分析唤醒阻塞点的线程栈，就可以知道是在哪里进行的唤醒，从而分析唤醒慢的原因。Chain graphs —— off-cpu和wakeup火焰图画出了阻塞点、唤醒点，但两者之间的关系并没有，也就是不知道唤醒点是唤醒哪个阻塞点，chain graph就是解决这个问题

其他

Hot/Cold Flame Graphs —— 就是讲on-cpu与off-cpu结合，在一张图上显示，这样可以清晰的看到on和off各自的比例Differential Flame Graphs —— 对比两个数据，画出来的图上显示变化情况，也就是相对之前的数据，每个部分占用是变高还是变低

总结

使用perf+FlameGraph可以清晰的了解程序on-cpu运行时间占比，可以高效的了解程序性能，这种方法对我们了解程序运行过程具有重要指导作用。要善于使用工具帮助我们分析复杂问题。

本文参考

https://www.cnblogs.com/arnoldlu/p/6241297.html

https://www.cnblogs.com/lizhaolong/p/16437171.html

https://www.coonote.com/vim-note/perf-usage.html

https://developer.aliyun.com/article/131443

https://blog.csdn.net/qq_15437667/article/details/50724330

https://perf.wiki.kernel.org/index.php/Tutorial#Sample_analysis_with_perf_report

https://blog.csdn.net/ggsyxhhhh/article/details/104739296/

https://blog.csdn.net/runafterhit/article/details/107801860

perf和火焰图使用方法

安装

x86安装

交叉编译

使用方法

总览

help

list

event

属性

stat

命令解析

举例

record

命令解析

举例

report

举例

annotate

命令解析

举例

top

命令解析

举例

bench

命令解析

举例

使用perf分析完整例子

总揽全局

精准导航

可视化分析

火焰图

使用方法

火焰图解读

互动

其他

总结

相关推荐