Hi Matt,
I had never looked into this before, but I found (in the RHEL6 manual of all things!) that there is a "perf_event" cgroup controller.
This would allow us to, amongst other things, record CPI (cycles per instruction) for all HTCondor jobs and report it in the resulting job classads.
For example, from a running job on our cluster:
[root@node110 ~]# mkdir /cgroup/perf_event/foo
[root@node110 ~]# echo 13020 > /cgroup/perf_event/foo/tasks
[root@node110 ~]# sudo perf stat -a -e task-clock,cpu-cycles,branches,branch-misses,instructions,cs,faults,migrations,stalled-cycles-frontend,stalled-cycles-backend -G /foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo sleep 5
 Performance counter stats for 'sleep 5':

       4956.990670  task-clock               /foo  #    0.991 CPUs utilized             [99.98%]
    11,654,810,547  cpu-cycles               /foo  #    2.351 GHz                       [83.29%]
       959,551,261  branches                 /foo  #  193.575 M/sec                     [83.34%]
        24,915,394  branch-misses            /foo  #    2.60% of all branches           [66.62%]
    11,423,755,623  instructions             /foo  #    0.98  insns per cycle
                                                   #    0.68  stalled cycles per insn   [83.30%]
               120  cs                       /foo  #    0.024 K/sec                     [99.99%]
                 0  faults                   /foo  #    0.000 K/sec                     [99.99%]
                 0  migrations               /foo  #    0.000 K/sec                     [99.99%]
     7,720,637,500  stalled-cycles-frontend  /foo  #   66.24% frontend cycles idle      [83.33%]
       925,166,949  stalled-cycles-backend   /foo  #    7.94% backend cycles idle       [83.35%]

       5.001046504 seconds time elapsed
(there's a machine-readable version of the output if you add "-x ,")
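For instance, something like this could pull CPI straight out of that CSV output (a rough, untested sketch: perf stat writes to stderr, and the exact CSV field layout varies between perf versions, so it just keys off the event name and takes the count from the first field):

# compute CPI for the /foo cgroup created above
perf stat -x , -a -e cpu-cycles,instructions -G /foo,/foo sleep 5 2>&1 | \
  awk -F, '/cpu-cycles/   { cycles = $1 }
           /instructions/ { insns  = $1 }
           END { if (insns > 0) printf "CPI = %.3f\n", cycles / insns }'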
Performance counters would be FANTASTIC to have as users typically have no clue about this data.
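To tie it back to jobs, I could imagine a starter-side wrapper doing roughly the following per job (again just an untested sketch; the cgroup name, the output path, and ./the_job all stand in for whatever the starter actually uses):

CG=job_$$                                    # hypothetical per-job cgroup name
mkdir /cgroup/perf_event/$CG
echo $$ > /cgroup/perf_event/$CG/tasks       # this shell and its children now count against the cgroup
perf stat -x , -a -e cpu-cycles,instructions -G /$CG,/$CG \
    ./the_job 2> /tmp/$CG.perf               # counters run only while the job does; perf's CSV lands in the file
                                             # (the job's own stderr would need separating in a real wrapper)
echo $$ > /cgroup/perf_event/tasks           # move back to the root cgroup so rmdir succeeds
rmdir /cgroup/perf_event/$CG

The counts in /tmp/$CG.perf could then be folded into the job's classad when it exits.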
Brian
On Apr 12, 2013, at 6:23 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> http://research.google.com/pubs/pub40737.html
>
> Interesting approach using cycles-per-instruction as a health metric
_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel