Re: [HTCondor-devel] CPI2: CPU performance isolation for shared compute clusters


Date: Sat, 13 Apr 2013 12:16:55 -0500
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] CPI2: CPU performance isolation for shared compute clusters
Hi Matt,

I had never looked into this before, but I found (in the RHEL6 manual of all things!) that there is a "perf_event" cgroup controller.

This would allow us to, amongst other things, record CPI (cycles per instruction) for every HTCondor job and report it in the resulting ClassAds.

For example, from a running job on our cluster:

[root@node110 ~]# mkdir /cgroup/perf_event/foo
[root@node110 ~]# echo 13020 > /cgroup/perf_event/foo/tasks
[root@node110 ~]# sudo perf stat -a -e task-clock,cpu-cycles,branches,branch-misses,instructions,cs,faults,migrations,stalled-cycles-frontend,stalled-cycles-backend -G /foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo sleep 5

 Performance counter stats for 'sleep 5':

       4956.990670 task-clock                /foo #    0.991 CPUs utilized           [99.98%]
    11,654,810,547 cpu-cycles                /foo #    2.351 GHz                     [83.29%]
       959,551,261 branches                  /foo #  193.575 M/sec                   [83.34%]
        24,915,394 branch-misses             /foo #    2.60% of all branches         [66.62%]
    11,423,755,623 instructions              /foo #    0.98  insns per cycle        
                                             #    0.68  stalled cycles per insn [83.30%]
               120 cs                        /foo #    0.024 K/sec                   [99.99%]
                 0 faults                    /foo #    0.000 K/sec                   [99.99%]
                 0 migrations                /foo #    0.000 K/sec                   [99.99%]
     7,720,637,500 stalled-cycles-frontend   /foo #   66.24% frontend cycles idle    [83.33%]
       925,166,949 stalled-cycles-backend    /foo #    7.94% backend  cycles idle    [83.35%]

       5.001046504 seconds time elapsed

(Adding "-x ," to the perf stat command produces a machine-readable, comma-separated version of this output.)
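
As a rough illustration of how CPI could be derived from counters like these, here is a minimal Python sketch. It parses a few lines copied from the human-readable output above, not the "-x ," format, and divides cpu-cycles by instructions for the /foo cgroup. The names PERF_OUTPUT, parse_counters and cpi are hypothetical helpers made up for this example; nothing here is existing HTCondor code.

```python
import re

# Lines copied from the perf stat output above (human-readable format).
PERF_OUTPUT = """\
       4956.990670 task-clock                /foo #    0.991 CPUs utilized           [99.98%]
    11,654,810,547 cpu-cycles                /foo #    2.351 GHz                     [83.29%]
    11,423,755,623 instructions              /foo #    0.98  insns per cycle
"""

# Matches: <count> <event-name> <cgroup>, e.g. "11,654,810,547 cpu-cycles /foo".
LINE_RE = re.compile(r"^\s*([\d,]+(?:\.\d+)?)\s+([\w-]+)\s+(/\S*)")


def parse_counters(text):
    """Return {(cgroup, event): value} for every counter line in text."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            value, event, cgroup = match.groups()
            counters[(cgroup, event)] = float(value.replace(",", ""))
    return counters


def cpi(counters, cgroup):
    """Cycles per instruction for one cgroup (hypothetical helper)."""
    return counters[(cgroup, "cpu-cycles")] / counters[(cgroup, "instructions")]


counters = parse_counters(PERF_OUTPUT)
job_cpi = cpi(counters, "/foo")
print(f"CPI for /foo: {job_cpi:.3f}")
```

For the run above this gives a CPI of roughly 1.02, which is the inverse of the 0.98 insns per cycle that perf reports.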

Performance counters would be FANTASTIC to have, as users typically have no visibility into this kind of data.

Brian

On Apr 12, 2013, at 6:23 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:

> http://research.google.com/pubs/pub40737.html
> 
> Interesting approach using cycles-per-instruction as a health metric
