Re: [HTCondor-devel] CPI2: CPU performance isolation for shared compute clusters


Date: Sat, 13 Apr 2013 13:40:38 -0500
From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] CPI2: CPU performance isolation for shared compute clusters
Actually, it depends on the implementation:
- To track all activity in a cgroup, you need CAP_SYS_ADMIN.
- To track a specific process, you can do so as a "normal user", though you can only track your own processes.

So a careful implementation should work for both root and non-root HTCondor.  The implementation is more difficult for non-root, but the procd would do most of the heavy lifting.  One FD is required per attribute per group (where a group is either a PID or a cgroup), so we may need to watch the number of FDs consumed carefully.
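For illustration, a minimal sketch of the two attach modes (the PID, the cgroup path, and the open_cycles helper are just placeholders; this assumes a perf_event hierarchy mounted at /cgroup/perf_event, as on RHEL6):

#include <fcntl.h>
#include <linux/perf_event.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>

/* Helper: open a cycle counter for the given pid/cpu/flags combination. */
static long open_cycles(pid_t pid, int cpu, unsigned long flags)
{
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    return syscall(__NR_perf_event_open, &pe, pid, cpu, -1, flags);
}

int main(void)
{
    /* As a normal user: track one of your own processes, on any CPU. */
    long fd_pid = open_cycles(1234, -1, 0);

    /* With CAP_SYS_ADMIN: track everything in a cgroup.  The pid
     * argument becomes a cgroup directory fd, and a specific CPU is
     * required, so full coverage takes one event per CPU. */
    int cg = open("/cgroup/perf_event/foo", O_RDONLY | O_DIRECTORY);
    long fd_cg = open_cycles(cg, 0, PERF_FLAG_PID_CGROUP);

    return (fd_pid != -1 && fd_cg != -1) ? 0 : 1;
}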

Looking at the man page, it should be possible for multiple perf instances to monitor the same processes, but the data may become statistical if the processor runs out of PMU slots: the kernel multiplexes the events, so for some CPUs two perf instances might each get the counter registers for half of the time the process runs on that CPU.  So recursive glideins should also be OK; I don't know at what depth the wheels fall off, though.
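If the kernel does multiplex, the raw counts can be scaled back up from the time the event actually spent on the PMU.  A minimal sketch, assuming a single (non-group) event opened with the TOTAL_TIME_ENABLED and TOTAL_TIME_RUNNING read_format flags:

#include <unistd.h>

/* Read one event opened with read_format =
 * PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
 * and scale the raw count up to an estimate of the full-period value
 * when the kernel had to multiplex the PMU. */
static long long read_scaled(int fd)
{
    struct { long long value, enabled, running; } r;
    if (read(fd, &r, sizeof r) != sizeof r)
        return -1;
    if (r.running > 0 && r.running < r.enabled)
        return (long long)((double)r.value * r.enabled / r.running);
    return r.value;
}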

If the glidein uses glexec, it won't be able to track payload processes without a glexec invocation.  Of course, none of the condor_procd monitoring methods are guaranteed to work in future kernels for the glexec case (future kernels allow the sysadmin to disable poking around in /proc for other users' processes).

I've copied below an example program that measures CPI for PID 2055, which is an unrelated shell owned by my user.  The output is:

[bbockelm@hcc-briantest tmp]$  gcc perf_test.c -o perf_test &&  ./perf_test 
Measuring instruction count for this printf
Used 252238 cycles
Used 65446 instructions
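(That works out to 252238 / 65446 ≈ 3.85 cycles per instruction over the sampled interval.)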

Brian

On Apr 13, 2013, at 1:26 PM, Igor Sfiligoi <sfiligoi@xxxxxxxx> wrote:

> But I guess this is not available to normal users, right?
> (i.e. the glidein use case)
> 
> Igor
> 
> On 04/13/2013 10:16 AM, Brian Bockelman wrote:
>> Hi Matt,
>> 
>> I had never looked into this before, but I found (in the RHEL6 manual of all things!) that there is a "perf_event" cgroup controller.
>> 
>> This would allow us to, among other things, record CPI for all HTCondor jobs and report it in the resulting ClassAds.
>> 
>> For example, from a running job on our cluster:
>> 
>> [root@node110 ~]# mkdir  /cgroup/perf_event/foo
>> [root@node110 ~]# echo 13020 > /cgroup/perf_event/foo/tasks
>> [root@node110 ~]# sudo perf stat -a -e task-clock,cpu-cycles,branches,branch-misses,instructions,cs,faults,migrations,stalled-cycles-frontend,stalled-cycles-backend -G /foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo,/foo sleep 5
>> 
>>  Performance counter stats for 'sleep 5':
>> 
>>        4956.990670 task-clock                /foo #    0.991 CPUs utilized           [99.98%]
>>     11,654,810,547 cpu-cycles                /foo #    2.351 GHz                     [83.29%]
>>        959,551,261 branches                  /foo #  193.575 M/sec                   [83.34%]
>>         24,915,394 branch-misses             /foo #    2.60% of all branches         [66.62%]
>>     11,423,755,623 instructions              /foo #    0.98  insns per cycle
>>                                              #    0.68  stalled cycles per insn [83.30%]
>>                120 cs                        /foo #    0.024 K/sec                   [99.99%]
>>                  0 faults                    /foo #    0.000 K/sec                   [99.99%]
>>                  0 migrations                /foo #    0.000 K/sec                   [99.99%]
>>      7,720,637,500 stalled-cycles-frontend   /foo #   66.24% frontend cycles idle    [83.33%]
>>        925,166,949 stalled-cycles-backend    /foo #    7.94% backend  cycles idle    [83.35%]
>> 
>>        5.001046504 seconds time elapsed
>> 
>> (there's a machine-readable version of the output if you add "-x ,")
>> 
>> Performance counters would be FANTASTIC to have, as users typically have no clue about this data.
>> 
>> Brian
>> 
>> On Apr 12, 2013, at 6:23 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>> 
>>> http://research.google.com/pubs/pub40737.html
>>> 
>>> Interesting approach using cycles-per-instruction as a health metric
>>> _______________________________________________
>>> HTCondor-devel mailing list
>>> HTCondor-devel@xxxxxxxxxxx
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
>> 
>> 
>> 
>> _______________________________________________
>> HTCondor-devel mailing list
>> HTCondor-devel@xxxxxxxxxxx
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel
>> 


#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>
#include <fcntl.h>

/* glibc provides no wrapper for perf_event_open(2), so call it via
 * syscall(2) directly. */
long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count[10];  /* group read: { nr, cycles, instructions } */
    int c_fd, fd, fd2;

    /* The cgroup fd is unused below; with CAP_SYS_ADMIN it could be
     * passed as the pid argument (plus PERF_FLAG_PID_CGROUP) to
     * monitor the whole cgroup instead of a single PID. */
    if ((c_fd = open("/cgroup/perf_event/foo", O_RDONLY | O_DIRECTORY)) == -1) {
        fprintf(stderr, "Unable to open perf cgroup. (errno=%d, %s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;        /* start disabled; enabled explicitly below */
    pe.exclude_kernel = 1;  /* count user-space activity only */
    pe.exclude_hv = 1;
    pe.read_format = PERF_FORMAT_GROUP;  /* one read() returns the whole group */

    /* Attach to PID 2055 on any CPU (cpu = -1); group_fd = -1 makes
     * this event the group leader. */
    fd = perf_event_open(&pe, 2055, -1, -1, 0);
    if (fd == -1) {
       fprintf(stderr, "Error opening leader %llx (errno=%d, %s)\n", pe.config, errno, strerror(errno));
       exit(EXIT_FAILURE);
    }

    /* Second event in the same group: pass the leader's fd as group_fd.
     * Followers are not individually disabled; they are scheduled in
     * and out together with the leader. */
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 0;
    fd2 = perf_event_open(&pe, 2055, -1, fd, 0);
    if (fd2 == -1) {
        fprintf(stderr, "Error opening group follower %llx (errno=%d, %s)\n", pe.config, errno, strerror(errno));
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");
    sleep(1);

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* PERF_FORMAT_GROUP layout: count[0] = number of events,
     * count[1] = cycles (leader), count[2] = instructions. */
    if (-1 == read(fd, count, sizeof(count))) {
        fprintf(stderr, "Error reading performance counters: %d, %s\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (count[0] != 2) {
        fprintf(stderr, "Kernel returned the wrong number of events (%lld).\n", count[0]);
        exit(EXIT_FAILURE);
    }

    printf("Used %lld cycles\n", count[1]);
    printf("Used %lld instructions\n", count[2]);

    close(fd);
    close(fd2);
}
 
