Hi,

we have observed two things with cgroups for managing jobs and would like to ask whether anybody has seen something similar.

- Every now and then a job gets no cgroup shares set, resulting in larger (relative) CPU and memory shares than intended. E.g., a share of ~20.4% for an 8 core job on a 48 core node, while its share should be 16.7% (same for the memory shares) [details below]. Apparently, condor could not apply the cgroup limits for such a job, so the defaults got used on the node [SL 6.7, Condor 8.4.8]. While it seems to be a transient problem, such jobs keep appearing every now and then on our nodes.

- On another node, we observed an overbooking, i.e., seven 8 core jobs on a 48 core machine, resulting in relative shares of ~14.3% [6]. Here, the node had one longer-running job and had otherwise run empty following problems with our schedd [7]. After restarting the scheduler, the node's startd got new jobs but (as a guess) may not have taken the still running job into account? So I would assume that the overbooking was just a glitch?

Cheers,
  Thomas

---------------------
[details]

The general htcondor cgroup has a share of 1024, i.e. 100%, since it is the only group on the node [1,2]. While within the cgroup 5 jobs had a CPU share of 800 each, this one job in dynamic slot1_2 had a share of 1024 of the overall group's share. So it got a share of ~20.4% vs. ~15.9% for each of the other jobs on a node with 48 cores.

The jobs had been submitted via an ARC CE, each requesting 8 cores [3]. With the six jobs on the node, I would have expected a nominal CPU share of 16.7% per job.

While the starter tried to set the cgroup limits [5], it failed for each resource with '50016 Invalid argument'.

So far I have found only this question relating to condor
https://www-auth.cs.wisc.edu/lists/htcondor-users/2014-November/threads.shtml#00023
and a bug report on the error message for libvirt, but I do not see an obvious relationship to condor.
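The percentages quoted above follow from how sibling cgroups split CPU time under full contention: each group gets its cpu.shares value divided by the sum over all siblings. A minimal sketch of that arithmetic, assuming all jobs are busy at the same time; the function name `relative_shares` is made up for illustration and is not part of condor:

```python
def relative_shares(cpu_shares):
    """Return each sibling cgroup's fraction of CPU time under full contention."""
    total = sum(cpu_shares)
    return [share / total for share in cpu_shares]


# cpu.shares values as listed in [1]: one job at the default 1024, five at 800.
observed = [800, 1024, 800, 800, 800, 800]
for share, fraction in zip(observed, relative_shares(observed)):
    print(f"cpu.shares={share}: {fraction:.1%}")

# Nominal case: six 8 core jobs that all got 800.
print(f"nominal: {relative_shares([800] * 6)[0]:.1%}")

# Overbooked node from [6]: seven jobs at 800.
print(f"overbooked: {relative_shares([800] * 7)[0]:.1%}")
```

This gives 20.4% for the job left at 1024, 15.9% for each of the others, 16.7% for the nominal case and 14.3% for the overbooked node.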
[1]
> /cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_2\@batch0209.desy.de/cpu.shares
1024
> cat /cgroup/cpu/htcondor/condor_var_lib_condor_execute_*/cpu.shares
800
1024
800
800
800
800

[2]
> cat /cgroup/cpu/htcondor/cpu.shares
1024
> cat /cgroup/cpu/cpu.shares
1024

[3]
> cat /var/lib/condor/execute/dir_3388323/VUaNDm1B0FpnvxDnJpv7dLGoABFKDmABFKDmNaWKDmCBFKDmEzacam.diag
runtimeenvironments=ENV/GLITE;ENV/PROXY;
nodename=batch0209.desy.de
Processors=8

[4]
drwxr-xr-x 2 root root 0 Oct 17 10:21 condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:22 condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_5@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:24 condor_var_lib_condor_execute_slot1_6@xxxxxxxxxxxxxxxxx

[5]
grep "10/17/16 10:22" /var/log/condor/StarterLog.slot1_2
10/17/16 10:22:34 (pid:3394100) ******************************************************
10/17/16 10:22:34 (pid:3394100) ** condor_starter (CONDOR_STARTER) STARTING UP
10/17/16 10:22:34 (pid:3394100) ** /usr/sbin/condor_starter
10/17/16 10:22:34 (pid:3394100) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/17/16 10:22:34 (pid:3394100) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/17/16 10:22:34 (pid:3394100) ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
10/17/16 10:22:34 (pid:3394100) ** $CondorPlatform: x86_64_RedHat6 $
10/17/16 10:22:34 (pid:3394100) ** PID = 3394100
10/17/16 10:22:34 (pid:3394100) ** Log last touched 10/17 03:15:29
10/17/16 10:22:34 (pid:3394100) ******************************************************
10/17/16 10:22:34 (pid:3394100) Using config source: /etc/condor/condor_config
10/17/16 10:22:34 (pid:3394100) Using local config sources:
10/17/16 10:22:34 (pid:3394100) /etc/condor/config.d/00worker.conf
10/17/16 10:22:34 (pid:3394100) /etc/condor/config.d/01grid.conf
10/17/16 10:22:34 (pid:3394100) /etc/condor/config.d/20rebooter.conf
10/17/16 10:22:34 (pid:3394100) /etc/condor/condor_config.local
10/17/16 10:22:34 (pid:3394100) config Macros = 100, Sorted = 99, StringBytes = 3219, TablesBytes = 3672
10/17/16 10:22:34 (pid:3394100) CLASSAD_CACHING is OFF
10/17/16 10:22:34 (pid:3394100) Daemon Log is logging: D_ALWAYS D_ERROR
10/17/16 10:22:34 (pid:3394100) SharedPortEndpoint: waiting for connections to named socket 7920_1b3b_14789
10/17/16 10:22:34 (pid:3394100) DaemonCore: command socket at <131.169.160.40:9620?addrs=131.169.160.40-9620&noUDP&sock=7920_1b3b_14789>
10/17/16 10:22:34 (pid:3394100) DaemonCore: private command socket at <131.169.160.40:9620?addrs=131.169.160.40-9620&noUDP&sock=7920_1b3b_14789>
10/17/16 10:22:34 (pid:3394100) Communicating with shadow <131.169.223.111:9620?addrs=131.169.223.111-9620&noUDP&sock=1950804_a55a_233>
10/17/16 10:22:34 (pid:3394100) Submitting machine is "grid-arcce1.desy.de"
10/17/16 10:22:34 (pid:3394100) setting the orig job name in starter
10/17/16 10:22:34 (pid:3394100) setting the orig job iwd in starter
10/17/16 10:22:34 (pid:3394100) Chirp config summary: IO false, Updates false, Delayed updates true.
10/17/16 10:22:34 (pid:3394100) Initialized IO Proxy.
10/17/16 10:22:34 (pid:3394100) Done setting resource limits
10/17/16 10:22:34 (pid:3394100) File transfer completed successfully.
10/17/16 10:22:35 (pid:3394100) Job 261123.0 set to execute immediately
10/17/16 10:22:35 (pid:3394100) Starting a VANILLA universe job with ID: 261123.0
10/17/16 10:22:35 (pid:3394100) IWD: /var/lib/condor/execute/dir_3394100
10/17/16 10:22:35 (pid:3394100) Output file: /var/lib/condor/execute/dir_3394100/_condor_stdout
10/17/16 10:22:35 (pid:3394100) Error file: /var/lib/condor/execute/dir_3394100/_condor_stdout
10/17/16 10:22:35 (pid:3394100) Renice expr "0" evaluated to 0
10/17/16 10:22:35 (pid:3394100) About to exec /var/lib/condor/execute/dir_3394100/condor_exec.exe
10/17/16 10:22:35 (pid:3394100) Running job as user cmsger014
10/17/16 10:22:35 (pid:3394100) Create_Process succeeded, pid=3394104
10/17/16 10:22:35 (pid:3394100) Limiting (soft) memory usage to 21072183296 bytes
10/17/16 10:22:35 (pid:3394100) Limiting (hard) memory usage to 143033016320 bytes
10/17/16 10:22:35 (pid:3394100) Unable to commit memory soft limit for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016 Invalid argument
10/17/16 10:22:35 (pid:3394100) Limiting memsw usage to 143033020416 bytes
10/17/16 10:22:35 (pid:3394100) Unable to commit memsw limit for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016 Invalid argument
10/17/16 10:22:35 (pid:3394100) Unable to commit CPU shares for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx: 50016 Invalid argument

[6]
[root@batch0216 ~]# cat /cgroup/cpu/htcondor/condor_var_lib_condor_execute_slot1_*/cpu.shares
800
800
800
800
800
800
800

[7]
ls -call /cgroup/cpu/htcondor/ | sort -k8
drwxr-xr-x 2 root root 0 Oct 16 01:13 condor_var_lib_condor_execute_slot1_8@xxxxxxxxxxxxxxxxx
drwxr-xr-x 9 root root 0 Oct 17 03:15 .
-rw-rw-r-- 1 root root 0 Oct 17 03:15 tasks
drwxr-xr-x 2 root root 0 Oct 17 10:21 condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:22 condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_3@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:23 condor_var_lib_condor_execute_slot1_5@xxxxxxxxxxxxxxxxx
drwxr-xr-x 2 root root 0 Oct 17 10:24 condor_var_lib_condor_execute_slot1_6@xxxxxxxxxxxxxxxxx
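
To see how often this happens across slots, the "Unable to commit" lines from [5] could be collected from the starter logs with something like the sketch below. The pattern is derived only from the three failure lines quoted in [5] and may not match other HTCondor versions; `find_cgroup_failures` is a made-up helper name, not a condor tool.

```python
import re

# Pattern modelled on the failure lines in [5]; the wording is an assumption
# beyond those three samples.
FAILURE = re.compile(
    r"Unable to commit (?P<what>.+?) for (?P<cgroup>\S+)\s*: (?P<error>.+)$"
)


def find_cgroup_failures(lines):
    """Return (limit, cgroup, error) tuples for every failed cgroup commit."""
    failures = []
    for line in lines:
        match = FAILURE.search(line)
        if match:
            failures.append(
                (match.group("what"), match.group("cgroup"), match.group("error").strip())
            )
    return failures


# Sample lines copied from [5]; for a real check one would read e.g.
# /var/log/condor/StarterLog.slot1_2 instead.
sample = [
    "10/17/16 10:22:35 (pid:3394100) Limiting (soft) memory usage to 21072183296 bytes",
    "10/17/16 10:22:35 (pid:3394100) Unable to commit memory soft limit for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016 Invalid argument",
    "10/17/16 10:22:35 (pid:3394100) Unable to commit memsw limit for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx : 50016 Invalid argument",
    "10/17/16 10:22:35 (pid:3394100) Unable to commit CPU shares for htcondor/condor_var_lib_condor_execute_slot1_2@xxxxxxxxxxxxxxxxx: 50016 Invalid argument",
]

for what, cgroup, error in find_cgroup_failures(sample):
    print(f"{what}: {error} ({cgroup})")
```

For the sample this reports the memory soft limit, the memsw limit and the CPU shares, each with '50016 Invalid argument'.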