Hi all,

in our quest to get preemption running in a somewhat reliable and predictable fashion, we have now found a problem when a job which requested a GPU is already running but is then preempted by the negotiator in favour of another job requesting a GPU.

The final lines of the node's startd log are:

04/09/20 07:04:29 Attempting to send update via TCP to collector condor1.atlas.local <10.20.30.16:9618>
04/09/20 07:04:29 slot1_3: Sent update to 1 collector(s)
04/09/20 07:05:14 slot1: Schedd addr = <10.20.30.16:9618?addrs=10.20.30.16-9618&noUDP&sock=1998632_e432_4>
04/09/20 07:05:14 slot1: Alive interval = 300
04/09/20 07:05:14 slot1: Schedd sending 1 preempting claims.
04/09/20 07:05:14 slot1_5: Canceled ClaimLease timer (48)
04/09/20 07:05:14 slot1_5: Changing state and activity: Claimed/Busy -> Preempting/Killing
04/09/20 07:05:14 slot1_5[48.2]: In Starter::kill() with pid 6043, sig 3 (SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(6043,3) [SIGQUIT]
04/09/20 07:05:14 slot1_5[48.2]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 slot1_5: Total execute space: 859473444
04/09/20 07:05:14 slot1: Received ClaimId from schedd (<10.10.38.22:9618?addrs=10.10.38.22-9618&noUDP&sock=6676_c25a_6>#1586415265#15#...)
04/09/20 07:05:14 slot1: Match requesting resources: cpus=1 memory=128 disk=0.1% GPUs=1
04/09/20 07:05:14 Got execute_dir = /local/condor/execute
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line 1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_startd.V6/ResAttributes.cpp
04/09/20 07:05:14 CronJobMgr: 1 jobs alive
04/09/20 07:05:14 slot1_4: Canceled ClaimLease timer (28)
04/09/20 07:05:14 slot1_4: Changing state and activity: Claimed/Busy -> Preempting/Killing
04/09/20 07:05:14 slot1_4[49.7]: In Starter::kill() with pid 5785, sig 3 (SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5785,3) [SIGQUIT]
04/09/20 07:05:14 slot1_4[49.7]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1_3: Canceled ClaimLease timer (25)
04/09/20 07:05:14 slot1_3: Changing state and activity: Claimed/Busy -> Preempting/Killing
04/09/20 07:05:14 slot1_3[49.6]: In Starter::kill() with pid 5783, sig 3 (SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5783,3) [SIGQUIT]
04/09/20 07:05:14 slot1_3[49.6]: in starter:killHard starting kill timer
04/09/20 07:05:14 startd exiting because of fatal exception.
04/09/20 07:05:25 Result of reading /etc/issue: Debian GNU/Linux 10 \n \l
04/09/20 07:05:25 Using IDs: 4 processors, 4 CPUs, 0 HTs
04/09/20 07:05:25 Reading condor configuration from '/etc/condor/condor_config'

The problem seems to be this:

04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line 1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_startd.V6/ResAttributes.cpp

At this point the master sees:

04/09/20 06:54:25 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 5719
04/09/20 06:54:29 Setting ready state 'Ready' for STARTD
04/09/20 07:05:14 DefaultReaper unexpectedly called on pid 5719, status 1024.
04/09/20 07:05:14 The STARTD (pid 5719) exited with status 4
04/09/20 07:05:14 Sending obituary for "/usr/sbin/condor_startd"
04/09/20 07:05:15 restarting /usr/sbin/condor_startd in 10 seconds
04/09/20 07:05:25 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 6536
04/09/20 07:05:28 Setting ready state 'Ready' for STARTD

One interesting bit is that this is not related to actual GPU usage per se: the job simply starts /bin/sleep and then does nothing - it only requests the GPU via its submit file.
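For completeness, the submit file is essentially just a do-nothing GPU request. I have not reproduced the exact file here, but it boils down to roughly the following sketch (file names and the sleep duration are illustrative; the resource requests match what the startd log reports as cpus=1 memory=128 GPUs=1):

# minimal sketch of the submit file: a sleep job that only asks for a GPU
universe       = vanilla
executable     = /bin/sleep
arguments      = 3600
request_cpus   = 1
request_memory = 128
request_gpus   = 1
log            = sleep.log
output         = sleep.out
error          = sleep.err
queue

A second job with the same request_gpus = 1 line is what triggers the preemption described above.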
Has anyone seen this or something similar? (Maybe this is the same issue that happened back in 2017? https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-November/msg00024.shtml)

Shall we continue here, or shall I send more information/logs somewhere to start a ticket?

Cheers and thanks a lot in advance for looking into this

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185