Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7
- Date: Thu, 9 Apr 2020 22:41:12 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7
Carsten, thanks for the log snippet. I see the failure:
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line 1272 ..
There was a known bug in this code when there were multiple GPUs that had the same device name
(i.e. the device list was CUDA0,CUDA0). Is that the case here?
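One quick way to check is to run the condor_gpu_discovery tool that ships with HTCondor
(under the libexec directory on a typical install) directly on that node; the device list it
prints is what the startd advertises. The output below is only a sketch of the format, not
your machine:

    $ condor_gpu_discovery
    DetectedGPUs="CUDA0, CUDA1"

If the same device name shows up more than once in that list, you are hitting that bug.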
-tj
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carsten Aulbert
Sent: Thursday, April 9, 2020 2:20 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] startd bug? Seems to be able to reliably kill startd with GPU preemption on 8.8.7
Hi all,
in our quest to get preemption running in a somewhat reliable and
predictable fashion, we have now found a problem when a job that requested a
GPU is already running but is then preempted by the negotiator in favor of
another job requesting a GPU.
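(For context, this is plain negotiator-side preemption; the knobs involved are along the
lines of the sketch below, which is only meant to show the shape of the setup, not our exact
expressions:)

    # negotiator configuration (illustrative sketch only)
    NEGOTIATOR_CONSIDER_PREEMPTION = True
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmitterUserPrio * 1.2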
The final lines of the node's startd log are:
04/09/20 07:04:29 Attempting to send update via TCP to collector
condor1.atlas.local <10.20.30.16:9618>
04/09/20 07:04:29 slot1_3: Sent update to 1 collector(s)
04/09/20 07:05:14 slot1: Schedd addr =
<10.20.30.16:9618?addrs=10.20.30.16-9618&noUDP&sock=1998632_e432_4>
04/09/20 07:05:14 slot1: Alive interval = 300
04/09/20 07:05:14 slot1: Schedd sending 1 preempting claims.
04/09/20 07:05:14 slot1_5: Canceled ClaimLease timer (48)
04/09/20 07:05:14 slot1_5: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_5[48.2]: In Starter::kill() with pid 6043, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(6043,3) [SIGQUIT]
04/09/20 07:05:14 slot1_5[48.2]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 slot1_5: Total execute space: 859473444
04/09/20 07:05:14 slot1: Received ClaimId from schedd
(<10.10.38.22:9618?addrs=10.10.38.22-9618&noUDP&sock=6676_c25a_6>#1586415265#15#...)
04/09/20 07:05:14 slot1: Match requesting resources: cpus=1 memory=128
disk=0.1% GPUs=1
04/09/20 07:05:14 Got execute_dir = /local/condor/execute
04/09/20 07:05:14 slot1: Total execute space: 859473444
04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line
1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_sta
rtd.V6/ResAttributes.cpp
04/09/20 07:05:14 CronJobMgr: 1 jobs alive
04/09/20 07:05:14 slot1_4: Canceled ClaimLease timer (28)
04/09/20 07:05:14 slot1_4: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_4[49.7]: In Starter::kill() with pid 5785, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5785,3) [SIGQUIT]
04/09/20 07:05:14 slot1_4[49.7]: in starter:killHard starting kill timer
04/09/20 07:05:14 slot1_3: Canceled ClaimLease timer (25)
04/09/20 07:05:14 slot1_3: Changing state and activity: Claimed/Busy ->
Preempting/Killing
04/09/20 07:05:14 slot1_3[49.6]: In Starter::kill() with pid 5783, sig 3
(SIGQUIT)
04/09/20 07:05:14 Send_Signal(): Doing kill(5783,3) [SIGQUIT]
04/09/20 07:05:14 slot1_3[49.6]: in starter:killHard starting kill timer
04/09/20 07:05:14 startd exiting because of fatal exception.
04/09/20 07:05:25 Result of reading /etc/issue: Debian GNU/Linux 10 \n \l
04/09/20 07:05:25 Using IDs: 4 processors, 4 CPUs, 0 HTs
04/09/20 07:05:25 Reading condor configuration from
'/etc/condor/condor_config'
The problem seems to be this:
04/09/20 07:05:14 bind_DevIds for slot1.1 before : GPUs:{CUDA0, }{1_5, }
04/09/20 07:05:14 ERROR "Failed to bind local resource 'GPUs'" at line
1272 in file /home/tim/CONDOR_SRC/.tmplCDN9v/condor-8.8.7/src/condor_sta
rtd.V6/ResAttributes.cpp
At this point the master sees:
04/09/20 06:54:25 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 5719
04/09/20 06:54:29 Setting ready state 'Ready' for STARTD
04/09/20 07:05:14 DefaultReaper unexpectedly called on pid 5719, status
1024.
04/09/20 07:05:14 The STARTD (pid 5719) exited with status 4
04/09/20 07:05:14 Sending obituary for "/usr/sbin/condor_startd"
04/09/20 07:05:15 restarting /usr/sbin/condor_startd in 10 seconds
04/09/20 07:05:25 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 6536
04/09/20 07:05:28 Setting ready state 'Ready' for STARTD
One interesting bit is that this is not related to actual GPU usage per se, as
the job simply starts /bin/sleep and then does nothing - it merely
requests the GPU via its submit file.
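(The submit file is essentially nothing more than the following sketch; the sleep duration is
just a placeholder, while the resource requests match what the startd logged for the match:)

    universe       = vanilla
    executable     = /bin/sleep
    arguments      = 3600
    request_cpus   = 1
    request_memory = 128
    request_gpus   = 1
    queue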
Has anyone seen this or something similar?
(maybe this is the same issue that happened back in 2017?
https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-November/msg00024.shtml)
Shall we continue here or shall I send more information/logs somewhere
to start a ticket?
Cheers, and thanks a lot in advance for looking into this
Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185