Hi Thomas,
Looking at the log message you posted, it seems the docker command is hanging and the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If it were condor declaring the timeout, you would see a message
like the one provided, followed immediately by "Declaring a hung docker".
What is the reason for using a wrapper python script around the docker commands?
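For illustration only, a wrapper that enforces its own timeout on a docker command might look roughly like the sketch below. I have not seen the contents of /usr/local/bin/docker.py, so everything here is hypothetical: the function name run_with_timeout, the 20-second default, and the choice of errno.ETIMEDOUT as the exit status are all assumptions.

```python
import errno
import subprocess
import sys


def run_with_timeout(argv, timeout_s=20.0):
    """Run argv and return its exit code, or errno.ETIMEDOUT if it hangs.

    Hypothetical helper: the name, default timeout, and return value
    on timeout are assumptions, not taken from the real docker.py.
    """
    try:
        # subprocess.run kills and reaps the child itself before
        # raising TimeoutExpired.
        completed = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"Timed out after {timeout_s}s: {' '.join(argv)}", file=sys.stderr)
        return errno.ETIMEDOUT
    return completed.returncode


# Possible use inside a wrapper script (assumed, not the real script):
#   sys.exit(run_with_timeout(["docker", *sys.argv[1:]]))
```

If the wrapper does something along these lines, the limit being hit would be the one set inside the script rather than one set by condor.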
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 23, 2023 5:55 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Extend Docker container removal timeout

Hi all,
I hope everyone is keeping well. Quick question for the community: we are seeing intermittent timeouts when containers are removed on our nodes, with the logs showing the following:
condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)
Is there a knob / config option for extending the removal timeout for containers and jobs on the startds? Docker does eventually remove the container, but as the workers have very high I/O at times, the node may need more time to respond to the startd. We're running Condor 9.0.15 across the pool.
Many thanks in advance,
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot