Hi Thomas,
Looking at the log message you posted, it seems the docker command is hanging, and the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If condor were declaring the timeout, you would see a message like the one you provided followed immediately by "Declaring a hung docker".
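For illustration only, a wrapper that enforces its own timeout around docker could look roughly like the sketch below. The actual contents of /usr/local/bin/docker.py are not shown in this thread, so the function names and the 20-second value are hypothetical.

```python
import subprocess
import sys

# Hypothetical value; the real wrapper's timeout (if any) is not known here.
WRAPPER_TIMEOUT_SECONDS = 20


def run_with_timeout(argv, timeout=WRAPPER_TIMEOUT_SECONDS):
    """Run argv and return its exit code, or 1 if it exceeds the timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        completed = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"wrapper: {argv!r} timed out after {timeout}s", file=sys.stderr)
        return 1
    return completed.returncode


def main(args):
    """Forward the wrapper's arguments to the real docker binary."""
    # e.g. args = ["rm", "-f", "-v", "<container name>"]
    return run_with_timeout(["docker", *args])

# In an actual wrapper script this would end with:
#     sys.exit(main(sys.argv[1:]))
```

With a wrapper of this shape, the failure the startd sees is decided by the wrapper's own limit rather than by condor, so raising a condor-side timeout alone would not change the outcome.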
What is the reason for using a wrapper python script around the docker commands?
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, May 23, 2023 5:55 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Extend Docker container removal timeout

Hi all,

I hope everyone is keeping well. A quick question for the community: we have intermittent timeouts for containers on nodes, with the logs showing the following:

condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)

Is there a knob or config option for extending the removal timeout for containers and jobs on startds? Docker does eventually remove the container, but because the workers have very high I/O at times, a node may need more time to respond to the startd. We're running Condor 9.0.15 across the pool.

Many thanks in advance,

Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/