Hi all,
Thank you for the responses. We don’t explicitly set any timeouts within Condor or our Docker wrapper. I managed to see the error occur in real time and captured the logs more fully this time. I do see the statement "Declaring a hung docker"; please find the excerpt below:
May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[1318016]: condor_read(): timeout reading 1 bytes from Docker Socket.
May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[1004513]: condor_read(): timeout reading 1 bytes from Docker Socket.
May 25 16:34:40 lcg2256.gridpp.rl.ac.uk condor_starter[252549]: condor_read(): timeout reading 1 bytes from Docker Socket.
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob3189095_0_slot1_38_PID2292872': 'Timed out waiting for program to exit' (110)
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Declaring a hung docker
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: DockerAPI::rm returned docker_hung. Taking Docker universe offline
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: OfflineUniverses = {"Docker"}
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: slot1_38: State change: starter exited
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: slot1_38: Changing activity: Busy -> Idle
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_UNREGISTER_FAMILY
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: unregistering family with root pid 2292872
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_startd[2712]: Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_GET_USAGE for pid 2325658
May 25 16:34:41 lcg2256.gridpp.rl.ac.uk condor_procd[2695]: PROC_FAMILY_GET_USAGE for pid 2380875
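For anyone wanting to track how often this happens across nodes, here is a rough sketch of pulling these events out of syslog-style lines like the ones above. It assumes the line layout shown in this excerpt ("Mon DD HH:MM:SS host daemon[pid]: message"); the function name and the pairing of each "Declaring a hung docker" line with the preceding "Failed to read results from" line from the same daemon are my own choices for illustration, not anything Condor provides.

```python
import re

# Assumed layout, taken from the excerpt above:
# "<Mon> <day> <HH:MM:SS> <host> <daemon>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<message>.*)$"
)
FAILED_RE = re.compile(r"Failed to read results from '([^']*)'")


def find_hung_docker_events(lines):
    """Return (timestamp, host, failed_command) per 'Declaring a hung docker' line.

    failed_command is the command quoted in the most recent
    'Failed to read results from' line logged by the same daemon pid
    on the same host, or None if no such line was seen.
    """
    last_failure = {}  # (host, pid) -> command string
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        key = (match["host"], match["pid"])
        message = match["message"]
        failed = FAILED_RE.search(message)
        if failed:
            last_failure[key] = failed.group(1)
        elif message.startswith("Declaring a hung docker"):
            events.append((match["stamp"], match["host"], last_failure.get(key)))
    return events
```

Feeding it the lines from a journal or messages file (for example, `find_hung_docker_events(open(path))`, where `path` is wherever your syslog lands) gives one tuple per hung-docker declaration.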
The Docker daemon does (eventually) recover. Is there some recovery process within Condor that can clear the “DockerOffline*” ClassAds when the system is stable again? Needless to say, I’m continuing to look for the cause of our Docker instance becoming unavailable.
Many thanks,
Tom
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jose Caballero <jcaballero.hep@xxxxxxxxx>
Date: Friday, 26 May 2023 at 08:29
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Extend Docker container removal timeout
Hi Cole,
Our wrapper script is mostly legacy. It was created many years ago, when I believe the native support for containers in HTCondor was not as sophisticated as it has since become.
It mostly sets some parameters for the docker commands, in particular env vars for specific users. Do I understand correctly that all of that can now be done using ClassAds?
That said, and Tom can correct me if I'm wrong, I don't believe we have any timeout in that wrapper script. Not explicitly, at least.
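To give an idea of the kind of thing the wrapper does, here is a simplified, hypothetical sketch (not the actual docker.py). It assumes the docker subcommand is the first argument, as in the "rm -f -v ..." call in the log, and adds "-e NAME=value" options only for subcommands that accept them. The function name and the example variables are made up for illustration.

```python
def add_env_flags(docker_args, env):
    """Return a copy of docker_args with '-e NAME=value' options added.

    Options are inserted directly after the subcommand, and only when the
    subcommand (assumed to be docker_args[0]) is 'run' or 'create'.
    Any other command line is returned unchanged.
    """
    args = list(docker_args)
    if not args or args[0] not in ("run", "create"):
        return args
    flags = []
    for name, value in sorted(env.items()):  # sorted for a stable order
        flags += ["-e", f"{name}={value}"]
    return args[:1] + flags + args[1:]


# Hypothetical example: extra variables for one user's jobs.
#   add_env_flags(["run", "--rm", "some-image"], {"SITE_NAME": "example"})
# The result would then be handed to the real docker binary, with no
# timeout imposed by the wrapper itself.
```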
Looking at the log message posted, it seems the docker command is hanging and the wrapper script /usr/local/bin/docker.py is the one declaring the timeout and exiting with a failure. If it were condor declaring the timeout, you would see a message like the one provided, followed immediately by "Declaring a hung docker".
What is the reason for using a wrapper python script around the docker commands?
Hi all,
I hope everyone is keeping well. Quick question for the community: we have intermittent timeouts for containers on nodes, with the logs showing the following:
condor_startd[3404]: Failed to read results from '/usr/local/bin/docker.py rm -f -v HTCJob2693132_0_slot1_8_PID2466140': 'Timed out waiting for program to exit' (110)
Is there a knob / config option for extending the removal timeout for containers and jobs on startds? Docker does eventually remove the container, but as the workers have very high I/O at times, the node may need more time to supply a response to the startd. We’re running Condor 9.0.15 across the pool.
Many thanks in advance,
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX
