Hi all,

Yes, I have seen it. Most of the time it is related to a storage issue. For example, a job is running and the user decides to remove it. Condor then runs docker stop/rm, and Docker tries to kill the process while the process is in the middle of a close/write/open call. The process only stops once that system call returns, so the timeout is reasonable.

I think we need a periodic Docker check that brings the Docker universe back online.
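To make the idea concrete, here is a rough sketch of such a check in Python. It is only an illustration under my own assumptions, not anything HTCondor ships: the names probe_docker and watch, the 20 second timeout and the 5 minute interval are all made up, and the sketch only reports whether the Docker CLI answers in time. How the result would be fed back to the startd is left open.

```python
import subprocess
import sys
import time


def probe_docker(cmd=("docker", "version"), timeout_s=20.0):
    """Run one probe command and classify the outcome.

    Returns "ok", "error", "hung" or "missing". The default command and
    timeout are illustrative assumptions, not HTCondor settings.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "error"


def watch(interval_s=300.0, rounds=3, **probe_kwargs):
    """Probe repeatedly and return the list of observed states."""
    states = []
    for i in range(rounds):
        state = probe_docker(**probe_kwargs)
        print(f"docker probe {i + 1}/{rounds}: {state}", file=sys.stderr)
        states.append(state)
        if i < rounds - 1:
            time.sleep(interval_s)
    return states
```

Calling watch() from a cron job or a small daemon would show when Docker goes from "hung" back to "ok", which is the point at which the node could be considered for the Docker universe again.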
Thanks,
David
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jose Caballero <jcaballero.hep@xxxxxxxxx>
Sent: Thursday, November 16, 2023 9:40:09 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Docker hang re-evaluation

Hi,

Has anybody else seen this behaviour? If so, how did you fix it?
Or is there some ClassAd for the timeout that can be adjusted?
Any comment is more than welcome.

Cheers,
Jose
On Tue, 14 Nov 2023 at 15:28, Thomas Birkett - STFC UKRI via HTCondor-users (<htcondor-users@xxxxxxxxxxx>) wrote:

Hi all,
Hope everyone is keeping well. I have an interesting issue/irregular situation that occurs with our workernodes. We currently run Docker containers on our workers with Condor 10.0.9. Some of our newer workernodes can run ~250 jobs per physical node and this can lead to a highly loaded system. Due to this, there are times that Docker can be slow to respond or give the impression of a hang, leading to the following ClassAds for the Startd:
DockerOfflineReason = Docker hung trying to rm an orphaned container
And sets ATTR_HAS_DOCKER = false
Looking at the source I see this behaviour defined: https://github.com/htcondor/htcondor/blob/main/src/condor_startd.V6/util.cpp#L244C34-L244C34
As the Docker hang is ofttimes recoverable, is there functionality in Condor to re-evaluate Docker's status without having to restart the Condor daemon or manually amending these ClassAds?
Many thanks,
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
-- Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison Center for High Throughput Computing Department of Computer Sciences Calendar: https://tinyurl.com/yd55mtgd 1210 W. Dayton St. Rm #4257