Hi all,

Hope everyone is keeping well. I have an interesting, irregular issue that occurs on our worker nodes. We currently run Docker containers on our workers with HTCondor 10.0.9. Some of our newer worker nodes can run ~250 jobs per physical node, which can leave the system heavily loaded. As a result, Docker is sometimes slow to respond or appears to hang, and the Startd then advertises the following in its ClassAds:

DockerOfflineReason = Docker hung trying to rm an orphaned container
and sets HasDocker = false (ATTR_HAS_DOCKER in the source).
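
For reference, here is a rough sketch in Python of how the affected nodes could be listed. It assumes condor_status is on the PATH and that the ads it returns carry the HasDocker and DockerOfflineReason attributes; the function names are made up for illustration and are not part of HTCondor.

```python
import json
import subprocess


def fetch_startd_ads():
    """Ask the collector for the attributes of interest, as a list of dicts.

    Assumption: condor_status is installed and can reach the collector.
    """
    out = subprocess.run(
        ["condor_status", "-json",
         "-attributes", "Machine,HasDocker,DockerOfflineReason"],
        check=True, capture_output=True, text=True,
    ).stdout
    # An empty result may print nothing at all, so guard before parsing.
    return json.loads(out) if out.strip() else []


def docker_offline_nodes(ads):
    """Map machine name -> offline reason for ads where HasDocker is not true."""
    offline = {}
    for ad in ads:
        if ad.get("HasDocker") is True:
            continue
        machine = ad.get("Machine", "<unknown>")
        offline[machine] = ad.get("DockerOfflineReason", "<no reason advertised>")
    return offline
```

Calling docker_offline_nodes(fetch_startd_ads()) on a host with access to the pool would return one entry per machine that is not currently advertising Docker. Note that it would also list machines that never advertised Docker in the first place.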

Looking at the source, I see this behaviour defined here: https://github.com/htcondor/htcondor/blob/main/src/condor_startd.V6/util.cpp#L244C34-L244C34

As the Docker hang is often recoverable, is there functionality in HTCondor to re-evaluate Docker's status without having to restart the HTCondor daemon or manually amend these ClassAd attributes?

Many thanks,

Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX