Dear CHTC users,
This email is for users of our High Throughput Computing (HTC) system.
Multiple coincidental issues are currently affecting the HTC system:
Reduced HTC Capacity
-
An unexpected power outage is impacting one of our server rooms, and may shut down some of our execution points. This may reduce the size of the pool and lead to longer-than-usual queue times.
Issues with Docker Jobs
-
Docker jobs on machines running CentOS Stream 9 may not be able to access the internet due to issues with our firewall. This may cause a variety of obscure error messages, depending on your program and whether it relies on network access.
-
To resolve this issue, many of our nodes will need to be rebooted. These reboots are starting now and will interrupt running jobs. Interrupted jobs should remain in the queue and be restarted by HTCondor.
-
Docker jobs may fail to download the Docker container image. Such jobs go on hold with a message like "Error from
slotY_ZZ@xxxxxxxxxxxxxxxxxxx: Unable to find image" followed by messages with "Pulling fs layer". This is separate from the above issues, and we are working to resolve it.
-
To work around this issue, you can release the job with "condor_release JobID", and the job will hopefully run on a different machine. If it does not, you can add a requirement to your submit file to avoid the bad machine:
requirements = (Machine != "eXXX.chtc.wisc.edu")
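If jobs land on more than one bad machine, the same requirement can be extended. The fragment below is only a sketch: both machine names are placeholders, so substitute the names that appear in your own hold messages (visible with "condor_q -hold"). Changes to a submit file apply only to jobs submitted after the change.

```
# Sketch only: both machine names are placeholders, not real hosts.
# Replace them with the machine names shown in your jobs' hold messages.
requirements = (Machine != "eXXX.chtc.wisc.edu") && (Machine != "eYYY.chtc.wisc.edu")
```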
We will send an update next week as we continue to resolve these issues. You can also monitor progress on our status page:
https://status.chtc.wisc.edu/
In the meantime, we continue to encourage users to containerize their software and to attend our additional in-person office hours next Wednesday from 10am to 12pm in the Discovery Building.
Best,
The Facilitation team