Dear CHTC users,
This email is for users of our High Throughput Computing (HTC) system.
Outage Updates
Resolved:
On Friday, Docker jobs were unable to connect to the internet. All Docker jobs should now be able to connect.
In Progress:
We are in the process of addressing the issue where Docker jobs may fail to download the Docker container image. That issue should become less frequent over the next few days.
The CHTC Status page has been updated accordingly.
Contact
Email chtc@xxxxxxxxxxx with any questions or to report potential system issues.
Reminder of Extra Software Office Hours Wednesday and Monday
If you are planning to create a container for your software, please consider joining us for our additional in-person office hours this Wednesday, May 15, from 10 am to 12 pm, or Monday, May 20, from 1 to 3 pm, in the Discovery Building.
Best,
CHTC Facilitation Team
Date: Friday, May 10, 2024 at 4:49 PM
To: chtc-users <chtc-users@xxxxxxxxxxx>
Subject: Partial outages affecting the HTC system
Dear CHTC users,
This email is for users of our High Throughput Computing (HTC) system.
Multiple unrelated issues are currently affecting the HTC system:
Reduced HTC Capacity
- An unexpected power outage is impacting one of our server rooms and may shut down some of our execution points. This may reduce the size of the pool and lead to longer-than-usual queue times.
Issues with Docker Jobs
- Docker jobs on machines running CentOS Stream 9 may not be able to access the internet due to issues with our firewall. This may cause varied and unusual error messages, depending on your program and whether it requires network access.
- To resolve this issue, many of our nodes will need to be rebooted. These reboots are starting now and will cause jobs to be interrupted. Interrupted jobs should remain in the queue and be restarted by HTCondor.
- Docker jobs may fail to download the Docker container image. Such jobs go on hold with a message like "Error from slotY_ZZ@xxxxxxxxxxxxxxxxxxx: Unable to find image" followed by messages containing "Pulling fs layer". This is separate from the above issues, and we are working to resolve it.
- To work around this issue, you can release the job with "condor_release JobID" and the job will hopefully run on a different machine. If it does not, you can add a requirement to your submit file to avoid the bad machine: requirements = Machine != "eXXX.chtc.wisc.edu" (see the example below).
We will send an update next week as we continue to resolve these issues. You can also monitor the progress of these issues on our status page:
https://status.chtc.wisc.edu/
In the meantime, we continue to encourage users to transition to containerizing their software and to attend our additional in-person office hours next week on Wednesday from 10 am to 12 pm in the Discovery Building.
Best,
The Facilitation Team