Dear CHTC users,
This email is for users of our High Throughput Computing (HTC) system.
Outage Updates
Resolved:
On Friday, Docker jobs were unable to connect to the internet. All Docker jobs should now be able to connect.
In Progress:
We are in the process of addressing the issue where Docker jobs may fail to download the Docker container image. That issue should become less frequent over the next few days.
The CHTC Status page has been updated accordingly.
Contact
Email chtc@xxxxxxxxxxx with any questions or to report potential system issues.
Reminder of Extra Software Office Hours Wednesday and Monday
If you are planning to create a container for your software, please consider joining us for our additional in-person office hours this Wednesday, May 15, from 10 am to 12 pm, or Monday, May 20, from 1 to 3 pm, in the Discovery Building.
Best,
CHTC Facilitation Team
Date: Friday, May 10, 2024 at 4:49 PM
To: chtc-users <chtc-users@xxxxxxxxxxx>
Subject: Partial outages affecting the HTC system
Dear CHTC users,
This email is for users of our High Throughput Computing (HTC) system.
Multiple unrelated issues are currently affecting the HTC system:
Reduced HTC Capacity
- An unexpected power outage is impacting one of our server rooms and may shut down some of our execution points. This may reduce the size of the pool and lead to longer-than-usual queue times.
Issues with Docker Jobs
- Docker jobs on machines running CentOS Stream 9 may not be able to access the internet due to issues with our firewall. This may cause varied and unusual error messages, depending on your program and whether it requires network access.
- To resolve this issue, many of our nodes will need to be rebooted. These reboots are starting now and will cause jobs to be interrupted. Interrupted jobs should remain in the queue and be restarted by HTCondor.
- Docker jobs may fail to download the Docker container image. Such jobs go on hold with a message like "Error from slotY_ZZ@xxxxxxxxxxxxxxxxxxx: Unable to find image" followed by messages containing "Pulling fs layer". This is separate from the above issues, and we are working to resolve it.
- To work around this issue, you can release the job with "condor_release JobID" and the job will hopefully run on a different machine. If it does not, you can add a requirement to your submit file to avoid the bad machine: requirements = Machine != "eXXX.chtc.wisc.edu" (see the example below).
We will send an update next week as we continue to resolve these issues. You can also monitor the progress of these issues on our status page:
https://status.chtc.wisc.edu/
In the meantime, we continue to encourage users to transition to containerizing their software and to attend our additional in-person office hours next week on Wednesday from 10 am to 12 pm in the Discovery Building.
Best,
The Facilitation Team