Partial outages affecting the HTC system

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date:	Fri, 10 May 2024 21:49:01 +0000
From:	chtc-users@xxxxxxxxxxx
Subject:	Partial outages affecting the HTC system

Dear CHTC users,

This email is for users of our High Throughput Computing (HTC) system.

Multiple coincidental issues are currently affecting the HTC system:

Reduced HTC Capacity

An unexpected power outage is impacting one of our server rooms, and may shut down some of our execution points. This may reduce the size of the pool and lead to longer-than-usual queue times.

Issues with Docker Jobs

Docker jobs on the machines running CentOS Stream 9 may not be able to access the internet due to issues with our firewall. This may cause various and esoteric error messages, depending on your program and whether it relies on network access.

To solve this issue, many of our nodes will need to be rebooted. These reboots are starting now and will cause jobs to be interrupted. Interrupted jobs should remain in the queue and be restarted by HTCondor.

Docker jobs may fail to download the Docker container image. Such jobs go on hold with a message like "Error from slotY_ZZ@xxxxxxxxxxxxxxxxxxx: Unable to find image" followed by messages with "Pulling fs layer". This is separate from the above issues, and we are working to resolve it.
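If you have many held jobs, it can help to separate the ones held for this image-pull problem from jobs held for other reasons. The following is a minimal sketch, assuming you have already collected each job's hold reason as text (for example from the output of "condor_q -hold"). The function name and the sample job IDs and messages are hypothetical; only the "Unable to find image" phrase comes from the message quoted above.

```python
# Marker phrase taken from the hold message quoted in this email.
IMAGE_PULL_MARKER = "Unable to find image"


def is_image_pull_failure(hold_reason):
    """Return True if a hold reason looks like the Docker image-pull failure."""
    return IMAGE_PULL_MARKER in hold_reason


# Hypothetical hold reasons keyed by job ID, for illustration only.
held_jobs = {
    "1234.0": "Error from slotY_ZZ@host: Unable to find image 'example:latest' locally",
    "1234.1": "Some other hold reason unrelated to Docker",
}

# Keep only the jobs whose hold reason matches the image-pull failure.
to_release = [job_id for job_id, reason in held_jobs.items()
              if is_image_pull_failure(reason)]
print(to_release)
```

Jobs held for other reasons usually need a different fix, so releasing them unchanged is unlikely to help.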

To work around this issue, you can release the job with "condor_release JobID", and the job will hopefully run on a different machine. If it does not, you can add a requirement to your submit file to avoid the bad machine (note that string values in HTCondor expressions take double quotes, not single quotes):

    requirements = (Machine != "eXXX.chtc.wisc.edu")
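If a job lands on more than one bad machine, the requirement can be extended by joining one clause per machine with "&&". The following is a small sketch of how such a line could be generated; the helper name is hypothetical, and the hostnames are the same kind of placeholder used above, not real machines. String values are written with double quotes, as HTCondor expressions expect.

```python
def exclusion_requirements(bad_machines):
    """Build a submit-file 'requirements' line that avoids the given machines."""
    if not bad_machines:
        raise ValueError("at least one machine name is required")
    # One clause per machine; all clauses must hold, so they are joined with &&.
    clauses = ['(Machine != "{}")'.format(name) for name in bad_machines]
    return "requirements = " + " && ".join(clauses)


# Placeholder hostnames; substitute the machines named in your hold messages.
print(exclusion_requirements(["eXXX.chtc.wisc.edu", "eYYY.chtc.wisc.edu"]))
```

The resulting line replaces any existing "requirements" line in the submit file; if you already have one, combine the two expressions with "&&" rather than adding a second line.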

We will send an update next week as we continue to resolve these issues. You can also monitor progress on our status page: https://status.chtc.wisc.edu/

In the meantime, we continue to encourage users to transition to containerizing their software and to attend our additional in-person office hours next Wednesday from 10am to 12pm in the Discovery Building.

Best,

The Facilitation team
