[Chtc-users] URGENT FOR HTC USERS: Issue for users of ChtcRun and "squid"


Date: Thu, 7 Aug 2014 16:48:12 -0500
From: Lauren Michael <lmichael@xxxxxxxx>
Subject: [Chtc-users] URGENT FOR HTC USERS: Issue for users of ChtcRun and "squid"
Greetings,

This message pertains only to CHTC's users of our HTCondor pool (via condor submit node), and not to users of the HPC Cluster.

We have just fixed a problem that has affected ANYONE using the ChtcRun package (for setting up batches of jobs with "mkdag"), as well as any users who have otherwise placed large files in "/squid/<username>" so that those files can be delivered to jobs.

In either case, it is likely that a significant number of your jobs have failed because of this issue, which stems from a hardware failure on one of our proxy servers. You will likely need to remove the affected batches of jobs from the queue and re-run them (this applies to any batch with jobs that have been running over the last 3 days). We apologize for any inconvenience; this issue was completely unforeseen by CHTC staff.

If you are using ChtcRun, you can determine the number of job failures from the "mydag.dag.dagman.out" file in any batch's main output directory. (A table of job completions and failures appears near the bottom of this file.)
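
For those who would rather not scroll through the file by hand, the check described above can be sketched in Python. This is only a rough illustration, not a CHTC-provided tool: the function names are hypothetical, and the table layout (a header row containing "Done" and "Failed", a separator row, then a row of integer counts, each possibly preceded by a timestamp) is an assumption about how the file looks. If your file differs, simply read the last lines returned by tail_lines() directly.

```python
from collections import deque


def tail_lines(path, n=60):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def last_status_counts(lines):
    """Return the most recent completion/failure counts as a dict, or None.

    ASSUMED layout (not confirmed by this message): a header row that
    includes the column names "Done" and "Failed", followed somewhere
    below by a row of integers, one per column.
    """
    header_idx = None
    for i, line in enumerate(lines):
        tokens = line.split()
        if "Done" in tokens and "Failed" in tokens:
            header_idx = i  # keep overwriting so the LAST table wins
    if header_idx is None:
        return None

    # Column names start at "Done"; anything before it (e.g. a timestamp)
    # is ignored.
    tokens = lines[header_idx].split()
    names = tokens[tokens.index("Done"):]

    # The counts row is the first later line whose trailing fields are
    # all integers; the separator row is skipped by this test.
    for line in lines[header_idx + 1:]:
        fields = line.split()[-len(names):]
        if len(fields) == len(names) and all(f.isdigit() for f in fields):
            return dict(zip(names, (int(f) for f in fields)))
    return None


def summarize(path, n=60):
    """Read the end of a dagman.out file and return the latest counts."""
    return last_status_counts(tail_lines(path, n))
```

Calling summarize("mydag.dag.dagman.out") from a batch's main output directory would return a dictionary of counts whose "Failed" entry is the number of interest, or None if no such table is found in the last 60 lines.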

If you would like help with removing and re-running your jobs, please send an email to chtc@xxxxxxxxxxx. We are happy to help.

Thank you,
Your CHTC Team



