[Chtc-users] For HTC users: Jobs going on hold due to memory over-use


Date: Mon, 18 Aug 2014 14:21:16 -0500
From: Lauren Michael <lmichael@xxxxxxxx>
Subject: [Chtc-users] For HTC users: Jobs going on hold due to memory over-use
(This message is for users of CHTC's HTCondor pool and other HTC resources. HPC Cluster users can ignore.)

We have recently changed the configuration of the CHTC HTCondor Pool to more effectively enforce jobs' memory requests.

Recently, certain sites in the Open Science Grid (OSG) have blocked the UW-Madison campus because jobs were repeatedly using more memory than they requested. We have therefore removed the custom configuration in the CHTC Pool that was more lenient about memory over-use, in order to help users become more active in determining their jobs' requirements.

Many of you may now (or in the future) see many of your jobs going on hold, most likely because they are using more memory than requested in your submit files (or in process.template, for those using the ChtcRun package).

While we apologize for any inconvenience, it is essential to always be aware of the memory needs of your jobs and to submit them with adequate memory requests. Recent events have shown us just how many users' jobs were over-using memory, often due to a complete lack of testing. In batch computing, it is ESSENTIAL to manage the memory and disk use of your jobs by testing at a small scale (a few jobs) and examining the log files.
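For example, when a job completes, the HTCondor job event log (the file named by the "log" line in your submit file) includes a resource usage summary. The numbers below are purely illustrative, but the table looks roughly like this:

   Partitionable Resources :  Usage  Request Allocated
      Cpus                 :              1         1
      Disk (KB)            :  45000    50000     62500
      Memory (MB)          :   1850     1024      1024

The "Usage" value for Memory should stay below the "Request" value; if it does not (as in this made-up example), increase the memory request before submitting more jobs.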


To Determine Whether Your Jobs Are on Hold Due to Memory Over-use:
command: "condor_q -hold <username>"
(inserting your username into the command)
The output will list your held jobs, and provide a "Hold Reason" for each job.
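For example, with a (hypothetical) username of "bbadger":
command: "condor_q -hold bbadger"
Hold reasons caused by memory over-use will mention that memory usage exceeded the requested amount; the exact wording depends on the pool's configuration.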


What to Do If Your Jobs Use Too Much Memory:

1. Importantly, you should always run several (3-10) test jobs, determine their memory needs from the log files, and then modify your submit files to request at least the highest observed per-job memory BEFORE submitting many jobs. (A sample submit-file line appears after this list.)

2. If your jobs go on hold for using too much memory:

A. Regular HTCondor submission:
You can modify the memory request and then release the jobs so that they'll be re-run (see the worked example after this list):
command to edit memory request: "condor_qedit <descriptor> RequestMemory <new_value>"
command to release jobs after editing: "condor_release <descriptor>"
Insert your username, a job cluster number, or a single job's unique jobID number (cluster.process) for <descriptor>.

B. Users of the ChtcRun package:
The easiest approach is to remove the batch of jobs, modify the memory request in process.template within ChtcRun, and then re-run them.
command to remove jobs: "condor_rm <descriptor>", where <descriptor> may be your username, in order to remove all of your jobs.
-OR-
Replace <descriptor> with the jobID number of the condor_dagman job responsible for a given batch (viewable toward the top left of that batch's mydag.dag.dagman.log file), and only that batch will be removed.
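As a concrete illustration of both cases above, suppose your test jobs showed roughly 1.8 GB of memory use (the cluster number 1234567 and all values here are hypothetical; memory amounts are in MB):

in the submit file, before submitting many jobs:
   request_memory = 2048

if jobs in cluster 1234567 have already gone on hold:
   condor_qedit 1234567 RequestMemory 2048
   condor_release 1234567

Note that the submit-file command is "request_memory", while the job attribute edited by condor_qedit is "RequestMemory"; both are in MB. Similarly, "condor_rm 1234567" would remove only the jobs in that cluster.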


Please let us know if you have any questions.

Best Wishes,
Your CHTC Team

(care of Lauren Michael)