Hi,

On 22.09.21 14:35, Rajagopala Reddy Seelam wrote:
Response to this email: No, I think "dagman" may not help me here. This has to do with "request_cpus=1". HTCondor accepts up to 20 jobs and immediately runs these 20 calculations. As a result, the memory is exhausted and the machine hangs. I am looking into the "hold" possibility, to manually tell the scheduler to hold a job and release it after the earlier job has completed.
I think a partitionable slot will help here as well, as you can simply use

    request_memory = 6G
    request_cpus = 5

and if the machine has 20 cores and 16 GByte of RAM, it would only ever run two of these at the same time, as condor only has 4 GByte and 10 CPU cores left for a new job. A third job asking for 6 GByte would not fit into the remaining 4 GByte.
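To make the arithmetic explicit, here is a small Python sketch of how many identical jobs fit on one machine. It is only an illustration of the reasoning, not how condor itself does the matchmaking; the function names are made up for this example, and it assumes nothing else on the machine consumes CPUs or memory.

```python
def max_concurrent_jobs(total_cpus, total_memory_gb, request_cpus, request_memory_gb):
    """Number of identical jobs that fit on the machine at the same time.

    Hypothetical helper for illustration: the tighter of the two
    resources (CPU or memory) decides how many jobs can run.
    """
    by_cpu = total_cpus // request_cpus
    by_memory = total_memory_gb // request_memory_gb
    return min(by_cpu, by_memory)


def leftover(total_cpus, total_memory_gb, request_cpus, request_memory_gb):
    """Resources (cpus, memory in GB) left once the machine is as full as it gets."""
    n = max_concurrent_jobs(total_cpus, total_memory_gb, request_cpus, request_memory_gb)
    return total_cpus - n * request_cpus, total_memory_gb - n * request_memory_gb


# Machine from the example: 20 cores, 16 GByte RAM.
# Job request: request_cpus = 5, request_memory = 6G.
print(max_concurrent_jobs(20, 16, 5, 6))  # 2
print(leftover(20, 16, 5, 6))             # (10, 4)

# With request_cpus = 1 and a small memory request, CPU count is the
# only limit and all 20 jobs start at once.
print(max_concurrent_jobs(20, 16, 1, 0.5))  # 20
```

In this sketch the memory request is the limiting factor: the CPU request alone would still allow four jobs, so request_memory is the knob that actually keeps the machine from running out of RAM.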
There are many more knobs to try to achieve this, but these would be the ones I would try first.

Cheers
Carsten