Hi Brian,

many thanks for the explanation! The issue should be fixed for good and we are no longer piling up held jobs as before ;)

A short question: to limit the total number of jobs per scheduler, MAX_JOBS_SUBMITTED should be the right knob, i.e. idle+running+held, right? So far I have only looked at MaxJobsRunning/MAX_JOBS_RUNNING at 10000 [1] and would now also like to set an upper limit on all jobs. A quick sketch of what I have in mind is at the bottom of this mail.

Cheers and many thanks,
Thomas

[1]
>>> thisSchedd['MaxJobsRunning']
10000L
http://research.cs.wisc.edu/htcondor/manual/v8.6/3_5Configuration_Macros.html#param:MaxJobsRunning

On 2017-02-15 03:59, Brian Bockelman wrote:
> Hi Thomas,
>
> That makes sense - when the schedd forked a child condor_schedd to answer a query, the child grew rapidly in terms of RAM usage. By default, the parent and child share the same pages in RAM - meaning the cost of the fork is relatively low. *However*, if the parent condor_schedd's queue is rapidly changing, Linux will allocate a lot of new pages.
>
> A large, changing queue causes trouble in two ways:
> - it breaks the sharing between parent and child, AND
> - it causes the child to live longer, as the response to the query takes longer.
>
> Note that you can reduce the number of helpers drastically and force the schedd to respond to queries in-process. It will make the parent process work harder - but it consumes less memory in the worst-case scenario.
>
> Brian
>
>> On Feb 14, 2017, at 7:48 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>>
>> Hi all,
>>
>> we probably found the cause and fixed it (fingers crossed).
>>
>> post mortem ~>
>> https://confluence.desy.de/pages/viewpage.action?pageId=47425023
>>
>> Presumably during a 'transparent' maintenance on the ARC's underlying
>> hypervisor, Condor shadows etc. could not access the local job files.
>> This caused(?) a large number of jobs to be seen as failed by Condor,
>> which sent them to hold.
>> Apparently, Condor was overwhelmed by the large number of held jobs
>> (160,000 jobs in hold, /var/lib/condor/spool.old.20170214/job_queue.log
>> already at ~1.4GB). Plainly removing the held jobs with condor_rm failed
>> accordingly(?), so we moved the spool dir away and gave Condor a
>> fresh start.
>>
>> Since then, the node has been running fine again.
>>
>> Cheers,
>> Thomas
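
PS: the sketch mentioned above, using the v8.6 Python bindings. Collector.query, Schedd.query and htcondor.param do exist in 8.6, but I have not run this against our schedds, and a couple of guesses (the schedd Name being the FQDN, an unset MAX_JOBS_SUBMITTED raising a KeyError) are mine - so please read it as a rough sketch only:

import socket
from collections import Counter

import htcondor

# Full ad of this host's schedd from the collector -- the same kind of
# object as 'thisSchedd' above.  Assumes the schedd Name is the FQDN.
coll = htcondor.Collector()
schedd_ad = coll.query(htcondor.AdTypes.Schedd,
                       'Name == "%s"' % socket.getfqdn())[0]
print("MaxJobsRunning: %d" % schedd_ad["MaxJobsRunning"])

# MAX_JOBS_SUBMITTED from the local config; unset should mean "no limit".
try:
    print("MAX_JOBS_SUBMITTED: %s" % htcondor.param["MAX_JOBS_SUBMITTED"])
except KeyError:
    print("MAX_JOBS_SUBMITTED not set")

# Count jobs per state in the queue: 1=Idle, 2=Running, 5=Held.
schedd = htcondor.Schedd(schedd_ad)
states = Counter(ad["JobStatus"] for ad in schedd.query("true", ["JobStatus"]))
print("idle=%d running=%d held=%d total=%d"
      % (states[1], states[2], states[5], sum(states.values())))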
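
And for the archives, regarding the ~160,000 held jobs that condor_rm would not clear: something like the following - removing by constraint through the bindings, i.e. roughly the equivalent of condor_rm -constraint - might be worth a try next time before moving the spool directory away. Untested at that scale, so again only a sketch:

import htcondor

schedd = htcondor.Schedd()  # local schedd

# Remove everything in the Held state (JobStatus == 5) in one call;
# act() takes a constraint expression much like condor_rm -constraint.
result = schedd.act(htcondor.JobAction.Remove, "JobStatus == 5")
print(result)  # should be a ClassAd summarising how many jobs were hit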