Christoph,

Thanks for the response! From the MasterLog on the scheduler node, condor_startd appears to be constantly dying and restarting:

[15591333768] Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 14248
[15591333774] DefaultReaper unexpectedly called on pid 14248, status 1024.
[15591333774] The STARTD (pid 14248) exited with status 4
[15591333774] restarting /usr/sbin/condor_startd in 3600 seconds
. . .
repeat ad infinitum

Do you think that could be attributed to the scheduler DB as you mention?
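
A side note on the log excerpt: the reaper's "status 1024" and the reported "exited with status 4" look like the same event seen two ways. On Linux, a raw wait status carries the exit code in the high byte, so 1024 decodes to exit code 4 with no terminating signal. The sketch below only illustrates that arithmetic; the helper name `decode_wait_status` is made up for this example and is not part of HTCondor.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw POSIX wait status into exit code and signal (Linux layout)."""
    return {
        "exit_code": (status >> 8) & 0xFF,  # high byte: value passed to exit()
        "signal": status & 0x7F,            # low 7 bits: terminating signal, 0 if none
    }


# Raw status taken from the DefaultReaper line in the log above.
decoded = decode_wait_status(1024)
print(decoded)
```

So the startd seems to be exiting on its own with code 4 rather than being killed by a signal; the StartLog on the same node would be the place to look for the reason.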
Hi Eric,

the 'remove' of 31k jobs comes at a price, I guess. We do see similar things sometimes: when a lot of 'single' jobs change state, e.g. from idle to held or removed, the scheduler becomes kind of unresponsive to other tasks. You might want to put the scheduler DB on an SSD, which makes these operations a lot faster, or split the scheduler load across two different machines. Scripted 'condor_q' requests can be a nuisance too, by the way ;)

Best Christoph