
Re: [Condor-users] STARTD_CRON stops running



On 11/07/2011 11:29 AM, Sarah Williams wrote:
Hello Condor users & experts,

I am using STARTD_CRON to do periodic health checks on worker nodes.
However, on some of the nodes I see that the script is no longer logging
any output, and the nodes do not detect unhealthy states.  Using
condor_config_val -dump, I see the CRON settings are in place:
CRON_JOBLIST = nodecheck
CRON_NODECHECK_EXECUTABLE = /usr/local/sbin/condor_node_check.sh
CRON_NODECHECK_KILL = true
CRON_NODECHECK_MODE = periodic
CRON_NODECHECK_PERIOD = 15m
CRON_NODECHECK_RECONFIG = false
STARTD_CRON_NAME = CRON
The script is world-executable, and the log file is world-writable. My
version of Condor is 7.6.0-1.

I wonder if I am being affected by the following bug.
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2437
Is there any way to expose the current value of CRON_*_SENT?  I see many
instances of this message in StartLog:
StartLog.old:11/01/11 05:08:27 CronJob: Job 'nodecheck' not idle!
Is there a way to reset CRON_*_SENT without killing running jobs?

--Sarah

Sounds like you're hitting GT2437. It looks like upgrading to 7.6.4 will solve it.
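
Until the upgrade is in place, it may help to see which nodes are logging the "not idle" message and how often. Below is a minimal sketch in Python, assuming the StartLog line format matches the one quoted above (date, time, then "CronJob: Job '<name>' not idle!"). The function name count_not_idle is made up for illustration and is not part of Condor.

```python
import re
from collections import Counter

# Matches lines such as:
#   11/01/11 05:08:27 CronJob: Job 'nodecheck' not idle!
# A leading "filename:" prefix (as added by grep) is tolerated because
# the pattern is searched for anywhere in the line.
NOT_IDLE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"CronJob: Job '(?P<job>[^']+)' not idle!"
)


def count_not_idle(lines):
    """Return a Counter mapping cron job name -> number of 'not idle' messages."""
    counts = Counter()
    for line in lines:
        match = NOT_IDLE.search(line)
        if match:
            counts[match.group("job")] += 1
    return counts
```

To use it, open a node's StartLog and pass the file object (or any iterable of lines) to count_not_idle. A count that keeps growing for 'nodecheck' would at least be consistent with the symptom described in the ticket, though it does not by itself confirm the cause.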

Best,


matt