On the Windows platform, I am implementing a SUSPEND policy. If the keyboard has been touched in the last 5 minutes, or if the non-Condor load reaches a high value, I want to SUSPEND the job, then CONTINUE it once the keyboard has been untouched for 5 minutes and the load is below the limit.
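To make the intent concrete, a policy of this kind can be sketched roughly as follows. This is only an illustration and not my exact configuration: the macro names (MINUTE, NonCondorLoadAvg, HighLoad, KeyboardBusy, CPUBusy) follow the style of the example condor_config, and the HighLoad threshold of 0.5 is a placeholder value I am assuming here.

```
# Hypothetical sketch of the intended policy; names and thresholds are assumptions.
MINUTE            = 60
NonCondorLoadAvg  = (LoadAvg - CondorLoadAvg)
HighLoad          = 0.5

# Keyboard touched within the last 5 minutes, or non-Condor load too high.
KeyboardBusy      = (KeyboardIdle < 5 * $(MINUTE))
CPUBusy           = ($(NonCondorLoadAvg) >= $(HighLoad))

WANT_SUSPEND      = True
SUSPEND           = ($(KeyboardBusy) || $(CPUBusy))

# Resume only when the keyboard has been idle for 5 minutes AND the load is low.
CONTINUE          = ((KeyboardIdle > 5 * $(MINUTE)) && ($(NonCondorLoadAvg) < $(HighLoad)))
```

KeyboardIdle, LoadAvg and CondorLoadAvg are machine ClassAd attributes, so both expressions are evaluated against whatever values the startd currently holds for them.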
Unfortunately I have something wrong, and the jobs alternate between SUSPEND and CONTINUE every 5 seconds:
07/12/11 16:32:21 slot1: Sent update to 1 collector(s)
07/12/11 16:32:22 slot1: State change: SUSPEND is TRUE
07/12/11 16:32:22 slot1: Changing activity: Busy -> Suspended
07/12/11 16:32:22 slot1: In Starter::kill() with pid 5372, sig 100 (DC_SIGSUSPEND)
07/12/11 16:32:23 slot1: Received job ClassAd update from starter.
07/12/11 16:32:26 Trying to update collector <123.456.78.910:9618>
07/12/11 16:32:26 Attempting to send update via UDP to collector delta-mod.water.ca.gov <123.456.78.910:9618>
07/12/11 16:32:26 slot1: Sent update to 1 collector(s)
07/12/11 16:32:27 slot1: State change: CONTINUE is TRUE
07/12/11 16:32:27 slot1: In Starter::kill() with pid 5372, sig 101 (DC_SIGCONTINUE)
07/12/11 16:32:27 slot1: Changing activity: Suspended -> Busy
07/12/11 16:32:27 slot1: Received job ClassAd update from starter.
Attempting to debug this, I set
STARTD_DEBUG = D_FULLDEBUG
While this does give more information (see above), it doesn't state why Condor decides to SUSPEND or CONTINUE a job, and that is the piece of information I need in order to see what is wrong with my condition expressions. What can I do to see why Condor is changing the state of a job?
Ralph Finch Calif. Dept. of Water Resources Sacramento, CA USA