Hi,

it seems it doesn't matter which value I give:

UWCS_MaxJobRetirementTime = 0
MaxJobRetirementTime = 0

or

UWCS_MaxJobRetirementTime = 9999999
MaxJobRetirementTime = 9999999

or

UWCS_MaxJobRetirementTime = -1
MaxJobRetirementTime = -1

in the condor_config.local file of the node; the job was killed within a
second, as before.

What does help is setting MaxJobRetirementTime = 9999999 in the submit
description file; at least the job is still running now.

Where do I have to set this value if I want to avoid every user having to
put it in the submit description file?

Thanks
Harald
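For reference, this is the kind of entry I am trying in the node's
condor_config.local (the value 9999999 is just an example, and I have not
confirmed yet that this placement is correct):

```
# condor_config.local on the execute node -- example value, untested:
# let a retiring claim keep running its current job for up to this many
# seconds (counted from job start) before the slot is vacated
MaxJobRetirementTime = 9999999
```

After editing the file, the running condor_startd has to pick up the change
(e.g. via condor_reconfig on the node) before the new value appears in the
output of condor_config_val -startd -dump maxjobretirementtime.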
On Thursday 18 August 2016 22:19:21 Harald van Pee wrote:
> Hi,
>
> Logged into the execute node that kicked off the job, I tried
>
> condor_config_val -startd -dump maxjobretirementtime
>
> result:
>
> # Parameters with names that match maxjobretirementtime:
> UWCS_MaxJobRetirementTime = 0
> # Contributing configuration file(s):
> # <Default>
>
> Could this cause the problem? It must be the default, because it is not in
> the configuration files and not in the submit description files; it also
> happens for my jobs, and I do not give any MaxJobRetirementTime.
>
> What value would be best? We never want to stop any regular job:
> UWCS_MaxJobRetirementTime = -1, or just a big number?
>
> Many thanks
> Harald
>
> On Thursday 18 August 2016 21:43:23 Todd Tannenbaum wrote:
> > On 8/18/2016 12:50 PM, Harald van Pee wrote:
> > > Hi,
> > >
> > > Here is just more information; one can see that everything happens
> > > within a second and all jobs are gone (or restarted on another node).
> >
> > [snip]
> >
> > > 08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring
> > > 08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired
> >
> > ^^^ This line is the smoking gun. It is saying that either the job or
> > the slot has defined a MaxJobRetirementTime, and that the job has
> > already been running for longer than this defined period, so the startd
> > immediately leaves the Claimed/Retiring state and goes to
> > Preempting/Vacating (which sends a SIGTERM to the job).
> >
> > For example, I get the exact same results as you when I do a condor_off
> > -peaceful, with the exact same messages in the StartLog, if I submit a
> > job that looks like the following:
> >
> > Executable = /bin/sleep
> > Arguments = 60000
> > # Allow the job to be preempted if HTCondor wants
> > # to shut down or run a higher-priority job if and
> > # only if this job has already run for more than
> > # one second.
> > MaxJobRetirementTime = 1
> > Queue
> >
> > In your tests, are you positive that neither the job(s) being preempted
> > nor the execute node where the condor_startd is running define
> > MaxJobRetirementTime? Because it really looks that way to me. To check
> > the job, use condor_q (or condor_history if the job left the queue) and
> > pass the "-af MaxJobRetirementTime" command-line arg. To check the
> > condor_startd, if you are logged into the execute node that kicked off
> > the job, try
> >
> > condor_config_val -startd -dump maxjobretirementtime
> >
> > regards,
> > Todd
> >
> > > 08/18/16 19:43:03 slot1_1: Changing state and activity:
> > >   Claimed/Retiring -> Preempting/Vacating
> > > 08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from
> > >   host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level
> > >   DAEMON: reason: cached result for DAEMON; see first case for the full
> > >   reason
> > > 08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting
> > >   state, ignoring.
> > > 08/18/16 19:43:03 Starter pid 6873 exited with status 0
> > > 08/18/16 19:43:03 slot1_1: State change: starter exited
> > > 08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning
> > >   to owner
> > > 08/18/16 19:43:03 slot1_1: Changing state and activity:
> > >   Preempting/Vacating -> Owner/Idle
> > > 08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false
> > > 08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed
> > > 08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete
> > > 08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting
> > > 08/18/16 19:43:03 Deleting cron job manager
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs
> > > 08/18/16 19:43:03 Deleting benchmark job mgr
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 Killing job mips
> > > 08/18/16 19:43:03 Killing job kflops
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 Killing job mips
> > > 08/18/16 19:43:03 Killing job kflops
> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs
> > > 08/18/16 19:43:03 CronJobList: Deleting job 'mips'
> > > 08/18/16 19:43:03 CronJob: Deleting job 'mips'
> > >   (/usr/lib/condor/libexec/condor_mips), timer -1
> > > 08/18/16 19:43:03 CronJobList: Deleting job 'kflops'
> > > 08/18/16 19:43:03 CronJob: Deleting job 'kflops'
> > >   (/usr/lib/condor/libexec/condor_kflops), timer -1
> > > 08/18/16 19:43:03 Cron: Killing all jobs
> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs
> > > 08/18/16 19:43:03 All resources are free, exiting.
> > > 08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING
> > >   WITH STATUS 0
> > >
> > > On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:
> > > > @Bob: I also give the command from the central manager.
> > > >
> > > > @Todd:
> > > > I have no MaxJobRetirementTime defined (nothing with "retire" or
> > > > "time" found in condor_config*; not on the node, the scheduler, or
> > > > the central manager).
> > > >
> > > > condor_status | grep node
> > > > slot1@node   LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04
> > > > slot1_1@node LINUX X86_64 Claimed   Busy 0.000  1024 0+00:00:03
> > > >
> > > > after
> > > >
> > > > condor_off -peaceful -daemon startd node
> > > >
> > > > condor_status shows no node anymore (within 1 second, as fast as I
> > > > can type).
> > > >
> > > > We use
> > > >
> > > > CLAIM_WORKLIFE = 120
> > > >
> > > > and
> > > >
> > > > STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
> > > > NUM_SLOTS = 1
> > > > SLOT_TYPE_1 = 100%
> > > > SLOT_TYPE_1_PARTITIONABLE = true
> > > > NUM_SLOTS_TYPE_1 = 1
> > > >
> > > > Any help is welcome.
> > > >
> > > > Harald
> > > >
> > > > On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:
> > > > > As another data point, it also seemed to work for me running a
> > > > > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.
> > > > >
> > > > > Behold the simple test below; note the node went from Claimed/Busy
> > > > > to Claimed/Retiring, which is expected.
> > > > > The "Retiring" activity is defined in the Manual (from
> > > > > https://is.gd/mi7mVk ):
> > > > >
> > > > > Retiring
> > > > > When an active claim is about to be preempted for any reason, it
> > > > > enters retirement, while it waits for the current job to finish.
> > > > > The MaxJobRetirementTime expression determines how long to wait
> > > > > (counting since the time the job started). Once the job finishes
> > > > > or the retirement time expires, the Preempting state is entered.
> > > > >
> > > > > Perhaps you have a MaxJobRetirementTime defined?
> > > > >
> > > > > regards,
> > > > > Todd
> > > > >
> > > > > [tannenba@localhost test]$ condor_status
> > > > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> > > > > slot1@localhost LINUX X86_64 Claimed   Busy     0.000  330 0+00:00:04
> > > > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:05
> > > > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> > > > >
> > > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > > >        Total     3     0       1         2       0          0        0     0
> > > > >
> > > > > [tannenba@localhost test]$ condor_off -peaceful -daemon startd
> > > > > Sent "Set-Peaceful-Shutdown" command to local startd
> > > > > Sent "Kill-Daemon-Peacefully" command to local master
> > > > >
> > > > > [tannenba@localhost test]$ condor_status
> > > > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> > > > > slot1@localhost LINUX X86_64 Claimed   Retiring 0.000  330 0+00:00:03
> > > > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:02:49
> > > > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> > > > >
> > > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > > >        Total     3     0       1         2       0          0        0     0
> > > > >
> > > > > On 8/18/2016 11:11 AM, Bob Ball wrote:
> > > > > > Just as a data point, I do, from our central manager machine,
> > > > > >
> > > > > > condor_off -peaceful -daemon startd -name $publicName
> > > > > >
> > > > > > and it runs just fine. All our jobs are vanilla. HTCondor is
> > > > > > version 8.4.6 on Scientific Linux.
> > > > > >
> > > > > > bob
> > > > > >
> > > > > > On 8/18/2016 11:54 AM, Harald van Pee wrote:
> > > > > >> Hi,
> > > > > >>
> > > > > >> I want to set a job-running node offline, but only after all
> > > > > >> running jobs have finished. Of course, until then no new jobs
> > > > > >> should be accepted on that node.
> > > > > >> I tried the command:
> > > > > >>
> > > > > >> condor_off -peaceful -daemon startd node
> > > > > >>
> > > > > >> and got the message:
> > > > > >>
> > > > > >> Sent "Set-Peaceful-Shutdown" command to startd node
> > > > > >> Sent "Kill-Daemon-Peacefully" command to master node
> > > > > >>
> > > > > >> On the node I see in the StartLog:
> > > > > >>
> > > > > >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.
> > > > > >> 08/18/16 17:20:49 shutdown graceful
> > > > > >>
> > > > > >> And indeed all jobs running in the vanilla universe (we have no
> > > > > >> others) are killed directly and started from the beginning. This
> > > > > >> is what a graceful shutdown will do with vanilla jobs. But I
> > > > > >> want to have a peaceful shutdown.
> > > > > >>
> > > > > >> Is a peaceful shutdown not possible for vanilla jobs?
> > > > > >>
> > > > > >> Do I have to change the configuration? We use:
> > > > > >>
> > > > > >> PREEMPT = FALSE
> > > > > >> PREEMPTION_REQUIREMENTS = False
> > > > > >> KILL = FALSE
> > > > > >> WANT_SUSPEND = false
> > > > > >> WANT_VACATE = false
> > > > > >>
> > > > > >> Or can I just use a different command?
> > > > > >>
> > > > > >> We use condor 8.4.8 on debian 8 (AMD64).
> > > > > >> Thanks
> > > > > >> Harald
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> HTCondor-users mailing list
> > > > > >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> > > > > >> with a subject: Unsubscribe
> > > > > >> You can also unsubscribe by visiting
> > > > > >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > > > > >>
> > > > > >> The archives can be found at:
> > > > > >> https://lists.cs.wisc.edu/archive/htcondor-users/
> > >
> > > --
> > > Harald van Pee
> > > Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
> > > Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
> > > mail: pee@xxxxxxxxxxxxxxxxx
--
Harald van Pee
Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505
mail: pee@xxxxxxxxxxxxxxxxx