
Re: [HTCondor-users] does condor_off -peaceful -daemon startd node; works for vanilla jobs?



Hi,

 

it seems it doesn't matter which value I give:

UWCS_MaxJobRetirementTime = 0

MaxJobRetirementTime = 0

 

or

UWCS_MaxJobRetirementTime = 9999999

MaxJobRetirementTime = 9999999

or

UWCS_MaxJobRetirementTime = -1

MaxJobRetirementTime = -1

 

in the condor_config.local file of the node,

the job was killed within a second as before.
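
To confirm this from the StartLog on the node, one can look for the pattern Todd points out in the quoted log further down: a slot that enters Retiring and reports "claim retirement ended/expired" in the same second. Below is a rough Python sketch, assuming the log lines have the layout shown in that excerpt. The function name find_instant_retirements and the SAMPLE data are made up for illustration and are not part of HTCondor.

```python
import re

# Layout assumed from the StartLog excerpt quoted in this thread, e.g.
#   "08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring"
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (slot[^\s:]+): (.*)$")


def find_instant_retirements(lines):
    """Return (timestamp, slot) pairs where a slot entered Retiring and its
    retirement expired within the same logged second."""
    retiring_at = {}  # slot name -> timestamp at which it entered Retiring
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a per-slot line (e.g. starter or cron messages)
        stamp, slot, message = match.groups()
        if message.endswith("-> Retiring"):
            retiring_at[slot] = stamp
        elif "claim retirement ended/expired" in message:
            if retiring_at.get(slot) == stamp:
                hits.append((stamp, slot))
    return hits


# Lines copied from the quoted StartLog excerpt (quote markers removed).
SAMPLE = [
    "08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring",
    "08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired",
    "08/18/16 19:43:03 Starter pid 6873 exited with status 0",
]

if __name__ == "__main__":
    for stamp, slot in find_instant_retirements(SAMPLE):
        print(f"{slot}: retirement expired immediately at {stamp}")
```

Any iterable of lines works as input, so an open file handle for the node's StartLog can be passed in directly.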

 

What helps is if I give

MaxJobRetirementTime = 9999999

in the submit description file; at least the job is still running now.
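
This fits the rule from the manual quoted further down: retirement time is counted from the start of the job, and a job that has already run longer than its MaxJobRetirementTime leaves Claimed/Retiring immediately. Below is a toy Python model of just that comparison, based only on the explanation in this thread. It is not HTCondor code, and the names Claim and next_state are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Toy stand-in for a claimed slot; not an HTCondor data structure."""
    job_runtime_s: int              # seconds since the job started
    max_job_retirement_time_s: int  # effective MaxJobRetirementTime, in seconds


def next_state(claim: Claim) -> str:
    """Return the state/activity the slot moves to once preemption is requested.

    Assumption taken from the thread: the limit is measured from job start,
    so a job that has already run longer than the limit is vacated at once.
    """
    if claim.job_runtime_s > claim.max_job_retirement_time_s:
        return "Preempting/Vacating"
    return "Claimed/Retiring"


if __name__ == "__main__":
    # Limit of 0: any job that has run at all is already past its limit.
    print(next_state(Claim(job_runtime_s=3, max_job_retirement_time_s=0)))
    # Very large limit: the job stays in retirement and keeps running.
    print(next_state(Claim(job_runtime_s=3, max_job_retirement_time_s=9999999)))
```

With a limit of 0 the model goes straight to Preempting/Vacating, which is what the StartLog shows. With 9999999 it stays in Claimed/Retiring.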

 

Where do I have to set this value so that users do not each have to specify it in their submit description files?

 

Thanks

Harald

 

 

On Thursday 18 August 2016 22:19:21 Harald van Pee wrote:

> Hi,

>

> logged into the execute node that kicked

> off the job, I tried

> condor_config_val -startd -dump maxjobretirementtime

>

> result:

> # Parameters with names that match maxjobretirementtime:

> UWCS_MaxJobRetirementTime = 0

> # Contributing configuration file(s):

> # <Default>

>

> Could this cause the problem? It must be the default, because it is not in
> the configuration files, and it cannot come from the submit description
> files either: it also happens for my own jobs, and I do not set any
> MaxJobRetirementTime.

>

> What value would be best? We never want to stop any regular job:

>

> UWCS_MaxJobRetirementTime = -1 or just a big number?

>

> Many thanks

> Harald

>

> On Thursday 18 August 2016 21:43:23 Todd Tannenbaum wrote:

> > On 8/18/2016 12:50 PM, Harald van Pee wrote:

> > > Hi,

> > >

> > > here just more information, one can see all happens within a second and

> > > all jobs are gone (or restarted on another node).

> >

> > [snip]

> >

> > > 08/18/16 19:43:03 slot1_1: Changing activity: Busy -> Retiring

> > >

> > > 08/18/16 19:43:03 slot1_1: State change: claim retirement ended/expired

> >

> > ^^^ This line is the smoking gun. This is saying that either the job or

> > the slot has defined a MaxJobRetirementTime, and that the job has

> > already been running for longer than this defined period, so the startd

> > immediately leaves Claimed/Retiring state and goes to

> > Preempting/Vacating (which sends a SIGTERM to the job).

> >

> > For example, I get the exact same results as you when I do a condor_off

> > -peaceful, with the exact same messages in the StartLog, if I submit a

> >

> > job that looks like the following:

> > Executable = /bin/sleep

> > Arguments = 60000

> > # Allow the job to be preempted if HTCondor wants

> > # to shutdown or run a higher priority job if and

> > # only if this job has already run for more than

> > # one second.

> > MaxJobRetirementTime = 1

> > Queue

> >

> > In your tests, are you positive that neither the job(s) being preempted

> > nor the execute node where the condor_startd is running define

> > MaxJobRetirementTime ? Because it really looks that way to me. To

> > check the job, use condor_q (or condor_history if the job left the

> > queue) and pass "-af MaxJobRetirementTime" command-line arg. To check

> > the condor_startd, if you are logged into the execute node that kicked

> > off the job, try

> >

> > condor_config_val -startd -dump maxjobretirementtime

> >

> > regards,

> > Todd

> >

> > > 08/18/16 19:43:03 slot1_1: Changing state and activity:

> > > Claimed/Retiring -> Preempting/Vacating

> > >

> > > 08/18/16 19:43:03 PERMISSION DENIED to submit-side@matchsession from

> > > host 192.168.xxx.xxx for command 403 (DEACTIVATE_CLAIM), access level

> > > DAEMON: reason: cached result for DAEMON; see first case for the full

> > > reason

> > >

> > > 08/18/16 19:43:03 slot1_1: Got DEACTIVATE_CLAIM while in Preempting

> > > state, ignoring.

> > >

> > > 08/18/16 19:43:03 Starter pid 6873 exited with status 0

> > >

> > > 08/18/16 19:43:03 slot1_1: State change: starter exited

> > >

> > > 08/18/16 19:43:03 slot1_1: State change: No preempting claim, returning

> > > to owner

> > >

> > > 08/18/16 19:43:03 slot1_1: Changing state and activity:

> > > Preempting/Vacating -> Owner/Idle

> > >

> > > 08/18/16 19:43:03 slot1_1: State change: IS_OWNER is false

> > >

> > > 08/18/16 19:43:03 slot1_1: Changing state: Owner -> Unclaimed

> > >

> > > 08/18/16 19:43:03 slot1_1: Changing state: Unclaimed -> Delete

> > >

> > > 08/18/16 19:43:03 slot1_1: Resource no longer needed, deleting

> > >

> > > 08/18/16 19:43:03 Deleting cron job manager

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> > >

> > > 08/18/16 19:43:03 Deleting benchmark job mgr

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 Killing job mips

> > >

> > > 08/18/16 19:43:03 Killing job kflops

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 Killing job mips

> > >

> > > 08/18/16 19:43:03 Killing job kflops

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting job 'mips'

> > >

> > > 08/18/16 19:43:03 CronJob: Deleting job 'mips'

> > > (/usr/lib/condor/libexec/condor_mips), timer -1

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting job 'kflops'

> > >

> > > 08/18/16 19:43:03 CronJob: Deleting job 'kflops'

> > > (/usr/lib/condor/libexec/condor_kflops), timer -1

> > >

> > > 08/18/16 19:43:03 Cron: Killing all jobs

> > >

> > > 08/18/16 19:43:03 CronJobList: Deleting all jobs

> > >

> > > 08/18/16 19:43:03 All resources are free, exiting.

> > >

> > > 08/18/16 19:43:03 **** condor_startd (condor_STARTD) pid 6818 EXITING

> > > WITH STATUS 0

> > >

> > > On Thursday 18 August 2016 19:12:55 Harald van Pee wrote:

> > > > @Bob: I also give the command from the central manager.

> > > >

> > > >

> > > >

> > > > @Todd:

> > > >

> > > > I have no MaxJobRetirementTime defined (nothing with "retire" or "time"
> > > > found in condor_config*, neither on the node, nor on the scheduler or
> > > > the central manager).

> > > >

> > > >

> > > >

> > > > condor_status| grep node

> > > >

> > > > slot1@node LINUX X86_64 Unclaimed Idle 0.230 63507 0+00:00:04

> > > >

> > > > slot1_1@node LINUX X86_64 Claimed Busy 0.000 1024 0+00:00:03

> > > >

> > > >

> > > >

> > > > after

> > > >

> > > > condor_off -peaceful -daemon startd node

> > > >

> > > > condor_status shows no node anymore (within 1 second, as fast as I can
> > > > type).

> > > >

> > > >

> > > >

> > > > We use

> > > >

> > > >

> > > >

> > > > CLAIM_WORKLIFE = 120

> > > >

> > > > and

> > > >

> > > > STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

> > > >

> > > >

> > > >

> > > > NUM_SLOTS = 1

> > > >

> > > > SLOT_TYPE_1 = 100%

> > > >

> > > > SLOT_TYPE_1_PARTITIONABLE = true

> > > >

> > > > NUM_SLOTS_TYPE_1 = 1

> > > >

> > > >

> > > >

> > > > Any help is welcome.

> > > >

> > > >

> > > >

> > > > Harald

> > > >

> > > > On Thursday 18 August 2016 18:29:04 Todd Tannenbaum wrote:

> > > > > > As another data point, it also seemed to work for me running a
> > > > > > pre-release of HTCondor v8.5.7 on Scientific Linux 6.8.
> > > > > >
> > > > > > Behold the simple test below; note the node went from Claimed/Busy
> > > > > > to Claimed/Retiring, which is expected. "Retiring" activity is
> > > > > > defined in the Manual (from https://is.gd/mi7mVk ):

> > > > >

> > > > > Retiring

> > > > >

> > > > >

> > > > >

> > > > > > When an active claim is about to be preempted for any reason, it
> > > > > > enters retirement, while it waits for the current job to finish.
> > > > > > The MaxJobRetirementTime expression determines how long to wait
> > > > > > (counting since the time the job started). Once the job finishes
> > > > > > or the retirement time expires, the Preempting state is entered.

> > > > >

> > > > >

> > > > >

> > > > > Perhaps you have a MaxJobRetirementTime defined ?

> > > > >

> > > > >

> > > > >

> > > > > regards,

> > > > >

> > > > > Todd

> > > > >

> > > > >

> > > > >

> > > > > > [tannenba@localhost test]$ condor_status
> > > > > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> > > > > >
> > > > > > slot1@localhost LINUX X86_64 Claimed   Busy     0.000  330 0+00:00:04
> > > > > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:05
> > > > > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> > > > > >
> > > > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > > >
> > > > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > > > >        Total     3     0       1         2       0          0        0     0

> > > > >

> > > > >

> > > > >

> > > > > [tannenba@localhost test]$ condor_off -peaceful -daemon startd

> > > > >

> > > > > Sent "Set-Peaceful-Shutdown" command to local startd

> > > > >

> > > > > Sent "Kill-Daemon-Peacefully" command to local master

> > > > >

> > > > >

> > > > >

> > > > > > [tannenba@localhost test]$ condor_status
> > > > > > Name            OpSys Arch   State     Activity LoadAv Mem ActvtyTime
> > > > > >
> > > > > > slot1@localhost LINUX X86_64 Claimed   Retiring 0.000  330 0+00:00:03
> > > > > > slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:02:49
> > > > > > slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  330 0+00:00:06
> > > > > >
> > > > > >              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
> > > > > >
> > > > > > X86_64/LINUX     3     0       1         2       0          0        0     0
> > > > > >        Total     3     0       1         2       0          0        0     0

> > > > >

> > > > > On 8/18/2016 11:11 AM, Bob Ball wrote:

> > > > > > Just as a data point, I do, from our central manager machine,

> > > > > >

> > > > > > condor_off -peaceful -daemon startd -name $publicName

> > > > > >

> > > > > > > and it runs just fine. All our jobs are vanilla. HTCondor is version
> > > > > > > 8.4.6 on Scientific Linux.

> > > > > >

> > > > > >

> > > > > >

> > > > > > bob

> > > > > >

> > > > > > On 8/18/2016 11:54 AM, Harald van Pee wrote:

> > > > > >> Hi,

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > > >> I want to take a node that is running jobs offline, but only after
> > > > > > >> all running jobs have finished. Of course, until then no new jobs
> > > > > > >> should be accepted on that node.

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> I tried the command:

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> condor_off -peaceful -daemon startd node

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> and got the message:

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Sent "Set-Peaceful-Shutdown" command to startd node

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Sent "Kill-Daemon-Peacefully" command to master node

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> On node I see in StartLog

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> 08/18/16 17:20:49 Got SIGTERM. Performing graceful shutdown.

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> 08/18/16 17:20:49 shutdown graceful

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > > >> And indeed all jobs running in the vanilla universe (we have no
> > > > > > >> others) are killed directly and restarted from the beginning. This
> > > > > > >> is what a graceful shutdown will do with vanilla jobs. But I want
> > > > > > >> to have a peaceful shutdown.

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Is a peaceful shutdown not possible for vanilla jobs?

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Do I have to change the configuration? We use:

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> PREEMPT = FALSE

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> PREEMPTION_REQUIREMENTS = False

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> KILL = FALSE

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> WANT_SUSPEND = false

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> WANT_VACATE = false

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Or can I use just a different command?

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> We use condor 8.4.8 on debian 8 (AMD64).

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Thanks

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> Harald

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> _______________________________________________

> > > > > >>

> > > > > >> HTCondor-users mailing list

> > > > > >>

> > > > > > >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx

> > > > > >>

> > > > > >> with a subject: Unsubscribe

> > > > > >>

> > > > > >> You can also unsubscribe by visiting

> > > > > >>

> > > > > >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

> > > > > >>

> > > > > >>

> > > > > >>

> > > > > >> The archives can be found at:

> > > > > >>

> > > > > >> https://lists.cs.wisc.edu/archive/htcondor-users/

> > > > > >


> > >

> > > --

> > >

> > > Harald van Pee

> > >

> > > Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn

> > >

> > > Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505

> > >

> > > mail: pee@xxxxxxxxxxxxxxxxx

> > >

> > >

> > >


 

--

Harald van Pee

 

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn

Nussallee 14-16 - 53115 Bonn - Tel +49-228-732213 - Fax +49-228-732505

mail: pee@xxxxxxxxxxxxxxxxx