
Re: [HTCondor-users] Stop Vanilla jobs from eviction/restart



Hi Todd, thank you again.  

Did you remember to do a condor_reconfig -all (from a trusted machine, aka your central manager) when making the config file edits? condor_config_val -dump is just reading from the config file; if the file has been edited more recently than a reconfig...

All of these settings have been in place since the pool's inception, and I have initiated a reconfig several times since then, although I think there could be other reasons, as you point out below.
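To double-check that the running daemons actually picked up the edits (a sketch; the host name is a placeholder), the running startd can be queried directly instead of re-reading the file:

   # Ask the running startd on an execute node what it currently believes,
   # rather than what is in the file on disk (host name is illustrative):
   condor_config_val -startd -name exec-node-01 PREEMPT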
 

Also, do you have the same config file setup on all nodes, or do you have a different config file on your CM -vs- your execute nodes?


We have a global config file shared across all nodes. The local configs on the execute hosts, submit hosts, and CM are very similar, except that the local configs on the submit hosts and the CM list more daemons than those on the execute hosts, and the submit hosts additionally carry grouping information that the execute hosts do not.
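For reference, the wiring for that kind of layout usually looks something like this (a sketch only; the path is illustrative):

   # In the shared global condor_config, point each host at its local file:
   LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
   # Each host's condor_config.local then adds its own DAEMON_LIST,
   # grouping settings, etc.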

Could the job restarts have been from before you made the config changes?

Could the job restarts be a result of something outside of HTCondor's control, such as reboot of an execute node or restart of the HTCondor service?

Could the job restarts be a result of the jobs going on hold (for some error reason like NFS server temporarily being down) and then released?

If this is happening, how can I capture it in the job or as an admin?
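One way to capture it (a sketch; the job id 1234.0 is a placeholder, and attribute availability can vary a little by version) is to give every job a user log, which records evictions, holds, and releases with timestamps, and then to query the job ClassAd for the restart and hold bookkeeping:

   # In the submit description file: request a per-job user log
   log = myjob.$(Cluster).$(Process).log

   # For a job still in the queue: how many times has it started,
   # and why is/was it held?
   condor_q -format "starts=%d " NumJobStarts -format "hold=%s\n" HoldReason 1234.0

   # For a completed job, the same attribute via the history file:
   condor_history -format "starts=%d\n" NumJobStarts 1234.0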
 

What version of HTCondor are you running?

v7.6.x
 

If you are running v7.8 or earlier and you never want to interrupt a running job, make certain in your central manager's condor_config you have:
   PREEMPTION_REQUIREMENTS = False
and in all of your execute nodes' condor_config you have:
   PREEMPT = False
   KILL = False
   RANK = 0

This is the one thing I am missing on the execute hosts; I do not have these macros defined. I will include them, reconfig, and see how it goes. Thank you!
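For what it's worth, the change plus a quick verification might look like this (a sketch; the local config path and host name are placeholders):

   # Appended to each execute node's local config:
   PREEMPT = False
   KILL = False
   RANK = 0

   # Then, from the central manager, push the change and spot-check a node:
   condor_reconfig -all
   condor_config_val -startd -name exec-node-01 PREEMPT KILL RANK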
 

If you are running HTCondor v8.0+ and you never want to interrupt a running job, life can be simpler - I would suggest making certain all your execute nodes' condor_config files have something like
  MAXJOBRETIREMENTTIME = 172800
which specifies how many seconds a job can run uninterrupted (172800 is 2 days; set it to whatever suits you).
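As a sanity check on the arithmetic, and a sketch of the setting itself:

   # 2 days x 24 hours x 60 minutes x 60 seconds = 172800 seconds
   MAXJOBRETIREMENTTIME = 172800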

In the current developer series, we are adding information to the startd ClassAd about how many times a job was interrupted by HTCondor - this will make it easier to confirm that the system is indeed doing what you think you are telling it :).


Looking forward to this once we upgrade.

best!