Mailing List Archives
Re: [HTCondor-users] update of condor-version and job-behaviour
- Date: Mon, 30 Aug 2021 22:43:02 +0200 (CEST)
- From: Martin Flemming <martin.flemming@xxxxxxx>
- Subject: Re: [HTCondor-users] update of condor-version and job-behaviour
Hi Greg!
Thanks for the clarification ...
Indeed, I would like to avoid restarting the jobs, i.e. let the jobs on
each worker node run properly to completion when I upgrade the cluster to
a newer version of condor ...
So, if I understand you correctly, I have to disable several batches of
nodes before I start the update in order to carry it out as
inconspicuously as possible ... in other words, every condor package update
includes an automatic restart of the daemons and therefore also of the
running jobs on the worker nodes?
... by the way, on the master or schedds we use this workflow:
1) condor_off -master -fast
2) Upgrade the binaries
3) restart the master
ALL WITHIN 20 MINUTES.
> If you are more concerned about the badput from restarting a running job
> than the potential loss of throughput from keeping cores idle, you can run
> "condor_off -peaceful" on the worker node before your upgrade, and condor
> will wait until all the jobs exit before it itself exits, at which time you
> could upgrade the machine.
I didn't know the command
condor_off -peaceful
In general we disable worker nodes with
condor_config_val -startd -name bird055.desy.de -set "StartJobs = false"
condor_reconfig -startd -name bird055.desy.de
condor_drain -graceful bird055.desy.de
Is this equivalent?
All in all .. the workflow should be
- disable the workernode
- wait until all jobs are finished
- update
- enable the workernode again
- ?
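The workflow above could be sketched as a small script run on the worker node itself. This is only a sketch under assumptions: the condor_off -peaceful and condor_on commands are the standard HTCondor tools, but the run and drain_and_upgrade helpers, the DRY_RUN switch, the polling interval and the UPGRADE_CMD value are hypothetical and not taken from this thread. Polling condor_status for the node's slots is just one possible way to notice that the startd has exited.

```shell
#!/bin/sh
# Sketch: drain one worker node, upgrade it, enable it again.
# Hypothetical settings (adapt to your site):
DRY_RUN="${DRY_RUN:-1}"                              # 1 = only print the commands
UPGRADE_CMD="${UPGRADE_CMD:-yum -y update condor}"   # assumed package command
POLL_SECONDS="${POLL_SECONDS:-300}"                  # assumed polling interval
NODE="${NODE:-$(hostname)}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

drain_and_upgrade() {
    # 1) Disable the node: running jobs finish, no new jobs start.
    run condor_off -peaceful

    # 2) Wait until the node advertises no slots any more
    #    (skipped entirely in dry-run mode).
    while [ "$DRY_RUN" = "0" ] && [ -n "$(condor_status -af Name "$NODE")" ]; do
        sleep "$POLL_SECONDS"
    done

    # 3) Update the packages.
    run sh -c "$UPGRADE_CMD"

    # 4) Enable the node again (may be redundant if the package
    #    update already restarts the service).
    run condor_on
}

drain_and_upgrade
```

With the default DRY_RUN=1 the script only prints the three commands it would run; setting DRY_RUN=0 executes them.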
thanks & cheers,
Martin
On Mon, 30 Aug 2021, Greg Thain wrote:
Hi Martin:
When HTCondor is upgraded *on the worker node*, or, more generally, when the
HTCondor worker node daemons restart for any reason:
Any running jobs are killed, will go back to the "I"dle state in the queue,
and HTCondor will restart them, perhaps on another machine.
If you are more concerned about the badput from restarting a running job
than the potential loss of throughput from keeping cores idle, you can run
"condor_off -peaceful" on the worker node before your upgrade, and condor
will wait until all the jobs exit before it itself exits, at which time you
could upgrade the machine.
And just for completeness, upgrading the central manager will not evict
jobs. Upgrading the access point (where the schedd runs) will not evict
jobs, if the new daemons restart quickly enough.
-greg
Hi!
What is the default behaviour of running jobs on a worker node on which
the condor packages are updated ...?
a) the running jobs keep running under the old version, and every job
started after the package update uses the newly installed
condor version?
b) the running jobs are cancelled after the update and are
re-scheduled with the new version?
c) the running jobs are cancelled and lost
d) ....
cheers & thanks,
Martin
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
Regards
Martin Flemming
______________________________________________________
Martin Flemming
DESY / IT office : Building 2b / 008a
Notkestr. 85 phone : 040 - 8998 - 4667
22603 Hamburg mail : martin.flemming@xxxxxxx
______________________________________________________