There has been no change in the shutdown logic as far as I know.
It sounds like you are killing the HTCondor daemons rather than actually shutting them down. If you hard-kill the daemons (or terminate the VM without shutting it down), the HTCondor daemons never get a chance to send a DELETE_AD notification to the collector, so their ClassAds remain visible until they expire. That sure sounds like what is happening here.

To shut the daemons down cleanly, you can use condor_off -master, or you can send the condor_master a SIGTERM signal, which has the same effect. Then you have to give the condor_startd and condor_master time to shut down cleanly. I believe this can take as much as 2 minutes if the condor_startd is running a job and the job doesn't respond to SIGTERM (or doesn't respond quickly).

If the daemons shut down cleanly, you should see that reflected in the MasterLog and StartLog. I would look there first. You should also see the DELETE_AD notifications in the collector.
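
If you want to script that log check, here is a minimal Python sketch. It assumes the exit line has the same shape as the sample StartLog line quoted just below; the regular expression and the function names are my own illustration, not part of HTCondor, and other versions or platforms may format the line differently.

```python
import re

# Assumption: the exit line looks like the sample in this thread, e.g.
# "... (D_ALWAYS) **** condor_startd.exe (condor_STARTD) pid 10328 EXITING WITH STATUS 0"
EXIT_RE = re.compile(
    r"\*\*\*\* \S+ \(condor_\w+\) pid \d+ EXITING WITH STATUS (\d+)"
)


def last_exit_status(log_text):
    """Return the exit status found on the final non-empty log line.

    Returns None when the log is empty or its last line is not an exit line.
    """
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = EXIT_RE.search(lines[-1])
    return int(match.group(1)) if match else None


def shutdown_was_clean(log_text):
    """True only if the last log line reports EXITING WITH STATUS 0."""
    return last_exit_status(log_text) == 0
```

To use it, read the daemon log into a string and pass it to shutdown_was_clean(). The log location is site-specific, so look it up in your own configuration rather than relying on a fixed path.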

A clean shutdown will show this message in the StartLog:

10/05/17 12:51:12.038 (D_ALWAYS) Got SIGTERM. Performing graceful shutdown.

and this message, which will be the last message in the log:

10/05/17 12:51:12.170 (D_ALWAYS) **** condor_startd.exe (condor_STARTD) pid 10328 EXITING WITH STATUS 0

If the second message is missing, that's a strong indicator that the shutdown was not clean.

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of Mary Romelfanger

Hi Everyone,

My apologies if this has been asked already, or if I missed a notification. I have searched and not found any references to this question.

There appears to be a new delay before condor_status reflects the availability of cores in the pool when HTCondor on a machine is stopped. I am pretty sure that delay was not there before.

Example: If I have 80 cores (16 cores on each of 5 VMs that are only running startd's) and they are all up, then condor_status correctly shows 80 cores. If I then shut down HTCondor on one of the VMs, a ps shows that the condor processes are gone, but condor_status does not update to reflect that the number of cores is down to 64 for many minutes (as many as 10 or 15 minutes).

I believe that this is new behavior in 8.6 (we are currently running 8.6.6). I double-checked in our 8.4 pool before we updated it, and I am pretty sure that it did not have that behavior, meaning a shutdown of HTCondor on a VM in a pool was immediately reflected in condor_status.

Is this behavior expected? Is there a better way (other than the ps) to determine what cores are really there with a reliable immediate answer?

We have been troubleshooting some issues which have required a number of shutdowns and startups, and it has become an issue (really just a pain in the... - there are other ways to tell) that the condor_status result is not a true current reflection of the status of the pool. Did I miss a new knob or a new command?
:)

Thank You
--
Mary

Mary Romelfanger
Deputy Branch Manager
Data Systems Branch
.___.
{o,o}
Phone 410-338-6708
Space Telescope Science Institute
3700 San Martin Drive
Baltimore, MD 21218