Re: [HTCondor-users] node gone from condor_status?
- Date: Mon, 06 Mar 2017 15:36:57 +0000
- From: John M Knoeller <johnkn@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] node gone from condor_status?
Nodes disappear from condor_status when they fail to send an update ad to the collector for too long and the collector discards the expired ad. There are a variety of reasons why that can happen.
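For context, two knobs govern that timeout (the defaults shown here are from memory of the 8.6 manual, so check your own configuration): the startd sends a fresh ad every UPDATE_INTERVAL seconds, and the collector drops any ad it has not heard about for CLASSAD_LIFETIME seconds. Roughly three missed updates in a row and the node vanishes from condor_status.

    # startd side: how often ads are sent (seconds)
    UPDATE_INTERVAL = 300
    # collector side: how long a stale ad is kept (seconds)
    CLASSAD_LIFETIME = 900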
Is the node still running an HTCondor startd? Check the StartLog on the execute node.
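A quick way to check, assuming a stock Linux install where condor_config_val is on the PATH (a sketch, not from the original message):

    # on the execute node: is the startd process alive?
    ps ax | grep condor_startd
    # and what has it logged recently?
    tail -n 50 $(condor_config_val STARTD_LOG)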
Did the ALLOW_* list on the collector change? Maybe the node is no longer allowed to send updates. Check the CollectorLog on the central manager node.
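For instance (again a sketch; the knob names are the standard 8.6 security settings):

    # on the central manager: who may advertise startd ads?
    condor_config_val ALLOW_WRITE ALLOW_ADVERTISE_STARTD
    # look for refused updates in the collector's log
    grep -i "PERMISSION DENIED" $(condor_config_val COLLECTOR_LOG)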
Is the node configured to send updates via TCP? Check the UPDATE_COLLECTOR_WITH_TCP knob on the execute node.
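To check it in place:

    # on the execute node
    condor_config_val UPDATE_COLLECTOR_WITH_TCP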
If updates are sent via UDP, then it's possible that only the initial update (right after the reconfig) got through to the collector: the initial update is always sent via TCP, but if UPDATE_COLLECTOR_WITH_TCP=false, the subsequent updates will be UDP.
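If that turns out to be the problem, a minimal fix (a sketch; the config.d path is the usual RPM layout, so adjust for your install) is to force TCP updates on the execute node and reconfig:

    # e.g. in /etc/condor/config.d/99-tcp-updates.conf
    UPDATE_COLLECTOR_WITH_TCP = True

    # then tell the running daemons to pick it up
    condor_reconfig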
-tj
-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Dimitri Maziuk
Sent: Friday, March 3, 2017 4:11 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] node gone from condor_status?
Hi all,
> [root@turkey ~]# condor_status turkey ; echo $?
> 0
> [root@turkey ~]# ps -AF | grep condor
> condor 789 1 0 17379 6336 2 Feb06 ? 00:00:23 /usr/sbin/condor_master -f
> root 934 789 0 6202 4452 3 Feb06 ? 00:39:01 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 501
> condor 935 789 0 12347 5568 3 Feb06 ? 00:00:26 condor_shared_port -f
> condor 1003 789 0 12829 7848 2 Feb06 ? 00:44:20 condor_startd -f
> condor 179933 1003 0 12705 6480 3 Mar01 ? 00:00:59 condor_starter -f -a slot7 exocet.bmrb.wisc.edu
> bbee 179940 179933 0 4493 1468 3 Mar01 ? 00:00:00 /bin/sh /var/lib/condor/execute/dir_179933/condor_exec.exe 208 22
The last line is the one remaining job that's still running.
I set START=FALSE and did a condor_reconfig yesterday for the 8.6.1 update, and it's taking a while for the jobs to taper off. A couple of hours ago there were two running jobs and the rest of the cores were in the Owner state. Sometime between then and now the node disappeared from the condor_status output. Any idea why?
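For reference, the drain setup described above amounts to something like this (a sketch; the local-config placement is an assumption, and condor_status -direct is one way to ask the startd for its ad while bypassing the collector entirely):

    # local config on the execute node, then reconfig
    START = FALSE
    condor_reconfig

    # query the startd directly, skipping the collector
    condor_status -direct turkey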
TIA
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu