---------- Forwarded message ----------
From: O'Donnell, Michael <odonnellm@xxxxxxxx>
Date: Fri, May 31, 2013 at 6:37 AM
Subject: Problems with HTCondor schedd or collector [tracking of submit machines]
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael O'Donnell <odonnellm@xxxxxxxx>
I have a Windows pool running mostly HTCondor 7.8.7, and the central manager seems to have a problem tracking the submit machines. The schedd service is always running on these machines, but after some time the central manager/collector can no longer detect them (I see no pattern in which machines are affected or when). As a workaround, I am using a scheduled executable that runs every 30 minutes and tries to fix this problem, but I really need to find a better solution. The executable runs condor_restart -schedd and condor_reconfig, which corrects the problem temporarily, but this is not sustainable.
I posted about this earlier this week (see below), but basically I cannot find any error messages in the log files on the submit machines or on the central manager.
Does anyone have any thoughts as to what I can do to figure out what is causing this problem?
Thank you for the help,
Mike
May 29:
I am primarily using 7.8.7 on Windows within our HTCondor pool, and I am noticing that condor_status with a daemon option (e.g., -schedd, -master) is not reporting accurately. For example, if I run condor_status, I see all the machines/slots in the pool, but I do not see most of these machines when I run condor_status -master. When I run condor_status -schedd, I do not pick up all the submit machines in the pool. However, the schedd service is running on the submit machines, and condor_q on the local machine reports accurately; I can also submit jobs. I do not see any errors in the collector log (on the central manager) or the schedd log (on the submit machines).
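The comparison described above can be scripted so that the missing machines are listed rather than spotted by eye. What follows is only a minimal sketch, not a tested diagnostic: it assumes condor_status is on the PATH, that its -format option is available in the installed version, and that every ad carries a Machine attribute. The function names are hypothetical and chosen for illustration.

```python
import subprocess


def parse_machines(output):
    """Return the set of lower-cased, non-empty host names found in command output."""
    return {line.strip().lower() for line in output.splitlines() if line.strip()}


def query_machines(daemon_flag=None):
    """Ask the collector which machines it knows about.

    daemon_flag is None for the default (slot) ads, or "-master" / "-schedd".
    Assumption: each ad has a Machine attribute that -format can print.
    """
    cmd = ["condor_status"]
    if daemon_flag:
        cmd.append(daemon_flag)
    # Pass the same literal format string one would type at a command prompt.
    cmd += ["-format", r"%s\n", "Machine"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_machines(result.stdout)


def missing_from(expected, reported):
    """Return a sorted list of machines that are expected but were not reported."""
    return sorted(set(expected) - set(reported))
```

On the central manager, `missing_from(query_machines(), query_machines("-master"))` would list machines that advertise slots but have no master ad. Submit-only machines do not appear in the default output, so for the schedd check the expected list of submit hosts would have to be supplied by hand, for example `missing_from(my_submit_hosts, query_machines("-schedd"))`, where `my_submit_hosts` is a hypothetical list of lower-case host names. Logging the result with a timestamp on each scheduled run might help show whether the ads drop out at regular intervals.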
Could there be something going on that I am missing, or is it possible this is a bug? I have noticed this problem for a little while, and right now I can usually (not always) fix it by running condor_restart -schedd. Everything else seems to be functioning as expected.
Any ideas how to troubleshoot? Thanks,
mike