[HTCondor-users] Jobs do not execute, they sit idle in the queue indefinitely
- Date: Fri, 17 May 2013 14:21:51 -0400
- From: Dan Shea <daniel_shea2@xxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Jobs do not execute, they sit idle in the queue indefinitely
Hi,
I'm attempting to configure a test Condor cluster. I have 10 machines,
all running CentOS 6.4.
They have no DNS records; instead, each has an /etc/hosts file
containing the IP addresses of every node in the cluster.
I've configured the stable repo and used that to install the condor
software.
I then modified the /etc/condor/condor_config so that the subnet these
machines reside on was enabled for write access.
A quick test showed everything was working and jobs would execute as
expected.
However, this was with the following condor_config.local entry on each
of the 10 nodes:
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
I am now attempting to configure one node as a gatekeeper:
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
And the other 9 nodes as execute-only nodes:
DAEMON_LIST = MASTER, STARTD
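To make the intended split explicit, here is a minimal Python sketch that classifies a node by its DAEMON_LIST value. The helper name node_roles is hypothetical and not part of HTCondor; it only encodes the usual convention that COLLECTOR plus NEGOTIATOR mark the central manager, SCHEDD marks a submit node, and STARTD marks an execute node.

```python
# Hypothetical helper (not an HTCondor tool): infer a node's roles
# from the comma-separated value of its DAEMON_LIST setting.
def node_roles(daemon_list):
    daemons = {d.strip().upper() for d in daemon_list.split(",") if d.strip()}
    roles = set()
    if {"COLLECTOR", "NEGOTIATOR"} <= daemons:
        roles.add("central manager")  # runs matchmaking for the pool
    if "SCHEDD" in daemons:
        roles.add("submit")           # holds the job queue
    if "STARTD" in daemons:
        roles.add("execute")          # advertises slots and runs jobs
    return roles


gatekeeper = node_roles("COLLECTOR, MASTER, NEGOTIATOR, SCHEDD")
worker = node_roles("MASTER, STARTD")
print(sorted(gatekeeper))
print(sorted(worker))
```

With the two lists above, the gatekeeper is a central manager and submit node with no execute role, and each of the other 9 nodes is execute-only.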
After restarting the services, jobs no longer execute. They sit
idle in the queue indefinitely.
[root@node00 condor]# condor_q
-- Submitter: node00 : <10.11.114.220:44213> : node00
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 mfs 5/17 13:41 0+00:00:00 I 0 0.0 myprog Example.2.0
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
condor_q -analyze is not much help:
-- Submitter: node00 : <10.11.114.220:44213> : node00
---
002.000: Request has not yet been considered by the matchmaker.
I did notice the following warning in the SchedLog:
SchedLog:05/17/13 13:41:21 (pid:9037) WARNING: forward resolution of localhost.localdomain doesn't match 10.11.114.220!
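The kind of check behind this warning can be sketched in Python: look a name up in hosts-file text and compare the result with the expected address. This is only an illustration, not HTCondor's actual resolution code. The function names and the sample hosts content are hypothetical; the real /etc/hosts on node00 may differ.

```python
# Hypothetical helpers: parse hosts-file text and test whether a name
# forward-resolves to the address we expect.
def parse_hosts(text):
    """Map each hostname or alias to the first address listed for it."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, addr)      # first entry wins
    return table


def forward_matches(hosts_text, name, expected_addr):
    return parse_hosts(hosts_text).get(name) == expected_addr


# Assumed sample content, for illustration only.
SAMPLE = """\
127.0.0.1       localhost localhost.localdomain
10.11.114.220   node00
"""

print(forward_matches(SAMPLE, "localhost.localdomain", "10.11.114.220"))
print(forward_matches(SAMPLE, "node00", "10.11.114.220"))
```

Under that assumed file, localhost.localdomain maps to the loopback address rather than 10.11.114.220, which is the sort of mismatch the warning describes, while node00 matches.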
I also found this entry, which makes no sense to me since the startd is
not set up to run on node00 in the local config:
SchedLog:05/17/13 13:56:21 (pid:9037) Can't find address for startd node00
The test job itself is from the tutorial here:
http://research.cs.wisc.edu/htcondor/tutorials/scotland-admin-tutorial-2003-10-23/scotland-admin-tutorial-2003-10-23.DEMO.html
Any assistance pointing me in the right direction is greatly appreciated.
Regards,
Dan Shea
--
Dan Shea - daniel_shea2@xxxxxxxxxxxxxxx
Senior Systems Administrator, West Quad Computing Group
Harvard Medical School
"Charlie was a chemist, But Charlie is no more. For what he thought was H2O, Was H2SO4."