Re: [Condor-users] condor_submit hangs / condor_q hangs
- Date: Thu, 18 Mar 2010 12:13:10 -0500
- From: Adam Yates <yates@xxxxxxxxxxx>
- Subject: Re: [Condor-users] condor_submit hangs / condor_q hangs
Here's my SchedLog on the submit machine:
03/18 12:01:44 (pid:12062) ******************************************************
03/18 12:01:44 (pid:12062) Using config source: /opt/packages/condor-7.4.1/etc/condor_config
03/18 12:01:44 (pid:12062) Using local config sources:
03/18 12:01:44 (pid:12062) /home/condor/n00/condor_config.local
03/18 12:01:44 (pid:12062) DaemonCore: Command Socket at <10.254.0.10:35322>
03/18 12:01:44 (pid:12062) History file rotation is enabled.
03/18 12:01:44 (pid:12062) Maximum history file size is: 20971520 bytes
03/18 12:01:44 (pid:12062) Number of rotated history files is: 2
03/18 12:01:49 (pid:12062) Sent ad to central manager for LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:01:49 (pid:12062) Sent ad to 1 collectors for LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:02:35 (pid:12062) Negotiating for owner: LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:02:35 (pid:12062) AutoCluster:config() significant atttributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
03/18 12:02:35 (pid:12062) Out of jobs - 0 jobs matched, 0 jobs idle, flock level = 0
03/18 12:02:35 (pid:12062) Sent ad to central manager for LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:02:35 (pid:12062) Sent ad to 1 collectors for LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:03:24 (pid:12094) ******************************************************
03/18 12:03:24 (pid:12094) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
03/18 12:03:24 (pid:12094) ** /opt/packages/condor-7.4.1/sbin/condor_schedd
03/18 12:03:24 (pid:12094) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
03/18 12:03:24 (pid:12094) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
03/18 12:03:24 (pid:12094) ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
03/18 12:03:24 (pid:12094) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/18 12:03:24 (pid:12094) ** PID = 12094
03/18 12:03:24 (pid:12094) ** Log last touched 3/18 12:02:44
03/18 12:03:24 (pid:12094) ******************************************************
03/18 12:03:24 (pid:12094) Using config source: /opt/packages/condor-7.4.1/etc/condor_config
03/18 12:03:24 (pid:12094) Using local config sources:
03/18 12:03:24 (pid:12094) /home/condor/n00/condor_config.local
03/18 12:03:24 (pid:12094) DaemonCore: Command Socket at <10.254.0.10:55082>
03/18 12:03:24 (pid:12094) History file rotation is enabled.
03/18 12:03:24 (pid:12094) Maximum history file size is: 20971520 bytes
03/18 12:03:24 (pid:12094) Number of rotated history files is: 2
03/18 12:03:24 (pid:12094) About to rotate ClassAd log /home/condor/n00//spool/job_queue.log
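The log switches from pid 12062 to pid 12094 at 12:03:24, together with a STARTING UP banner, so the schedd appears to have started again under a new pid. A minimal sketch for spotting such pid changes in a longer SchedLog, assuming every entry carries the "MM/DD HH:MM:SS (pid:N)" prefix shown above; `LINE_RE` and `find_restarts` are hypothetical names, not part of Condor:

```python
import re

# Assumed entry prefix, taken from the log excerpt: "MM/DD HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\)\s?(.*)$")


def find_restarts(log_text):
    """Return (timestamp, old_pid, new_pid) for every pid change in the log."""
    restarts = []
    last_pid = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            # Lines without the prefix (for example wrapped continuations) are skipped.
            continue
        stamp, pid = match.group(1), int(match.group(2))
        if last_pid is not None and pid != last_pid:
            restarts.append((stamp, last_pid, pid))
        last_pid = pid
    return restarts


sample = """\
03/18 12:02:35 (pid:12062) Sent ad to 1 collectors for
LSU760000@xxxxxxxxxxxxxxxxx
03/18 12:03:24 (pid:12094)
******************************************************
03/18 12:03:24 (pid:12094) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
"""
print(find_restarts(sample))
```

A pid change only says that a new schedd process wrote to the log; it does not say why the previous one stopped.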
Thanks!
Adam
Steven Timm wrote:
What's the content of SchedLog on the submit machine? It sounds
like it could be some kind of authentication issue between
condor_submit, condor_q, and the schedd, or else a schedd that's
just totally hosed for some reason.
Also, you can run
export _CONDOR_TOOL_DEBUG=D_ALL ; condor_submit -debug <args>
to get some debugging info from condor_submit.
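Since the tool being debugged is one that hangs, it can help to run it under a hard timeout and keep whatever debug output it produced. A minimal sketch, assuming that Condor picks up a TOOL_DEBUG override from an environment variable with the `_CONDOR_` prefix; `run_with_timeout` is a hypothetical helper and `job.sub` in the comment is a placeholder submit file:

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd; return (returncode, output). returncode is None on timeout."""
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # debug output goes to stderr, so merge it
            text=True,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may arrive as bytes even in text mode.
        out = exc.output or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return None, out


# Intended use on a submit node (not executed here):
#   run_with_timeout(["condor_submit", "-debug", "job.sub"], 60,
#                    {"_CONDOR_TOOL_DEBUG": "D_ALL"})

# Harmless demonstration with a child process that sleeps past the timeout.
code, out = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 1)
print("timed out" if code is None else f"exit code {code}")
```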
Steve
On Thu, 18 Mar 2010, Adam Yates wrote:
Hi everyone,
I'm having a problem with a fresh Condor install.
My setup is this:
1 master node (master)
2 interactive nodes (submit only: n00 and n01)
64 worker nodes (execute only: n02-n66)
Config structure:
master: /home/condor is NFS-exported to all nodes. Local configs are in
/home/condor/$HOSTNAME/condor_config.local
Whenever I use condor_submit, it hangs on "Submitting job.." and then
eventually times out, saying that it failed to connect to the local
machine on port x and failed to fetch ads from the localhost on port x.
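One way to narrow down a "failed to connect" timeout like this is to test whether the schedd's command port accepts TCP connections at all. A minimal sketch, assuming the address is taken from the "DaemonCore: Command Socket at <ip:port>" line that the schedd writes to its SchedLog; `parse_command_socket` and `can_connect` are hypothetical helper names:

```python
import re
import socket


def parse_command_socket(text):
    """Extract (host, port) from a '<ip:port>' address as printed in the log."""
    match = re.search(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>", text)
    if not match:
        raise ValueError(f"no <ip:port> address found in {text!r}")
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


line = "03/18 12:03:24 (pid:12094) DaemonCore: Command Socket at <10.254.0.10:55082>"
host, port = parse_command_socket(line)
print(host, port)
# On the submit node itself one would then call: can_connect(host, port)
```

A successful connection only shows that something is listening on the port; it says nothing about authentication or whether the schedd is answering queries.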
Whenever I use condor_q, it hangs, but if I run condor_q -global,
it returns the status of my Condor pool and shows some of the nodes
in use.
The daemon listing in my config on the submit nodes is as follows:
DAEMON_LIST = MASTER, SCHEDD
There are no errors in the local logs (log/*).
Can anybody make sense of this? Please let me know if any more info
is needed.
--
Adam Yates
Systems Administrator -- Research Infrastructure
Center for Computation and Technology
232 Johnston Hall,
Baton Rouge, LA 70803
W: 225.578.8235 C: 225.663.0218
<yates@xxxxxxxxxxx>