[Condor-users] condor_submit hangs when Queue > 1
- Date: Wed, 29 Dec 2010 21:29:06 -0500
- From: Michael Hanke <michael.hanke@xxxxxxxxx>
- Subject: [Condor-users] condor_submit hangs when Queue > 1
Hi,
when I submit a vanilla universe job with a submit file like this:
Executable = /bin/sleep
Arguments = 30
Universe = vanilla
output = runner.out
error = runner.error
Log = runner.log
Queue 1
everything is fine. However, when I change the last line to 'Queue 2' or
any other number larger than 1, I can no longer submit the job:
condor_submit hangs. strace shows that it is waiting to read from a
socket, and SchedLog has this:
12/29/10 21:10:41 (pid:12071) condor_read(): timeout reading 5 bytes from <10.0.0.1:50781>.
12/29/10 21:10:41 (pid:12071) IO: Failed to read packet header
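In case it helps anyone looking for the same symptom in a longer SchedLog, here is a minimal Python sketch that picks out the condor_read() timeout lines and the peer address they name. The regular expression is only derived from the log excerpts quoted in this message, so treat the pattern, the function name find_read_timeouts and the SAMPLE list as illustrative assumptions rather than a description of the official log format.

```python
import re

# Pattern modelled on the timeout lines quoted in this message. The
# "(pid:N)" field is optional because the excerpts show lines both with
# and without it.
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\(pid:(?P<pid>\d+)\) )?"
    r"condor_read\(\): timeout reading (?P<nbytes>\d+) bytes "
    r"from <(?P<host>[\d.]+):(?P<port>\d+)>"
)


def find_read_timeouts(lines):
    """Return one dict per condor_read() timeout line found in `lines`."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match is None:
            continue
        pid = match.group("pid")
        hits.append({
            "stamp": match.group("stamp"),
            "pid": int(pid) if pid else None,
            "nbytes": int(match.group("nbytes")),
            "peer": (match.group("host"), int(match.group("port"))),
        })
    return hits


# The two SchedLog lines quoted above, used as sample input.
SAMPLE = [
    "12/29/10 21:10:41 (pid:12071) condor_read(): timeout reading 5 bytes from <10.0.0.1:50781>.",
    "12/29/10 21:10:41 (pid:12071) IO: Failed to read packet header",
]

if __name__ == "__main__":
    for hit in find_read_timeouts(SAMPLE):
        print(hit)
```

The peer port reported here can then be matched against netstat output to see which process owns the other end of the connection.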
It seems that the schedd cannot talk to condor_submit:
mih@head1 ~/debian/condor % sudo netstat -anp |grep 53992
tcp 0 0 0.0.0.0:53992 0.0.0.0:* LISTEN 12071/condor_schedd
tcp 0 0 10.0.0.1:57682 10.0.0.1:53992 ESTABLISHED 12072/condor_negoti
tcp 0 0 10.0.0.1:53992 10.0.0.1:57682 ESTABLISHED 12071/condor_schedd
tcp 1 0 10.0.0.1:50781 10.0.0.1:53992 CLOSE_WAIT 20398/condor_submit
udp 0 0 0.0.0.0:53992 0.0.0.0:* 12071/condor_schedd
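CLOSE_WAIT is the TCP state in which the remote end has closed the connection but the local process has not yet closed its own socket, so the condor_submit line above suggests the schedd gave up first. A rough Python sketch for spotting such sockets in `netstat -anp` output follows. It assumes the column layout shown above (tcp lines carry a state column, the udp line does not), and the names parse_netstat and SAMPLE are made up for illustration.

```python
def parse_netstat(lines):
    """Parse `netstat -anp` style lines into dicts; skip anything else."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] not in ("tcp", "udp"):
            continue  # shell prompts, headers, blank lines
        proto, recv_q, send_q, local, remote = fields[:5]
        if len(fields) >= 7:
            state, owner = fields[5], fields[6]
        else:
            # In the output above, the udp line has no state column.
            state, owner = "", fields[5]
        pid, _, program = owner.partition("/")
        rows.append({
            "proto": proto,
            "recv_q": int(recv_q),
            "send_q": int(send_q),
            "local": local,
            "remote": remote,
            "state": state,
            "pid": int(pid) if pid.isdigit() else None,
            "program": program,
        })
    return rows


# The netstat output quoted above, used as sample input.
SAMPLE = [
    "tcp 0 0 0.0.0.0:53992 0.0.0.0:* LISTEN 12071/condor_schedd",
    "tcp 0 0 10.0.0.1:57682 10.0.0.1:53992 ESTABLISHED 12072/condor_negoti",
    "tcp 0 0 10.0.0.1:53992 10.0.0.1:57682 ESTABLISHED 12071/condor_schedd",
    "tcp 1 0 10.0.0.1:50781 10.0.0.1:53992 CLOSE_WAIT 20398/condor_submit",
    "udp 0 0 0.0.0.0:53992 0.0.0.0:* 12071/condor_schedd",
]

if __name__ == "__main__":
    for row in parse_netstat(SAMPLE):
        if row["state"] == "CLOSE_WAIT":
            print(row["program"], row["pid"], row["local"], "->", row["remote"])
```

On the sample input this reports only the condor_submit socket on local port 50781, the same port named in the schedd's timeout message.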
Enabling some debugging in condor_submit doesn't shed more light:
mih@head1 ~/debian/condor % _TOOL_DEBUG=D_ALL ; condor_submit -debug job
12/29/10 21:10:21 Can't find CondorPlatform in classad for schedd head1.xxxxx.xxxxxxxxx.xxx
Submitting job(s).
[hangs]
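For anyone who wants to reproduce this without leaving a stuck process behind, here is a small Python sketch that writes the submit file from this message with a chosen Queue count and runs condor_submit under a timeout. The helper names (make_submit_description, run_with_timeout, try_submit) and the 60-second limit are my own assumptions. The limit is simply chosen to be longer than the roughly 20 seconds that pass between "Submitting job(s)." and the schedd's read timeout in the logs.

```python
import os
import subprocess
import tempfile

# Submit description copied from this message; only the Queue count varies.
SUBMIT_TEMPLATE = """\
Executable = /bin/sleep
Arguments = 30
Universe = vanilla
output = runner.out
error = runner.error
Log = runner.log
Queue {count}
"""


def make_submit_description(count):
    """Return the submit file text with `Queue <count>` as the last line."""
    return SUBMIT_TEMPLATE.format(count=count)


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


def try_submit(count, seconds=60):
    """Submit `count` jobs; None means condor_submit did not finish in time."""
    with tempfile.NamedTemporaryFile("w", suffix=".submit", delete=False) as fh:
        fh.write(make_submit_description(count))
        path = fh.name
    try:
        return run_with_timeout(["condor_submit", path], seconds)
    finally:
        os.unlink(path)  # remove the temporary submit file
```

Calling try_submit(1) and then try_submit(2) on the affected machine should return an exit code for the first and None for the second if the hang is reproducible. Note that a successful call really does queue sleep jobs.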
The job is submitted on the central manager of the pool. Deviations from
the default configuration are fairly minimal:
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
UID_DOMAIN = xxxxx.xxxxxxxxx.xxx
FILESYSTEM_DOMAIN = xxxxx.xxxxxxxxx.xxx
ALLOW_WRITE = *xxxxx.xxxxxxxxx.xxx
NETWORK_INTERFACE = 10.0.0.1
Enabling D_FULLDEBUG for SCHEDD doesn't add much more:
12/29/10 21:21:56 Adding to resolved authorization table: mih@xxxxxxxxxxxxxxxxxxx/10.0.0.1: WRITE
12/29/10 21:21:56 Received TCP command 1112 (QMGMT_WRITE_CMD) from mih@xxxxxxxxxxxxxxxxxxx <10.0.0.1:38560>, access level WRITE
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:22:16 condor_read(): timeout reading 5 bytes from <10.0.0.1:38560>.
12/29/10 21:22:16 IO: Failed to read packet header
12/29/10 21:22:16 QMGR Connection closed
I'd be glad if somebody could point me to the problem.
Thanks in advance,
Michael
--
Michael Hanke
http://mih.voxindeserto.de