Re: [HTCondor-users] Need help for job disconnection and reconnection failure! Argent...
- Date: Tue, 14 May 2013 10:58:10 -0400
- From: Diego Bello <dbello@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Need help for job disconnection and reconnection failure! Argent...
On Tue, May 14, 2013 at 3:56 AM, 钱晓明 <kyleqian@xxxxxxxxx> wrote:
> I submitted jobs to my cluster, but none of them can run because they all
> get disconnected. Here is my Condor version (I am using Rocks to manage my
> cluster):
> [kyle@imagegrid ~]$ condor_version
> $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
> $CondorPlatform: x86_64_rhap_6.3 $
> [kyle@imagegrid ~]$ condor_status
> Name               OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime
> slot10@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:05
> slot11@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:06
> slot12@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:07
> slot13@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:08
> slot14@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:09
> slot15@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:10
> slot16@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:03
> slot1@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:00:04
> slot2@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:00:05
> slot3@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:00:06
> slot4@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:00:06
> slot5@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.020  499  0+00:25:08
> slot6@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:09
> slot7@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:10
> slot8@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:03
> slot9@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:25:04
> slot10@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:06
> slot11@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:07
> slot12@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:08
> slot13@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:09
> slot14@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:10
> slot15@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:11
> slot16@xxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:04
> slot1@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:14:41
> slot2@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:06
> slot3@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:07
> slot4@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:08
> slot5@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:09
> slot6@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:10
> slot7@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:11
> slot8@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:04
> slot9@xxxxxxxxxxxx LINUX  X86_64 Unclaimed Idle     0.000  499  0+00:15:05
>              Total Owner Claimed Unclaimed Matched Preempting Backfill
> X86_64/LINUX    32     0       0        32       0          0        0
>        Total    32     0       0        32       0          0        0
> [kyle@imagegrid ~]$ condor_q
> -- Submitter: imagegrid.otitan.com : <192.168.1.100:40073> : imagegrid.otitan.com
>  ID    OWNER  SUBMITTED     RUN_TIME    ST  PRI  SIZE  CMD
>  2.0   kyle   5/14 23:24    0+00:00:00  I   0    0.0   showpwd.sh
>  2.1   kyle   5/14 23:24    0+00:00:08  I   0    0.0   showpwd.sh
>  2.2   kyle   5/14 23:24    0+00:00:17  I   0    0.0   showpwd.sh
>  2.3   kyle   5/14 23:24    0+00:00:01  I   0    0.0   showpwd.sh
> 4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
>
> The log content of my job is:
> [kyle@imagegrid ~]$ cat showpwd.log
> 000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.001.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.002.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.003.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.000.000) 05/14 23:24:57 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.001.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.001.000) 05/14 23:24:57 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.002.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.003.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.003.000) 05/14 23:24:58 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.002.000) 05/14 23:25:06 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.000.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.000.000) 05/14 23:26:58 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.001.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.002.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.003.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.003.000) 05/14 23:26:58 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.001.000) 05/14 23:27:06 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.002.000) 05/14 23:27:06 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
> ...
>
> I can see that after submission some slots became Claimed, but after a few
> seconds they became Unclaimed again.
> Here is my local configuration (generated by Rocks):
>
> ALLOW_WRITE = $(HOSTALLOW_WRITE)
> AMAZON_GAHP = $(SBIN)/amazon_gahp
> AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
> COLLECTOR_NAME = Collector at imagegrid.otitan.com
> COLLECTOR_SOCKET_CACHE_SIZE = 1000
> CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxxxx
> CONDOR_DEVELOPERS = NONE
> CONDOR_DEVELOPERS_COLLECTOR = NONE
> CONDOR_HOST = imagegrid.otitan.com
> CONDOR_IDS = 407.500
> CONDOR_SSHD = /usr/sbin/sshd
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
> CONTINUE = True
> DAEMON_LIST = MASTER, STARTD
> EMAIL_DOMAIN = $(FULL_HOSTNAME)
> FILESYSTEM_DOMAIN = otitan.com
> HIGHPORT = 50000
> HOSTALLOW_WRITE = imagegrid.otitan.com, *.local, *.local
> JAVA = /usr/bin/java
> KILL = False
> LOCAL_DIR = /var/opt/condor
> LOCK = /tmp/condor-lock.$(HOSTNAME)
> LOWPORT = 40000
> MAIL = /bin/mail
> NEGOTIATOR_INTERVAL = 120
> NETWORK_INTERFACE = 10.255.255.254
> PREEMPT = False
> RANK = None
> RELEASE_DIR = /opt/condor
> SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
> START = True
> STARTD_EXPRS = $(STARTD_EXPRS)
> SUSPEND = False
> UID_DOMAIN = local
> UPDATE_COLLECTOR_WITH_TCP = True
> WANT_SUSPEND = False
> WANT_VACATE = False
> # First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
> JAVA_MAXHEAP_ARGUMENT =
> JAVA_EXTRA_ARGUMENTS = -Xmx1906m
> Can someone help me? Thanks!
Two things to check:
- Did you enable file transfer in your submit file? Please send one so
we can take a look.
- Did you set the ALLOW_WRITE parameter? It has to grant write access to
the network your servers are on.
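
For the first point, here is a minimal sketch of a submit description file with file transfer enabled. It assumes a vanilla-universe job running the showpwd.sh script from your post. The output and error file names are placeholders I made up, so adjust them to match your setup.

```
# Hypothetical submit description for the showpwd.sh test job.
universe                = vanilla
executable              = showpwd.sh
log                     = showpwd.log

# Placeholder names: one stdout/stderr file per job in the cluster.
output                  = showpwd.$(Process).out
error                   = showpwd.$(Process).err

# Ask HTCondor to copy files to and from the execute node
# instead of relying on a shared file system.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Four jobs, matching the 2.0 to 2.3 entries in your condor_q output.
queue 4
```

If your own submit file has no should_transfer_files line, the jobs may be depending on a shared file system between the submit and execute hosts, which is worth ruling out.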
--
Diego Bello Carreño