Re: [HTCondor-users] not starting jobs in condor ver 8.3.6
- Date: Sat, 04 Jul 2015 13:12:22 -0400
- From: Jan Balewski <janstar1122@xxxxxxxxx>
- Subject: Re: [HTCondor-users] not starting jobs in condor ver 8.3.6
Hi Todd,
Yes, I meant the issue with the transition from 8.3.5 to 8.3.6.
I am answering in order below; note that the config changes, once made, were not undone along the way.
a)================
SUGGESTION: A quick thing to check would be if turning off TCP_FORWARDING_HOST allows the pool to work.
I first checked what condor sees for a regular user:
[cosy11@oswrk121 0x]$ condor_config_val -v TCP_FORWARDING_HOST
TCP_FORWARDING_HOST = 198.125.163.121
# at: /etc/condor/condor_config.local, line 26
# raw: TCP_FORWARDING_HOST = 198.125.163.121
Next, as root, I disabled TCP_FORWARDING_HOST in the config and did a service condor stop/start.
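For reference, the change amounts to commenting out (or removing) the line that condor_config_val located above:
# /etc/condor/condor_config.local, line 26
# disabled by commenting out the public-address knob:
#TCP_FORWARDING_HOST = 198.125.163.121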
Next I verified that it took effect:
[cosy11@oswrk121 0x]$ condor_config_val -v TCP_FORWARDING_HOST
Not defined: TCP_FORWARDING_HOST
Now I verify the pool works - it does:
[cosy11@oswrk121 0x]$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.030 2421 0+00:00:04
slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:00:31
slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:00:32
slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:00:33
slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:00:34
slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:00:35
Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX 6 0 0 6 0 0
Total 6 0 0 6 0 0
b)==============
SUGGESTION: You might also try adding the public IPs to the ALLOW_READ and ALLOW_WRITE settings.
Done:
[root@oswrk121 ~]# condor_config_val -v ALLOW_WRITE
ALLOW_WRITE = *.lns.mit.edu,10.200.60.*,198.125.163.121
# at: /etc/condor/condor_config.local, line 34
# raw: ALLOW_WRITE = *.lns.mit.edu,10.200.60.*,198.125.163.121
[root@oswrk121 ~]# condor_config_val -v ALLOW_READ
ALLOW_READ = *.lns.mit.edu,10.200.60.*,198.125.163.121
# at: /etc/condor/condor_config.local, line 33
# raw: ALLOW_READ = *.lns.mit.edu,10.200.60.*,198.125.163.121
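For completeness, the relevant block of /etc/condor/condor_config.local now reads as below (reconstructed from the "# raw:" output above); the daemons only see such a change after a condor_reconfig or a restart:
# /etc/condor/condor_config.local, lines 33-34
ALLOW_READ = *.lns.mit.edu,10.200.60.*,198.125.163.121
ALLOW_WRITE = *.lns.mit.edu,10.200.60.*,198.125.163.121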
Note that access for this host was already granted via its DNS name, since:
[root@oswrk121 ~]# nslookup 198.125.163.121
Server: 10.200.60.21
Address: 10.200.60.21#53
121.163.125.198.in-addr.arpa name = oswrk121.lns.mit.edu.
After this change I re-submitted the jobs, and they all still sit idle even though 6 job slots are open.
c)============
SUGGESTION: starting up a fourth VM with 8.3.6 installed to check if the two 8.3.6 machines can communicate.
I started a new VM with IP=122 running 8.3.6 and configured it as a worker reporting to IP=121.
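For reference, the worker side of that setup is essentially two knobs in its local config (a minimal sketch; the actual file on IP=122 has the usual surrounding settings):
# /etc/condor/condor_config.local on oswrk122 (IP=122)
# point this node at the central manager on IP=121
CONDOR_HOST = oswrk121.lns.mit.edu
# run only the daemons an execute-only node needs
DAEMON_LIST = MASTER, STARTD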
The worker node (IP=122) is seen by the condor master (IP=121); the pool now shows 14 job slots, yet the 12 jobs still await execution:
[cosy11@oswrk121 nice-simple]$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.010 2421 0+00:10:10
slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:10:37
slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:10:38
slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:10:39
slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:10:40
slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 2421 0+00:10:41
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.010 1815 0+00:04:18
slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:45
slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:46
slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:47
slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:48
slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:49
slot7@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:50
slot8@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1815 0+00:04:43
Machines Owner Claimed Unclaimed Matched Preempting
X86_64/LINUX 14 0 0 14 0 0
Total 14 0 0 14 0 0
[cosy11@oswrk121 nice-simple]$ condor_q
-- Submitter: oswrk121.lns.mit.edu : <10.200.60.19:9916?addrs=10.200.60.19-9916> : oswrk121.lns.mit.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
5.0 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.1 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.2 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.3 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.4 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.5 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.6 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.7 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.8 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.9 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.10 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
5.11 cosy11 7/4 12:34 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
d)=================
SUGGESTION: add = D_NETWORK D_FULLDEBUG D_HOSTNAME and report logs
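Concretely, that is one extra line in the local config on each VM (the sketch below assumes the pool-wide ALL_DEBUG knob; a per-daemon knob such as STARTD_DEBUG takes the same values):
# /etc/condor/condor_config.local on both oswrk121 and oswrk122
ALL_DEBUG = D_NETWORK D_FULLDEBUG D_HOSTNAME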
For the master (which holds 6 job slots) I made this change at:
[root@oswrk121 ~]# date
Sat Jul 4 12:49:34 EDT 2015
[root@oswrk121 ~]# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
For the worker (IP=122) I did the same:
[root@oswrk122 ~]# date
Sat Jul 4 12:51:58 EDT 2015
[root@oswrk122 ~]# service condor restart
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
Next, I submitted 12 jobs, and they are stuck waiting:
[cosy11@oswrk121 nice-simple]$ date
Sat Jul 4 12:52:40 EDT 2015
[cosy11@oswrk121 nice-simple]$ condor_submit script_oneA.condor
Submitting job(s)............
12 job(s) submitted to cluster 6.
[cosy11@oswrk121 nice-simple]$ condor_q
-- Submitter: oswrk121.lns.mit.edu : <10.200.60.19:10119?addrs=10.200.60.19-10119> : oswrk121.lns.mit.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
6.0 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.1 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.2 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.3 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.4 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.5 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.6 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.7 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.8 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.9 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.10 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
6.11 cosy11 7/4 12:52 0+00:00:00 I 0 0.0 oneA_job.sh A222,o
12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended
I copied all condor log files from both VMs:
$ scp -rp root@xxxxxxxxxxxxxxxxxxxx:/var/log/condor condor-122
$ scp -rp root@xxxxxxxxxxxxxxxxxxxx:/var/log/condor condor-121
and posted them here:
https://www.dropbox.com/sh/gy16cqxgm1vbvj3/AAAn-gioo-BIY0EdCF9ML4eQa?dl=0
I hope you can figure out what I need to change to get my condor jobs to start.
Thanks for looking into my issue,
Jan