
Re: [HTCondor-users] not starting jobs in condor ver 8.3.6



Hi Todd,
yes, I meant an issue with the transition from 8.3.5 to 8.3.6.

Answering in order; also, the config changes, once made, were not undone.

a)================
SUGGESTION: A quick thing to check would be if turning off TCP_FORWARDING_HOST allows the pool to work.

I first checked what condor sees for a regular user:

[cosy11@oswrk121 0x]$ condor_config_val -v TCP_FORWARDING_HOST
TCP_FORWARDING_HOST = 198.125.163.121
 # at: /etc/condor/condor_config.local, line 26
 # raw: TCP_FORWARDING_HOST = 198.125.163.121

Next, as root, I disabled TCP_FORWARDING_HOST in the config and restarted Condor:
service condor stop/start
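For reference, the edit was essentially commenting out (or removing) the line in the local config. A sketch, using the file and line reported by condor_config_val above:

   # in /etc/condor/condor_config.local (around line 26)
   #TCP_FORWARDING_HOST = 198.125.163.121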

Next I verified it did work:

[cosy11@oswrk121 0x]$ condor_config_val -v TCP_FORWARDING_HOST
Not defined: TCP_FORWARDING_HOST

Then I verified that the pool works, and it does:

[cosy11@oswrk121 0x]$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.030 2421  0+00:00:04
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:00:31
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:00:32
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:00:33
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:00:34
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:00:35
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        6     0       0         6       0          0

               Total        6     0       0         6       0          0

b)==============
SUGGESTION: You might also try adding the public IPs to the ALLOW_READ and ALLOW_WRITE settings. 

Done:
[root@oswrk121 ~]# condor_config_val -v ALLOW_WRITE 
ALLOW_WRITE = *.lns.mit.edu,10.200.60.*,198.125.163.121
 # at: /etc/condor/condor_config.local, line 34
 # raw: ALLOW_WRITE = *.lns.mit.edu,10.200.60.*,198.125.163.121

[root@oswrk121 ~]# condor_config_val -v ALLOW_READ 
ALLOW_READ = *.lns.mit.edu,10.200.60.*,198.125.163.121
 # at: /etc/condor/condor_config.local, line 33
 # raw: ALLOW_READ = *.lns.mit.edu,10.200.60.*,198.125.163.121


Note, access for the host was already granted via its DNS name, since:
[root@oswrk121 ~]# nslookup 198.125.163.121
Server:		10.200.60.21
Address:	10.200.60.21#53

121.163.125.198.in-addr.arpa	name = oswrk121.lns.mit.edu.

After this change I re-submitted the jobs and they all still sit idle, even though 6 job slots are open.
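(If it helps, I believe a command of the form

   condor_q -better-analyze <jobID>

should report why a given idle job is not being matched; I can post that output too if it is useful.)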

c)============
SUGGESTION: starting up a fourth VM with 8.3.6 installed to check if the two 8.3.6 machines can communicate.

Started a new VM with IP=122 running 8.3.6 and configured it as a worker reporting to IP=121.
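Roughly, the worker's local config just points it at the master; a sketch with the standard knobs (the actual file may differ):

   CONDOR_HOST = oswrk121.lns.mit.edu
   DAEMON_LIST = MASTER, STARTD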
The worker node (IP=122) is seen by the condor master (IP=121); the pool now has 14 job slots and the 12 jobs still await execution:
[cosy11@oswrk121 nice-simple]$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.010 2421  0+00:10:10
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:10:37
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:10:38
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:10:39
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:10:40
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 2421  0+00:10:41
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.010 1815  0+00:04:18
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:45
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:46
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:47
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:48
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:49
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:50
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1815  0+00:04:43
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX       14     0       0        14       0          0

               Total       14     0       0        14       0          0
[cosy11@oswrk121 nice-simple]$ condor_q


-- Submitter: oswrk121.lns.mit.edu : <10.200.60.19:9916?addrs=10.200.60.19-9916> : oswrk121.lns.mit.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   5.0   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.1   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.2   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.3   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.4   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.5   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.6   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.7   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.8   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.9   cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.10  cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   5.11  cosy11          7/4  12:34   0+00:00:00 I  0   0.0  oneA_job.sh A222,o

12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended


d)=================
SUGGESTION:  add  = D_NETWORK D_FULLDEBUG D_HOSTNAME and report logs
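The change is a single line in the local config; a sketch, assuming the knob whose name got cut off in the quote above is ALL_DEBUG:

   ALL_DEBUG = D_NETWORK D_FULLDEBUG D_HOSTNAME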

For the master (which holds the 6 job slots) I made this change at:
[root@oswrk121 ~]# date
Sat Jul  4 12:49:34 EDT 2015
[root@oswrk121 ~]# service  condor restart
Stopping Condor daemons:                                   [  OK  ]
Starting Condor daemons:                                   [  OK  ]

For the worker (IP=122) I did the same:
[root@oswrk122 ~]# date
Sat Jul  4 12:51:58 EDT 2015
[root@oswrk122 ~]# service  condor restart
Stopping Condor daemons:                                   [  OK  ]
Starting Condor daemons:                                   [  OK  ]

Next, I submitted 12 jobs and they are stuck waiting:
[cosy11@oswrk121 nice-simple]$ date
Sat Jul  4 12:52:40 EDT 2015
[cosy11@oswrk121 nice-simple]$ condor_submit script_oneA.condor 
Submitting job(s)............
12 job(s) submitted to cluster 6.
[cosy11@oswrk121 nice-simple]$ condor_q


-- Submitter: oswrk121.lns.mit.edu : <10.200.60.19:10119?addrs=10.200.60.19-10119> : oswrk121.lns.mit.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   6.0   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.1   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.2   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.3   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.4   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.5   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.6   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.7   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.8   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.9   cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.10  cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o
   6.11  cosy11          7/4  12:52   0+00:00:00 I  0   0.0  oneA_job.sh A222,o

12 jobs; 0 completed, 0 removed, 12 idle, 0 running, 0 held, 0 suspended


I copied all condor log files from both VMs:
$ scp -rp root@xxxxxxxxxxxxxxxxxxxx:/var/log/condor condor-122
$ scp -rp root@xxxxxxxxxxxxxxxxxxxx:/var/log/condor condor-121 
and posted them here:
https://www.dropbox.com/sh/gy16cqxgm1vbvj3/AAAn-gioo-BIY0EdCF9ML4eQa?dl=0


I hope you can figure out what I need to change to make my condor jobs start.
Thanks for looking into my issue.
Jan