[Condor-users] Condor Master dying



Hi,

Quite strange: I had all daemons running on all hosts on Friday, but
today I found just 14 VMs, so I restarted Condor on most hosts and
found that it is not able to start.

If I run condor_master (I have my own start script) it does not start,
and it does not write anything to the log:

# ls -lsa
total 28
   4 drwxrwxr-x    3 cdfcaf   cdfcaf       4096 Dec  4 16:50 .
   4 drwxrwxr-x    5 cdfcaf   cdfcaf       4096 Feb 10  2006 ..
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 Dec  4 16:51 dprintf_failure.MASTER
   4 -rw-r--r--    1 cdfcaf   cdfcaf         93 Dec  1 12:37 .master_address
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 Dec  4 16:51 MasterLog
   4 drwxr-xr-x    2 root     root         4096 Dec  4 15:33 old
   4 -rw-r--r--    1 cdfcaf   cdfcaf         93 Dec  1 12:37 .startd_address
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 Dec  1 20:25 .startd_claim_id.vm1
   0 -rw-r--r--    1 cdfcaf   cdfcaf          0 Dec  1 20:25 .startd_claim_id.vm2
   4 -rw-r--r--    1 cdfcaf   cdfcaf         36 Dec  1 12:54 .startd_claim_id.vm3
   4 -rw-r--r--    1 cdfcaf   cdfcaf         36 Dec  1 12:54 .startd_claim_id.vm4

# cat .master_address
<193.146.197.65:32807>
$CondorVersion: 6.8.2 Oct 12 2006 $
$CondorPlatform: I386-LINUX_RH9 $

[root@cdf-bcn015 log]# cat .startd_claim_id.vm3
<193.146.197.65:32808>#1164973042#6

[root@cdf-bcn015 log]# cat .startd_claim_id.vm4
<193.146.197.65:32808>#1164973042#7

And the return code is 44.

I've rebooted, but the problem persists.

I can see no difference between the hosts which are running Condor and
those where condor_master refuses to start.
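
Since both MasterLog and dprintf_failure.MASTER are 0 bytes, one thing
that may be worth ruling out is that the master simply cannot write to
the log directory (a full disk, or wrong ownership). This is only a
guess from the listing above, not a confirmed cause. A minimal sketch of
such a check follows; check_log_dir is a hypothetical helper name, not a
Condor tool, and it should be run as the user the daemons run as
(cdfcaf here):

```shell
# check_log_dir DIR
# Reports free space for DIR and tests whether the current user can
# actually write a non-empty file there.
# Returns 0 if writable, 1 if DIR is missing, 2 if the write fails.
check_log_dir() {
    dir="$1"

    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi

    # Free space and usage of the filesystem holding DIR
    # (-P gives one line per filesystem, so awk can pick line 2).
    df -P "$dir" | awk 'NR==2 {print "free KB: " $4 ", used: " $5}'

    # Try to write a small probe file; on a full disk the file may be
    # created but stay empty, so also check that it has a non-zero size.
    probe="$dir/.write_probe.$$"
    if echo test > "$probe" 2>/dev/null && [ -s "$probe" ]; then
        rm -f "$probe"
        echo "writable: $dir"
        return 0
    fi

    rm -f "$probe"
    echo "NOT writable: $dir"
    return 2
}
```

It would be called with the Condor log directory as its argument, for
example check_log_dir "$(condor_config_val LOG)", on both a working
host and a failing one so the results can be compared.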

Any idea why the daemons are dying?

Thanks in advance,
Arnau