I inherited a cluster of Condor machines. I don't know anything about Condor, and I have a mess on my hands.
Is there an easy setup guide for a simple, no-nonsense cluster with just the basics?
We have 3 Windows hosts and a Linux host, and I'm having all kinds of issues. Some of them I have solved; others still exist.
1) Can't submit jobs on Linux:
rholloway@rebelbase:~$ condor_submit submit
Submitting job(s)
ERROR: Failed to connect to local queue manager
CEDAR:6001:Failed to connect to <127.0.1.1:60211>
2) Jobs get submitted to the cluster and then show up as "held" and never do anything.
3) I get all kinds of errors in the Collector Log on what is supposed to be the master:
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129419:23, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:54 DC_AUTHENTICATE: attempt to open invalid session Dagobah:1228:1333129427:24, failing.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
03/30 12:55:56 Failed to send DC_INVALIDATE_KEY to daemon at <127.0.1.1:53521>: SECMAN:2003:TCP connection to daemon at <127.0.1.1:53521> failed.
But I have no problems joining other machines to the cluster.
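A side note on the 127.0.1.1 address that appears in the errors above: it is a loopback address, and Debian-style installs often map the machine's own hostname to it in /etc/hosts. Whether that is what is happening on this host is an assumption, not something I have confirmed. A minimal Python sketch to check what the local hostname resolves to (the function names are made up for illustration and have nothing to do with Condor itself):

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the address is a loopback address (127.0.0.0/8 or ::1)."""
    # Drop an IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_loopback


def resolved_addresses(hostname):
    """Return the distinct addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical check: run this on the host that fails to submit jobs.
    name = socket.gethostname()
    for addr in resolved_addresses(name):
        kind = "loopback" if is_loopback(addr) else "non-loopback"
        print(f"{name} -> {addr} ({kind})")
```

If the hostname comes back only as a loopback address, that might be worth mentioning alongside the logs, since other machines cannot reach a daemon advertised at 127.0.1.1.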
And if there are any contractors out there who do this for a living, I'll even pay to have someone fix the environment. We just need it to work.
If I shouldn't be running this cluster at all and have no business doing it, I'll accept that as an answer as well.