
[HTCondor-users] Going from Condor 7.7 to HTCondor 8.8



Background: I'm the sysadmin of a small CentOS 6 computing farm. For years our small condor pool ran Condor 7.7, since later versions offered no new features we needed. Then a user needed a new (unrelated) software installation that was incompatible with the old CentOS 5 Condor 7.7 libraries, and asked me to upgrade to HTCondor 8.8.
From that point until now, I have not been able to get HTCondor 8.8 to fully 
run on the farm. My debugging steps included erasing the condor_config* files 
and replacing them with those from the RPMs and completely wiping the contents 
of LOCAL_DIR.
Where I'm at now: Although the condor services start up properly, I can't submit 
any jobs. The error is:
# condor_submit myfile.cmd
Submitting job(s)
ERROR: Failed to connect to local queue manager
SECMAN:2007:Failed to end classad message.

The results of web searches on this error have not helped. For the record:

- I've followed the instructions at <https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml> multiple times. Since I had started with a fresh LOCAL_DIR, the file LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.
- At present, the users are not submitting any condor jobs, so schedd is not busy.

- Schedd is running:

# ps -elf | grep schedd
4 S condor 60019 59973 0 80 0 - 13065 poll_s May22 ? 00:00:07 condor_schedd -f
- The firewall is off. Neither iptables nor netfilter is running. (Our site has 
a Cisco firewall that I've configured to block off port 9618 from the outside, so 
I'm concerned.)
- nmap tells me that port 9618 on the CONDOR_HOST is open.

- The only error in SchedLog is
DC_AUTHENTICATE: Unable to reconcile!

- I turned on debugging in condor_config.local:
  TOOL_DEBUG = D_ALL
  SUBMIT_DEBUG = D_ALL

and ran the job with
# condor_submit -debug myfile.cmd

I can post the results on request. I'm no expert, but the relevant lines appear to be:
05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112 QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT meths: FS,KERBEROS,GSI,CLAIMTOBE
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at <129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at <129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read padding
05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from server, failing

- The only non-default lines in the condor_config file are:

BIND_ALL_INTERFACES = TRUE
SEC_DEFAULT_AUTHENTICATION = NEVER
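
For anyone who wants to pick the security-related lines out of the full -debug output the same way I did for the excerpt above, here is a minimal sketch of a filter. It assumes each log entry sits on a single line with the layout seen in the excerpt (date, time, (fd:N), (pid:N), (D_CATEGORY), message). The function name filter_debug_lines and the default choice of categories are my own, not anything HTCondor provides.

```python
import re

# Layout assumed from the excerpt above:
#   MM/DD/YY HH:MM:SS (fd:N) (pid:N) (D_CATEGORY) message
LOG_LINE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\(fd:(?P<fd>\d+)\) \(pid:(?P<pid>\d+)\) "
    r"\((?P<category>D_[A-Z_]+)\) (?P<message>.*)$"
)


def filter_debug_lines(lines, categories=("D_SECURITY", "D_ALWAYS")):
    """Return (timestamp, category, message) tuples for the chosen categories.

    Lines that do not match the assumed layout are skipped silently.
    """
    selected = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("category") in categories:
            selected.append(
                (match.group("stamp"), match.group("category"), match.group("message"))
            )
    return selected
```

Saving the debug output to a file and passing the open file object to filter_debug_lines returns only the D_SECURITY and D_ALWAYS entries; passing categories=("D_NETWORK",) would show the condor_read/condor_write calls instead.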


Is there anything else I can do?

Thanks!
