From that point until now, I have not been able to get HTCondor 8.8 to fully run on the farm. My debugging steps included erasing the condor_config* files and replacing them with those from the RPMs and completely wiping the contents of LOCAL_DIR.
Where I'm at now: Although the condor services start up properly, I can't submit any jobs. The error is:
    # condor_submit myfile.cmd
    Submitting job(s)
    ERROR: Failed to connect to local queue manager
    SECMAN:2007:Failed to end classad message.

The results of web searches on this error have not helped. For the record:

- I've followed the instructions at <https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml> multiple times. Since I had started with a fresh LOCAL_DIR, the file LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.
- At present, the users are not submitting any condor jobs, so schedd is not busy.
- Schedd is running:

    # ps -elf | grep schedd
    4 S condor 60019 59973 0 80 0 - 13065 poll_s May22 ? 00:00:07 condor_schedd -f

- The firewall is off. Neither iptables nor netfilter is running. (Our site has a Cisco firewall that I've configured to block off port 9618 from the outside, so I'm concerned.)
- nmap tells me that port 9618 on the CONDOR_HOST is open.
- The only error in SchedLog is DC_AUTHENTICATE: Unable to reconcile!
- I turned on debugging in condor_config.local:

    TOOL_DEBUG = D_ALL
    SUBMIT_DEBUG = D_ALL

  and ran the job with

    # condor_submit -debug myfile.cmd

  I can post the results on request. I'm no expert, but the relevant lines appear to be:

    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112 QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT meths: FS,KERBEROS,GSI,CLAIMTOBE
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at <129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at <129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read padding
    05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from server, failing

- The only non-default lines in the condor_config file are:

    BIND_ALL_INTERFACES = TRUE
    SEC_DEFAULT_AUTHENTICATION = NEVER

Is there anything else I can do? Thanks!