Re: [HTCondor-users] Going from Condor 7.7 to HTCondor 8.8
- Date: Fri, 24 May 2019 13:46:01 +0000
- From: Zach Miller <zmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Going from Condor 7.7 to HTCondor 8.8
Hi William,
Running "condor_submit -debug ..." shows you the client side of the conversation. The other side would be in the SchedLog file and will probably explain why the SchedD appears to be closing the connection during submit. It could be due to a permissions issue. You may need to set new ALLOW_WRITE or ALLOW_* parameters since some of the defaults may be different for that big of a version jump.
To get more useful information in the SchedLog, you may need to set "SCHEDD_DEBUG=D_ALL" in your condor_config, perform a condor_reconfig, then repeat the test, and then see if anything in the log jumps out. (PERMISSION DENIED, perhaps?)
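As a rough sketch of that last step, a small shell helper can pull the likely suspects out of the log once the test has been repeated. The function name scan_schedlog and the two search patterns are illustrative assumptions, not an exhaustive list of what the SchedD may report.

```shell
# Hypothetical helper: print (with line numbers) the SchedLog lines that
# commonly point at a security or permission failure.
# The pattern list is an assumption; extend it as needed.
scan_schedlog() {
    grep -n -E 'PERMISSION DENIED|DC_AUTHENTICATE' "$1"
}

# Typical use, after setting SCHEDD_DEBUG = D_ALL in condor_config,
# running condor_reconfig, and repeating the condor_submit test:
#   scan_schedlog "$(condor_config_val SCHEDD_LOG)"
```

With D_ALL the log grows quickly, so filtering first and then reading the surrounding lines by number is usually easier than scrolling through the whole file.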
Feel free to send it to me offline and I'd be happy to take a look because, as you said, it's an insane amount of information, especially if you don't know what you are looking for. If you are going to do that, you could also attach your config files or the output of "condor_config_val -dump".
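If you want a quick look at the authorization settings yourself before sending the full dump, something like the following may help. It is only a sketch: the function name show_auth_settings is made up for illustration, and it assumes the dump prints one "NAME = value" setting per line.

```shell
# Hypothetical helper: keep only the authorization-related settings
# (ALLOW_*, DENY_*, and the older HOSTALLOW_*/HOSTDENY_* names) from a
# config dump read on stdin. Assumes "NAME = value" lines.
show_auth_settings() {
    grep -E '^(ALLOW|DENY|HOSTALLOW|HOSTDENY)_[A-Z_]+[[:space:]]*='
}

# Typical use on the submit machine:
#   condor_config_val -dump | show_auth_settings
```

Comparing that short list against what the old pool used is one way to spot a default that changed between versions.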
Thanks!
Cheers,
-zach
On 5/23/19, 3:23 PM, "HTCondor-users on behalf of William Seligman" <htcondor-users-bounces@xxxxxxxxxxx on behalf of seligman@xxxxxxxxxxxxxxxxxx> wrote:
Background: I'm the sysadmin of a small CentOS 6 computing farm. For years our
small condor pool was running Condor 7.7; higher versions offered no new
features we needed. Then a user required a new (unrelated) software
installation with which the old CentOS 5 Condor 7.7 libraries were incompatible,
and they requested I upgrade to HTCondor 8.8.
From that point until now, I have not been able to get HTCondor 8.8 to fully
run on the farm. My debugging steps included erasing the condor_config* files
and replacing them with those from the RPMs and completely wiping the contents
of LOCAL_DIR.
Where I'm at now: Although the condor services start up properly, I can't submit
any jobs. The error is:
# condor_submit myfile.cmd
Submitting job(s)
ERROR: Failed to connect to local queue manager
SECMAN:2007:Failed to end classad message.
The results of web searches on this error have not helped. For the record:
- I've followed the instructions at
<https://lists.cs.wisc.edu/archive/htcondor-users/2008-March/msg00178.shtml>
multiple times. Since I had started with a fresh LOCAL_DIR, the file
LOCAL_DIR/spool/job_queue.log had no invalid entries, but I gave it a try anyway.
- At present, the users are not submitting any condor jobs, so schedd is not busy.
- Schedd is running:
# ps -elf | grep schedd
4 S condor 60019 59973 0 80 0 - 13065 poll_s May22 ? 00:00:07
condor_schedd -f
- The firewall is off. Neither iptables nor netfilter is running. (Our site has
a Cisco firewall that I've configured to block off port 9618 from the outside, so
I'm concerned.)
- nmap tells me that port 9618 on the CONDOR_HOST is open.
- The only error in SchedLog is
DC_AUTHENTICATE: Unable to reconcile!
- I turned on debugging in condor_config.local:
TOOL_DEBUG = D_ALL
SUBMIT_DEBUG = D_ALL
and ran the job with
# condor_submit -debug myfile.cmd
I can post the results on request. I'm no expert, but the relevant lines appear
to be:
05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN: command 1112
QMGMT_WRITE_CMD to schedd at <129.236.252.84:9618> from TCP port 19038 (blocking).
05/23/19 15:57:02 (fd:5) (pid:863797) (D_SECURITY) SECMAN:: default CLIENT
meths: FS,KERBEROS,GSI,CLAIMTOBE
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_write(fd=4 schedd at
<129.236.252.84:9618>,,size=416,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) condor_read(fd=4 schedd at
<129.236.252.84:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
05/23/19 15:57:02 (fd:5) (pid:863797) (D_NETWORK) Stream::get(int) failed to read
padding
05/23/19 15:57:02 (fd:5) (pid:863797) (D_ALWAYS) SECMAN: no classad from
server, failing
- The only non-default lines in the condor_config file are:
BIND_ALL_INTERFACES = TRUE
SEC_DEFAULT_AUTHENTICATION = NEVER
Is there anything else I can do?
Thanks!