Hi Pascal,
Could you set D_SECURITY:2 in STARTD_DEBUG and send us the resulting StartLog? That will give us much more information about what is happening. It looks like security problems are arising from such a large version difference.
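For reference, a minimal sketch of that change in the executor's local configuration file. It assumes the 8.0.4 startd accepts the D_CATEGORY:2 verbosity syntax, which is worth verifying on a release that old:

```
# Sketch only: raise security debugging for the startd.
# Assumes no other STARTD_DEBUG setting needs to be preserved.
STARTD_DEBUG = D_SECURITY:2
```

After the edit, running condor_reconfig (or condor_restart, which is reported to work on that machine) should make the startd pick up the new setting, and the extra messages should then show up in the StartLog.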
Thanks,
Joe Reuss
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Pascal Schweizer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, May 22, 2024 4:40 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx>
Subject: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

Hi
We have a submitter node running Condor 23.0.4 and are trying to run jobs on an old Windows XP machine running Condor 8.0.4 (please don’t ask why). We’re using 8.0.4 because that’s the most recent binary that we managed to get running on the Windows XP machine. Other executor nodes in the pool are running Condor 8.8 or newer without any problems.
Our problem with the 8.0.4 machine is that it’s not running any jobs, even though they match. The machine is listed by condor_status, and commands like condor_restart also work. But when we submit a job for that specific machine, it doesn’t get picked up. As a general question: are Condor 8.0.4 and 23.0.4 compatible, and should this setup theoretically work if configured correctly?
Here are some logs/errors we see every few minutes (each negotiation cycle) after submitting a job for this machine. It looks like a communication/authentication error. Does anyone know what could be causing them? Security-wise, we’re using a very basic config that shouldn’t cause any problems:

use SECURITY : HOST_BASED
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_ADMINISTRATOR = *
--- Executor Logs ---

StartLog:
DC_AUTHENTICATE: attempt to open invalid session <192.168.1.157:1076>#1716304270#1, failing; this session was requested by <192.168.1.30:64664> with return address <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>

NegotiatorLog:
---------- Started Negotiation Cycle ----------
Phase 1: Obtaining ads from collector ...
Getting Scheduler, Submitter and Machine ads ...
condor_read() failed: recv(fd=448) returned -1, errno = 10054 , reading 5 bytes from collector at <192.168.1.30:9618>.
IO: Failed to read packet header
Couldn't fetch ads: communication error
Aborting negotiation cycle
---------- Started Negotiation Cycle ----------
Phase 1: Obtaining ads from collector ...
Getting Scheduler, Submitter and Machine ads ...
Sorting 37 ads ...
Getting startd private ads ...
condor_write(): Socket closed when trying to write 87 bytes to collector at <192.168.1.30:9618>, fd is 580
Buf::write(): condor_write() failed
Couldn't fetch ads: communication error
Aborting negotiation cycle

--- Submitter Logs ---

SchedLog:
(pid:6336) Negotiating for owner: a@submitter
(pid:6336) condor_read(): Socket closed abnormally when trying to read 5 bytes from startd slot1@executor <192.168.1.157:1587> for p, errno=10054
(pid:6336) Response problem from startd when requesting claim slot1@executor <192.168.1.157:1587> for p 2071.0.
(pid:6336) Failed to send REQUEST_CLAIM to startd slot1@executor <192.168.1.157:1587> for p: CEDAR:6004:failed reading from socket
(pid:6336) Match record (slot1@executor <192.168.1.157:1587> for p, 2071.0) deleted

NegotiatorLog:
---------- Started Negotiation Cycle ----------
Phase 1: Obtaining ads from collector ...
Not considering preemption, therefore constraining idle machines with ifThenElse((State == "Claimed"&&PartitionableSlot=!=true),"Name MyType State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","")
Getting startd private ads ...
Getting Scheduler, Submitter and Machine ads ...
Sorting 19 ads ...
Got ads: 19 public and 17 private
Public ads include 2 submitter, 17 startd
Phase 2: Performing accounting ...
Phase 3: Sorting submitter ads by priority ...
Starting prefetch round; 2 potential prefetches to do.
Starting prefetch negotiation for p@submitter.
Got NO_MORE_JOBS; schedd has no more requests
Starting prefetch negotiation for a@submitter.
Got NO_MORE_JOBS; schedd has no more requests
Prefetch summary: 2 attempted, 2 successful.
Phase 4.1: Negotiating with schedds ...
Negotiating with p@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>
0 seconds so far for this submitter
0 seconds so far for this schedd
Request 02071.00000: autocluster 487 (request count 1 of 15)
Matched 2071.0 p@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76> preempting none <192.168.1.157:1587> slot1@executor
Successfully matched with slot1@executor
Request 02071.00000: autocluster 487 (request count 2 of 15)
Rejected 2071.0 p@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>: no match found
Negotiating with a@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>
0 seconds so far for this submitter
0 seconds so far for this schedd
Reached submitter resource limit: 0.000000 ... stopping <------- don’t think this has anything to do with this problem, but we’re not seeing this line for jobs submitted for other nodes and don’t know what it means
Starting prefetch round; 1 potential prefetches to do.
Starting prefetch negotiation for a@submitter.
Got NO_MORE_JOBS; schedd has no more requests
Prefetch summary: 1 attempted, 1 successful.
Phase 4.2: Negotiating with schedds ...
Negotiating with a@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>
0 seconds so far for this submitter
0 seconds so far for this schedd
Request 02069.00624: autocluster 1 (request count 1 of 142)
Rejected 2069.624 a@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>: no match found
negotiateWithGroup resources used submitterAds length 0
---------- Finished Negotiation Cycle ----------

Regards,
Pascal