
Re: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility



It was the exact opposite. We had not set run_as_owner=true before, so setting it to false (which is the default) did nothing. But we actually need the jobs to run as the "normal" Windows user to access network shares. We accomplished that by running the condor-service as that specific user (without setting run_as_owner). It looks like the setting for which user the Windows service runs as was lost during a reinstall.

Thanks again for your help.

 

- Pascal

 

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Wednesday, 5 June 2024 17:15
To: Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

 

The SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION configuration knob controls how the HTCondor daemons on different machines authenticate with each other. The error 1326 in CreateProcess comes from HTCondor's attempt to launch the job under the user's OS account. It means that HTCondor doesn't have the correct user password to provide to the OS.

 

You either need to set run_as_owner=false in your submit files or configure a condor_credd that has the user passwords, which the execute node can query.
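
As a rough aid for reading the logs quoted further down this thread, here is a minimal Python sketch that picks out every `errno` in a log excerpt and attaches the Windows meaning of the two codes seen here (1326 and 10054). The function name `explain_errnos` and the tiny lookup table are illustrative assumptions, not part of HTCondor, and the table is far from complete.

```python
import re

# Meanings of the two Windows error codes that appear in this thread.
# Deliberately tiny; this is not a complete list of Windows error codes.
WINDOWS_ERRORS = {
    1326: "ERROR_LOGON_FAILURE: the user name or password is incorrect",
    10054: "WSAECONNRESET: connection reset by the remote host",
}

# Matches both "errno=1326" and "errno = 10054".
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def explain_errnos(log_text):
    """Return (line, code, meaning) for every log line that mentions an errno."""
    findings = []
    for line in log_text.splitlines():
        match = ERRNO_RE.search(line)
        if match:
            code = int(match.group(1))
            meaning = WINDOWS_ERRORS.get(code, "unknown code")
            findings.append((line.strip(), code, meaning))
    return findings


# Lines copied from the StarterLog and StartLog excerpts in this thread.
sample = """\
06/05/24 15:59:08 Running job as user condor-reuse-slot1_1
06/05/24 15:59:08 Create_Process: CreateProcess failed, errno=1326
06/05/24 15:59:17 condor_read() failed: recv(fd=140) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:3485>.
"""

for line, code, meaning in explain_errnos(sample):
    print(code, "->", meaning)
```

The lookup only labels the codes; it does not say which account's credentials were rejected, so the "Running job as user" line just before the failure is still worth reading alongside it.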

 

 - Jaime



On Jun 5, 2024, at 9:51 AM, Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx> wrote:

 

Setting the flag on the submitter to false did get rid of the DC_AUTHENTICATE error and it actually tries to start the job now. But then it fails with "CreateProcess failed, errno=1326", which means username/password are incorrect. This doesn't make much sense to me, as we just disabled password authentication with this flag, right?

This error only occurs on Windows XP. Other executors (Windows 7/10/11, MacOS) still work as expected. Setting the flag to true/false on the Windows XP executor doesn't change anything.

Any idea what's causing this?

 

You also mentioned that "all execution nodes will need to be configured to authenticate and authorize the submit machine and vice versa".

We're using a very basic security config (same for submitter & executor), so I don't think that's the issue, but I can't say for sure.

  use SECURITY : HOST_BASED

  ALLOW_READ = *

  ALLOW_WRITE = *

  ALLOW_ADMINISTRATOR = *

  ALLOW_DAEMON = $(ALLOW_WRITE)

 

 

--- job-log on submitter ---

001 (2606.000.000) 2024-06-05 15:59:08 Job executing on host: <192.168.1.157:3361>

      SlotName: slot1_1@xxxxxxxxxxxxxxxxxxxxxxx

...

022 (2606.000.000) 2024-06-05 15:59:17 Job disconnected, attempting to reconnect

    Socket between submit and execute hosts closed unexpectedly

    Trying to reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxx <192.168.1.157:3361>

...

024 (2606.000.000) 2024-06-05 15:59:26 Job reconnection failed

    Job not found at execution machine

    Can not reconnect to slot1@xxxxxxxxxxxxxxxxxxxxxxx, rescheduling job
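
The event sequence in the job log above (001 executing, 022 disconnected, 024 reconnection failed) can be extracted with a short script. This is a minimal Python sketch that assumes event headers look exactly like the ones quoted here (`NNN (cluster.proc.subproc) date time message`); the name `parse_events` is made up for illustration, and indented detail lines are simply skipped.

```python
import re

# Event header format as it appears in the job log quoted in this thread:
# "NNN (cluster.proc.subproc) YYYY-MM-DD HH:MM:SS message"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)


def parse_events(log_text):
    """Return (event code, job id, time, message) for each event header line."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            job_id = f"{int(match.group('cluster'))}.{int(match.group('proc'))}"
            events.append(
                (match.group("code"), job_id, match.group("time"), match.group("message"))
            )
    return events


# Excerpt copied from the job log above; indented lines are event details.
sample = """\
001 (2606.000.000) 2024-06-05 15:59:08 Job executing on host: <192.168.1.157:3361>
...
022 (2606.000.000) 2024-06-05 15:59:17 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
...
024 (2606.000.000) 2024-06-05 15:59:26 Job reconnection failed
    Job not found at execution machine
"""

for code, job_id, time, message in parse_events(sample):
    print(code, job_id, time, message)
```

Lining these timestamps up against the StarterLog and StartLog entries for the same job makes it easier to see that the disconnect follows the starter's exit.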

 

 

--- StarterLog.Slot1_1 ---

06/05/24 15:59:08 Job 2606.0 set to execute immediately

06/05/24 15:59:08 Starting a VANILLA universe job with ID: 2606.0

06/05/24 15:59:08 Tracking process family by login "condor-reuse-slot1_1"

06/05/24 15:59:08 IWD: C:\condor\execute\dir_2996

06/05/24 15:59:08 Output file: C:\condor\execute\dir_2996\_condor_stdout

06/05/24 15:59:08 Error file: C:\condor\execute\dir_2996\_condor_stderr

06/05/24 15:59:08 Renice expr "10" evaluated to 10

06/05/24 15:59:08 About to exec \\submitter\xyz.exe

06/05/24 15:59:08 Setting job's virtual memory rlimit to 0 megabytes

06/05/24 15:59:08 Running job as user condor-reuse-slot1_1

06/05/24 15:59:08 Create_Process: CreateProcess failed, errno=1326

06/05/24 15:59:08 Create_Process(\\submitter\xyz.exe, ...) failed:

06/05/24 15:59:08 Failed to start job, exiting

06/05/24 15:59:08 ShutdownFast all jobs.

 

 

--- StartLog (each block appears 8 times for slot1_1 to slot1_8) ---

06/05/24 15:59:07 slot1_1: New machine resource of type -1 allocated

06/05/24 15:59:07 slot1_1: Request accepted.

06/05/24 15:59:07 slot1_1: Remote owner is p@submitter

06/05/24 15:59:07 slot1_1: State change: claiming protocol successful

06/05/24 15:59:07 slot1_1: Changing state: Owner -> Claimed

 

06/05/24 15:59:07 slot1_1: Got activate_claim request from shadow (192.168.1.30)

06/05/24 15:59:07 slot1_1: Remote job ID is 2606.0

06/05/24 15:59:07 slot1_1: Got universe "VANILLA" (5) from request classad

06/05/24 15:59:07 slot1_1: State change: claim-activation protocol successful

06/05/24 15:59:07 slot1_1: Changing activity: Idle -> Busy

 

06/05/24 15:59:17 condor_read() failed: recv(fd=140) returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:3485>.

06/05/24 15:59:17 IO: Failed to read packet header

 

06/05/24 15:59:17 Starter pid 2996 exited with status 0

06/05/24 15:59:17 slot1_1: State change: starter exited

06/05/24 15:59:17 slot1_1: Changing activity: Busy -> Idle

 

06/05/24 15:59:26 Aborting CA_LOCATE_STARTER

06/05/24 15:59:26 ClaimId (<192.168.1.157:3361>#1717595809#1#0367f66cdfb3dda89a8049c47dc6cb1d9697c785) and GlobalJobId (a@submitter#2606.0#1717595944 ) not found

 

- Pascal

 

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Monday, 3 June 2024 21:24
To: Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

 

This has turned up a bug in the condor_schedd code, which we will fix for an upcoming release.

 

As a workaround, you can try setting SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION=False on your submit machine. This will mean that all execution nodes will need to be configured to authenticate and authorize the submit machine and vice versa.
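
Since the knob has to be set on the right machine, a quick sanity check of a configuration file can help. The following Python sketch is only an approximation under stated assumptions: it treats knob names as case-insensitive and lets the last assignment win, and it ignores `use` metaknobs, included files, conditionals and line continuations. The helper name `effective_value` is hypothetical; running `condor_config_val` on each machine remains the authoritative way to see the value the daemons actually use.

```python
def effective_value(config_text, knob):
    """Return the last value assigned to `knob` in the text, or None if unset.

    Assumption: simple "NAME = value" lines only; anything else is skipped.
    """
    value = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank line, comment, or a line such as "use SECURITY : ..."
        name, _, rhs = line.partition("=")
        if name.strip().lower() == knob.lower():
            value = rhs.strip()
    return value


# The security settings quoted in this thread plus the suggested workaround.
sample_config = """\
use SECURITY : HOST_BASED
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_ADMINISTRATOR = *
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = False
"""

print(effective_value(sample_config, "SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION"))
print(effective_value(sample_config, "ALLOW_DAEMON"))
```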

 

 - Jaime




On Jun 3, 2024, at 6:39 AM, Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx> wrote:

 

Hi Jaime

 

Thanks for the suggestion. I added the setting to the global condor.conf of the XP machine, restarted it and submitted a new job, but nothing changed. We're still getting the same error: "DC_AUTHENTICATE: attempt to open invalid session ...".

 

06/03/24 13:30:41 DC_AUTHENTICATE: received UDP packet from <192.168.1.157:1645>.

06/03/24 13:30:41 DC_AUTHENTICATE: received DC_AUTHENTICATE from <192.168.1.157:1645>

06/03/24 13:30:41 DC_AUTHENTICATE: received following ClassAd:

User = "unauthenticated@unmapped"

AuthMethodsList = "NTSSPI,KERBEROS"

MyRemoteUserName = "unauthenticated@unmapped"

UseSession = "YES"

Integrity = "NO"

CurrentTime = time()

AuthCommand = 427

RemoteVersion = "$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $"

ServerCommandSock = "<192.168.1.157:1155>"

Subsystem = "KBDD"

Command = 427

SessionDuration = "86400"

Encryption = "NO"

Authentication = "NO"

SessionLease = 3600

ValidCommands = "60002,60003,60011,60014,427"

OutgoingNegotiation = "PREFERRED"

CryptoMethods = "3DES,BLOWFISH"

Enact = "YES"

AuthMethods = "NTSSPI"

Sid = "executor:128:1717413696:1"

06/03/24 13:30:41 DC_AUTHENTICATE: resuming session id executor:128:1717413696:1 with return address <192.168.1.157:1155>:

06/03/24 13:30:41 DC_AUTHENTICATE: Cached Session:

Enact = "YES"

Encryption = "NO"

Integrity = "NO"

AuthMethodsList = "NTSSPI,KERBEROS"

ServerPid = 2864

AuthMethods = "NTSSPI"

Sid = "executor:128:1717413696:1"

Subsystem = "KBDD"

CryptoMethods = "3DES,BLOWFISH"

SessionDuration = "86400"

SessionLease = 3600

Authentication = "NO"

RemoteVersion = "$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $"

ServerCommandSock = "<192.168.1.157:1155>"

CurrentTime = time()

User = "unauthenticated@unmapped"

ValidCommands = "60002,60003,60011,60014,427"

06/03/24 13:30:41 DC_AUTHENTICATE: Success.

06/03/24 13:30:41 PERMISSION GRANTED to unauthenticated@unmapped from host 192.168.1.157 for command 427 (X_EVENT_NOTIFICATION), access level ALLOW: reason: 

06/03/24 13:30:41 DC_AUTHENTICATE: received DC_AUTHENTICATE from <192.168.1.30:55133>

06/03/24 13:30:41 DC_AUTHENTICATE: received following ClassAd:

Sid = "<192.168.1.157:1222>#1717413607#1"

RemoteVersion = "$CondorVersion: 23.0.4 2024-02-08 BuildID: 712251 $"

CryptoMethods = "AES"

UseSession = "YES"

ServerCommandSock = "<192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_6028_e840>"

ResumeResponse = false

Nonce = "+r15r8qQxCoUpWMHBWIJtN1f13fXS4cIOwzzysnPTQML"

ConnectSinful = "<192.168.1.157:1222>"

Command = 442

CurrentTime = time()

06/03/24 13:30:41 DC_AUTHENTICATE: attempt to open invalid session <192.168.1.157:1222>#1717413607#1, failing; this session was requested by <192.168.1.30:55133> with return address <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_6028_e840>

06/03/24 13:30:41 SECMAN: command 60014 DC_INVALIDATE_KEY to daemon at <192.168.1.30:9618> from TCP port 1647 (non-blocking, raw).

06/03/24 13:30:41 SECMAN: waiting for TCP connection to daemon at <192.168.1.30:9618>.

06/03/24 13:30:41 SECMAN: resuming command 60014 DC_INVALIDATE_KEY to daemon at <192.168.1.30:9618> from TCP port 1647 (non-blocking, raw).

06/03/24 13:30:41 SECMAN: no cached key for {<192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_6028_e840>,<60014>}.

06/03/24 13:30:41 SECMAN: Security Policy:

NewSession = "YES"

SessionDuration = "86400"

ServerPid = 128

Enact = "NO"

OutgoingNegotiation = "NEVER"

ParentUniqueID = "executor:1244:1717413606"

Encryption = "NEVER"

SessionLease = 3600

Authentication = "NEVER"

Integrity = "NEVER"

AuthMethods = "NTSSPI,KERBEROS"

Subsystem = "STARTD"

CryptoMethods = "3DES,BLOWFISH"

CurrentTime = time()

06/03/24 13:30:41 SECMAN: not negotiating, just sending command (60014)

06/03/24 13:30:41 Authorizing server 'unauthenticated@unmapped/192.168.1.30'.

06/03/24 13:30:41 Completed DC_INVALIDATE_KEY to daemon at <192.168.1.30:9618>
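
With D_SECURITY:2 enabled the StartLog gets long, so it can help to pull out just the rejected sessions. This is a minimal Python sketch that assumes the message wording matches the line in the log above exactly; `find_invalid_sessions` is an illustrative name, not an HTCondor tool.

```python
import re

# Matches the "invalid session" message as worded in the StartLog above.
INVALID_SESSION_RE = re.compile(
    r"attempt to open invalid session (?P<session>\S+), failing; "
    r"this session was requested by <(?P<requester>[^>]+)> "
    r"with return address <(?P<return_address>[^>]+)>"
)


def find_invalid_sessions(log_text):
    """Return one dict per rejected session: session id, requester, return address."""
    hits = []
    for line in log_text.splitlines():
        match = INVALID_SESSION_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# Line copied from the StartLog excerpt in this thread.
sample = (
    "06/03/24 13:30:41 DC_AUTHENTICATE: attempt to open invalid session "
    "<192.168.1.157:1222>#1717413607#1, failing; this session was requested by "
    "<192.168.1.30:55133> with return address "
    "<192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_6028_e840>\n"
)

for hit in find_invalid_sessions(sample):
    print(hit["session"], "requested by", hit["requester"])
```

The `sock=` part of the return address names the requesting daemon, which here is the schedd on the submit machine.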

 

Regards,

Pascal

 

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Friday, 31 May 2024 22:45
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

 

There are some strange things going on with the pre-negotiated security session during the schedd's attempt to start a job on this machine.

Try setting this on your old XP machine:

 

SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION = False

 

That will side-step the current failure indicated in the log.

 

 - Jaime





On May 27, 2024, at 4:37 AM, Pascal Schweizer via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

 

Hi Joe

 

I set STARTD_DEBUG = D_SECURITY:2, restarted condor and submitted a new job for that executor.

I attached the StartLog from when condor was restarted + ~3min.

 

Regards,

Pascal

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joe Reuss via HTCondor-users
Sent: Thursday, 23 May 2024 23:21
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Joe Reuss <jrreuss@xxxxxxxx>
Subject: Re: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

 

Hi Pascal,

 

Can you set STARTD_DEBUG = D_SECURITY:2 and send us the resulting StartLog? That will give us a lot more information about what is happening. It looks like some security issues are cropping up with a version difference this large.

 

Thanks,

Joe Reuss


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Pascal Schweizer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, May 22, 2024 4:40 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Pascal Schweizer <schweizer@xxxxxxxxxxxxxxx>
Subject: [HTCondor-users] Condor 8.0.4 & 23.0.4 compatibility

 

Hi

 

We have a submitter node running Condor 23.0.4 and are trying to run jobs on an old Windows XP machine running Condor 8.0.4 (please don't ask why).

We're using 8.0.4 because that's the most recent binary that we managed to get running on the Windows XP machine. Other executor nodes in the pool are running Condor 8.8 or newer without any problems.

 

Our problem with the 8.0.4 machine is that it's not running any jobs, even though they match. The machine is listed by condor_status, and commands like condor_restart also work. But when we submit a job for that specific machine, it doesn't get picked up.

As a general question: are Condor 8.0.4 and 23.0.4 compatible, and should this setup theoretically work if configured correctly?

 

Here are some logs/errors we see every few minutes (negotiation cycle) after submitting a job for this machine.

It looks like a communication/authentication error. Does anyone know what could be causing those?

Security-wise, we're using a very basic config that shouldn't cause any problems:

  use SECURITY : HOST_BASED

  ALLOW_READ = *

  ALLOW_WRITE = *

  ALLOW_ADMINISTRATOR = *

 

--- Executor Logs ---

StartLog:

    DC_AUTHENTICATE: attempt to open invalid session <192.168.1.157:1076>#1716304270#1, failing; this session was requested by <192.168.1.30:64664> with return address <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>

 

NegotiatorLog:

  ---------- Started Negotiation Cycle ----------

  Phase 1:  Obtaining ads from collector ...

    Getting Scheduler, Submitter and Machine ads ...

  condor_read() failed: recv(fd=448) returned -1, errno = 10054 , reading 5 bytes from collector at <192.168.1.30:9618>.

  IO: Failed to read packet header

  Couldn't fetch ads: communication error

  Aborting negotiation cycle

  ---------- Started Negotiation Cycle ----------

  Phase 1:  Obtaining ads from collector ...

    Getting Scheduler, Submitter and Machine ads ...

    Sorting 37 ads ...

    Getting startd private ads ...

  condor_write(): Socket closed when trying to write 87 bytes to collector at <192.168.1.30:9618>, fd is 580

  Buf::write(): condor_write() failed

  Couldn't fetch ads: communication error

  Aborting negotiation cycle

 

--- Submitter Logs ---

SchedLog:

  (pid:6336) Negotiating for owner: a@submitter

  (pid:6336) condor_read(): Socket closed abnormally when trying to read 5 bytes from startd slot1@executor <192.168.1.157:1587> for p, errno=10054

  (pid:6336) Response problem from startd when requesting claim slot1@executor <192.168.1.157:1587> for p 2071.0.

  (pid:6336) Failed to send REQUEST_CLAIM to startd slot1@executor <192.168.1.157:1587> for p: CEDAR:6004:failed reading from socket

  (pid:6336) Match record (slot1@executor <192.168.1.157:1587> for p, 2071.0) deleted

 

NegotiatorLog:

  ---------- Started Negotiation Cycle ----------

  Phase 1:  Obtaining ads from collector ...

  Not considering preemption, therefore constraining idle machines with ifThenElse((State == "Claimed"&&PartitionableSlot=!=true),"Name MyType State Activity StartdIpAddr AccountingGroup Owner RemoteUser Requirements SlotWeight ConcurrencyLimits","") 

    Getting startd private ads ...

    Getting Scheduler, Submitter and Machine ads ...

    Sorting 19 ads ...

  Got ads: 19 public and 17 private

  Public ads include 2 submitter, 17 startd

  Phase 2:  Performing accounting ...

  Phase 3:  Sorting submitter ads by priority ...

  Starting prefetch round; 2 potential prefetches to do.

  Starting prefetch negotiation for p@submitter.

      Got NO_MORE_JOBS;  schedd has no more requests

  Starting prefetch negotiation for a@submitter.

      Got NO_MORE_JOBS;  schedd has no more requests

  Prefetch summary: 2 attempted, 2 successful.

  Phase 4.1:  Negotiating with schedds ...

    Negotiating with p@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>

  0 seconds so far for this submitter

  0 seconds so far for this schedd

      Request 02071.00000: autocluster 487 (request count 1 of 15)

        Matched 2071.0 p@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76> preempting none <192.168.1.157:1587> slot1@executor

        Successfully matched with slot1@executor

      Request 02071.00000: autocluster 487 (request count 2 of 15)

        Rejected 2071.0 p@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>: no match found

    Negotiating with a@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>

  0 seconds so far for this submitter

  0 seconds so far for this schedd

      Reached submitter resource limit: 0.000000 ... stopping               <------- don't think this has anything to do with this problem, but we're not seeing this line for jobs submitted for other nodes and don't know what it means

  Starting prefetch round; 1 potential prefetches to do.

  Starting prefetch negotiation for a@submitter.

      Got NO_MORE_JOBS;  schedd has no more requests

  Prefetch summary: 1 attempted, 1 successful.

  Phase 4.2:  Negotiating with schedds ...

    Negotiating with a@submitter at <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>

  0 seconds so far for this submitter

  0 seconds so far for this schedd

      Request 02069.00624: autocluster 1 (request count 1 of 142)

        Rejected 2069.624 a@submitter <192.168.1.30:9618?addrs=192.168.1.30-9618&alias=submitter&noUDP&sock=schedd_5816_9e76>: no match found

   negotiateWithGroup resources used submitterAds length 0

  ---------- Finished Negotiation Cycle ----------

 

Regards,

Pascal
