
Re: [HTCondor-users] Unexpected job preemption on pslots



Hi everyone,

I might have found the problem: it's the authentication setup:
SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS, SSL

Kerberos is working fine, but after logging out of the submit node the credentials get destroyed. Since the shadow daemons run as the submitting users, the shadow then fails to authenticate itself to the execution hosts after a timeout, probably one of these:

root@msched:~# condor_config_val -dump | grep 3600
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 3600
ECRYPTFS_KEY_TIMEOUT = 3600
HOUR = 3600
MASTER_BACKOFF_CEILING = 3600
NOT_RESPONDING_TIMEOUT = 3600
SEC_CREDENTIAL_SWEEP_DELAY = 3600
SHADOW_WORKLIFE = 3600
TRANSFER_IO_REPORT_TIMESPANS = 1m:60 5m:300 1h:3600 1d:86400
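
For reference, a quick way to verify that the credential cache really disappears on logout (a rough sketch; the file-based cache and the paths are assumptions, check KRB5CCNAME on your submit node and adjust for KEYRING/KCM caches):

# as the submitting user, note the cache location and the ticket expiry
klist
echo $KRB5CCNAME

# then log out and check again from a fresh non-interactive session
# ("user" and "msched" are placeholders for your own account and submit node)
ssh user@msched klist
# "No credentials cache found" at this point would mean the cache is tied to the login session

# crude stop-gap (my idea, not an HTCondor mechanism): renew the TGT periodically
# from the user's crontab so the shadow keeps a valid ticket
# 0 * * * * /usr/bin/kinit -R 2>/dev/null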


Test is running now ... keep your fingers crossed ;)
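
If anyone wants to double-check which authentication method a connection actually negotiates, condor_ping can show that; a sketch (the slot and host names are just examples from my pool):

condor_ping -verbose -name slot1@pssproto04 -type startd READ
# the -verbose output lists the authentication method used (KERBEROS vs. SSL)
# and the identity the connection was mapped to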

Cheers, Jan



On Tue, 2023-06-20 at 12:53 +0200, Jan Behrend wrote:
Hi TJ,

thank you very much for your answer!

On Mon, 2023-06-19 at 20:20 +0000, John M Knoeller via HTCondor-users wrote:

What does the StartLog and StarterLog.slot* say during and right before that job preemption? 

 

I strongly suspect that the reason will show up there.

-tj


I checked the log files on the execution node and found these lines:

root@pssproto04:~# grep -E 'Claimed/Busy -> Preempting/Vacating' log/syslog.1
Jun 19 06:54:21 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 08:57:23 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 11:09:24 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
[...]
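
In case the surrounding lines carry the vacate reason, this is the grep I would run next (the syslog path matches what I used above; the StartLog path is an assumption):

grep -B5 -A5 'Preempting/Vacating' log/syslog.1
# or, with regular file-based daemon logs:
# grep -B5 -A5 'Preempting/Vacating' /var/log/condor/StartLog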

This is just the symptom of my problem, I guess.

On the schedd I found the following, which to me looks like a smoking gun:

Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 successfully killed because the Shadow was hung.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 for job 360.0 exited with status 4

Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 successfully killed because the Shadow was hung.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 for job 360.0 exited with status 4

Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 successfully killed because the Shadow was hung.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 for job 360.0 exited with status 4
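
If the shadow hang itself needs more detail, these are the standard debug knobs I would turn up on the schedd machine (a sketch; apply with condor_reconfig and expect much larger logs):

# e.g. in /etc/condor/config.d/99-debug.conf (the file name is just an example)
SHADOW_DEBUG = D_FULLDEBUG D_SECURITY
SCHEDD_DEBUG = D_FULLDEBUG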


Is this a local problem on the schedd machine running the shadow daemon?
At the same time I get this on the execution hosts:

Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
[...]
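
To see whether the lost connection is really a security/authentication problem, I would also grep the ShadowLog on the schedd and the StarterLog on the execute node around those timestamps (a sketch; the log directory comes from condor_config_val LOG and may differ here since our daemons log to syslog):

grep -iE 'AUTHENTICATE|SECMAN|KERBEROS' $(condor_config_val LOG)/ShadowLog
grep -iE 'AUTHENTICATE|SECMAN|KERBEROS' $(condor_config_val LOG)/StarterLog.slot1_2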


The time intervals are too regular for this to be network or resource related, I think.  Any ideas?

For reference find the logs here:
https://cloud.mpifr-bonn.mpg.de/index.php/s/MgbApwBd9FYTKYD


Thanks for your help!
Cheers, Jan


-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn                                  
Tel: +49 (228) 525 248
