
Re: [HTCondor-users] Unexpected job preemption on pslots



Hi everyone,

I might have found the problem: it's the authentication setup:
SEC_DEFAULT_AUTHENTICATION_METHODS = KERBEROS, SSL

Kerberos is working fine, but after logging out of the submit node the credentials get destroyed. Since the shadow daemons run as the submitting users, the shadow then fails to authenticate itself to the execution hosts after a timeout, probably one of these:

root@msched:~# condor_config_val -dump | grep 3600
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME = 3600
ECRYPTFS_KEY_TIMEOUT = 3600
HOUR = 3600
MASTER_BACKOFF_CEILING = 3600
NOT_RESPONDING_TIMEOUT = 3600
SEC_CREDENTIAL_SWEEP_DELAY = 3600
SHADOW_WORKLIFE = 3600
TRANSFER_IO_REPORT_TIMESPANS = 1m:60 5m:300 1h:3600 1d:86400
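
For reference, a quick way to verify that the credential cache really disappears on logout (a rough sketch; the file-based cache and the paths are assumptions, check KRB5CCNAME on your submit node and adjust for KEYRING/KCM caches):

# as the submitting user, note the cache location and the ticket expiry
klist
echo $KRB5CCNAME

# then log out and check again from a fresh non-interactive session
# ("user" and "msched" are placeholders for your own account and submit node)
ssh user@msched klist
# "No credentials cache found" at this point would mean the cache is tied to the login session

# crude stop-gap (my idea, not an HTCondor mechanism): renew the TGT periodically
# from the user's crontab so the shadow keeps a valid ticket
# 0 * * * * /usr/bin/kinit -R 2>/dev/null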


Test is running now ... keep your fingers crossed ;)
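
If anyone wants to double-check which authentication method a connection actually negotiates, condor_ping can show that; a sketch (the slot and host names are just examples from my pool):

condor_ping -verbose -name slot1@pssproto04 -type startd READ
# the -verbose output lists the authentication method used (KERBEROS vs. SSL)
# and the identity the connection was mapped to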

Cheers, Jan



On Tue, 2023-06-20 at 12:53 +0200, Jan Behrend wrote:
Hi TJ,

thank you very much for your answer!

On Mon, 2023-06-19 at 20:20 +0000, John M Knoeller via HTCondor-users wrote:

What does the StartLog and StarterLog.slot* say during and right before that job preemption? 

 

I strongly suspect that the reason will show up there.

-tj


I checked the log files on the execution node and found these lines:

root@pssproto04:~# grep -E 'Claimed/Busy -> Preempting/Vacating' log/syslog.1
Jun 19 06:54:21 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 08:57:23 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 11:09:24 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
[...]
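
In case the surrounding lines carry the vacate reason, this is the grep I would run next (the syslog path matches what I used above; the StartLog path is an assumption):

grep -B5 -A5 'Preempting/Vacating' log/syslog.1
# or, with regular file-based daemon logs:
# grep -B5 -A5 'Preempting/Vacating' /var/log/condor/StartLog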

This is just the symptom of my problem, I guess.

On the schedd I found the following, which to me looks like a smoking gun:

Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 successfully killed because the Shadow was hung.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 for job 360.0 exited with status 4

Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 successfully killed because the Shadow was hung.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 for job 360.0 exited with status 4

Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 successfully killed because the Shadow was hung.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 for job 360.0 exited with status 4
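
If the shadow hang itself needs more detail, these are the standard debug knobs I would turn up on the schedd machine (a sketch; apply with condor_reconfig and expect much larger logs):

# e.g. in /etc/condor/config.d/99-debug.conf (the file name is just an example)
SHADOW_DEBUG = D_FULLDEBUG D_SECURITY
SCHEDD_DEBUG = D_FULLDEBUG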


Is this a local problem on the schedd machine running the shadow daemon?
At the same time I get this on the execution hosts:

Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
[...]
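
To see whether the lost connection is really a security/authentication problem, I would also grep the ShadowLog on the schedd and the StarterLog on the execute node around those timestamps (a sketch; the log directory comes from condor_config_val LOG and may differ here since our daemons log to syslog):

grep -iE 'AUTHENTICATE|SECMAN|KERBEROS' $(condor_config_val LOG)/ShadowLog
grep -iE 'AUTHENTICATE|SECMAN|KERBEROS' $(condor_config_val LOG)/StarterLog.slot1_2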


The time intervals are too regular for this to be network or resource related, I think.  Any ideas?

For reference find the logs here:
https://cloud.mpifr-bonn.mpg.de/index.php/s/MgbApwBd9FYTKYD


Thanks for your help!
Cheers, Jan


-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn                                  
Tel: +49 (228) 525 248
