
Re: [HTCondor-users] Unexpected job preemption on pslots



Hi TJ,

thank you very much for your answer!

On Mon, 2023-06-19 at 20:20 +0000, John M Knoeller via HTCondor-users wrote:

What do the StartLog and StarterLog.slot* say during and right before that job preemption?

 

I strongly suspect that the reason will show up there.

-tj


I checked the log files on the execution node and found these lines:

root@pssproto04:~# grep -E 'Claimed/Busy -> Preempting/Vacating' log/syslog.1
Jun 19 06:54:21 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 08:57:23 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 11:09:24 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
[...]

I guess this is just a symptom of my problem.

On the schedd I found the following, which looks like a smoking gun to me:

Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 successfully killed because the Shadow was hung.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 for job 360.0 exited with status 4

Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 successfully killed because the Shadow was hung.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 for job 360.0 exited with status 4

Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 successfully killed because the Shadow was hung.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 for job 360.0 exited with status 4


Is this a local problem on the schedd machine running the shadow daemon?
At the same time I get this on the execution hosts:

Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
[...]
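
To check that the schedd, starter, and startd messages really belong to the same events, the excerpts can be grouped by timestamp. The following is only a rough sketch: the log lines are copied from the excerpts in this mail, while the regular expression and the helper name `group_by_minute` are my own assumptions about the syslog layout, not anything HTCondor provides. Events are grouped by minute because the startd line for the third event is stamped one second later than the schedd line.

```python
import re
from collections import defaultdict

# Log lines copied verbatim from the excerpts in this mail.
LINES = """\
Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 06:54:21 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:24 pssproto04 condor_startd[6137]: slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
""".splitlines()

# Assumed syslog layout: "Mon DD HH:MM:SS host daemon[pid]: message"
SYSLOG = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (\w+)\[\d+\]: (.*)$")


def group_by_minute(lines):
    """Map 'Mon DD HH:MM' to the (host, daemon) pairs that logged in that minute."""
    events = defaultdict(list)
    for line in lines:
        match = SYSLOG.match(line)
        if match:
            stamp, host, daemon, _message = match.groups()
            events[stamp[:-3]].append((host, daemon))  # drop the ":SS" part
    return dict(events)


for minute, sources in group_by_minute(LINES).items():
    print(minute, sources)
```

With these lines, each of the three events shows one entry from the schedd on msched and two from pssproto04 (starter and startd), so the hung shadow and the preemption fall in the same minute every time.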


The events recur roughly every two hours, which seems too regular to me to be network or resource related. Any ideas?
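
To put numbers on the spacing, here is a small sketch that computes the gaps between the three "appears hung" messages from the schedd log. The timestamps are the ones quoted in this mail; the year 2023 is an assumption taken from the date of the quoted message, since syslog does not record it, and the function names are only illustrative.

```python
from datetime import datetime

# Timestamps of the "appears hung" messages, copied from the schedd log.
HANG_TIMES = ["Jun 19 06:54:21", "Jun 19 08:57:23", "Jun 19 11:09:23"]


def parse(stamp, year=2023):
    # syslog omits the year; 2023 is assumed here.
    # "%b" expects English month abbreviations (C/English locale).
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def intervals(stamps):
    """Return the time differences between consecutive timestamps."""
    times = [parse(s) for s in stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in intervals(HANG_TIMES):
    print(gap)
```

For the three events shown, the gaps come out as 2:03:02 and 2:12:00, so the spacing is close to two hours but not exactly constant; the full logs would be needed to tell whether this holds for the later events as well.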

For reference, the logs are available here:
https://cloud.mpifr-bonn.mpg.de/index.php/s/MgbApwBd9FYTKYD


Thanks for your help!
Cheers, Jan

