
Re: [HTCondor-users] Unexpected job preemption on pslots



What does the StartLog and StarterLog.slot* say during and right before that job preemption? 

 

I strongly suspect that the reason will show up there.
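One way to pull those logs for inspection — hostname, slot suffix, and the Debian default log path below are assumptions, so adjust them to match the slot named in the job log:

```shell
# Fetch the StartLog from one execute node (requires ADMINISTRATOR authorization)
condor_fetchlog pssproto10 STARTD > StartLog.pssproto10

# Or inspect the logs in place and search around the eviction time
ssh pssproto10 'grep -iE "preempt|vacate|kill" /var/log/condor/StartLog'
ssh pssproto10 'grep -iE "preempt|vacate|kill" /var/log/condor/StarterLog.slot1_2'
```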

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jan Behrend
Sent: Saturday, June 17, 2023 4:53 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Unexpected job preemption on pslots

 

Hi everyone,

 

I have 20 nodes with partitioned slots only (no static slots).

When I start a job that runs for more than an hour, it gets preempted after one hour and then restarted, as shown here:

[graph of the job's run time; inline image not preserved in the archive]

This problem was described in [1] and solved there by an upgrade, which I cannot do because we are stuck with Debian Stretch on the compute nodes for various reasons.

So I am running HTCondor 8.9.13-1 on the compute nodes (STARTD) and 10.0.3-1 on everything else (SCHEDD, COLLECTOR, NEGOTIATOR, ...).

The job shown in the graph is a test 'sleep' job.  The submit file looks like this:

 

jbehrend@msched:~/condor_verification$ cat test.sub
executable              = test.sh
arguments               = 4500
 
log                     = logs/log.$(Cluster).$(Process)
output                  = logs/out.$(Cluster).$(Process)
error                   = logs/err.$(Cluster).$(Process)
 
when_to_transfer_output = ON_EXIT
should_transfer_files   = YES
 
request_disk            = 1G
request_memory          = 1G
 
requirements = regexp("pssproto[0-9]{2}", Machine)
 
queue 1
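test.sh itself is not included in the mail; given arguments = 4500, it is presumably just a sleep wrapper, along these lines (hypothetical reconstruction):

```shell
#!/bin/sh
# Hypothetical test.sh: sleep for the number of seconds given as $1
# (defaults to 0 when called without arguments, e.g. for a quick check)
sleep "${1:-0}"
```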
 

 

There is no HTCondor defragmentation process running anywhere. 

I suspected that automatic pslot preemption was causing this effect, but even after explicitly disabling the feature (ALLOW_PSLOT_PREEMPTION = False) the behavior did not change.
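For debugging, the standard knobs to rule out any policy-driven eviction would be something like the sketch below (debugging values only, not a production recommendation). Notably, CLAIM_WORKLIFE defaults to 3600 seconds, which matches the one-hour pattern — although an expired claim should only stop accepting new jobs, not evict a running one:

```
# Execute-node (startd) side -- debugging values only
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = -1

# Central-manager (negotiator) side
PREEMPTION_REQUIREMENTS = False
```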

 

The job log shows this over and over again:

 

040 (361.000.000) 2023-06-17 10:32:41 Started transferring input files
        Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_2_15194_4c23_69>
...
040 (361.000.000) 2023-06-17 10:32:41 Finished transferring input files
...
001 (361.000.000) 2023-06-17 10:32:42 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
...
040 (361.000.000) 2023-06-17 11:36:43 Started transferring input files
        Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_3_15194_4c23_71>
...
040 (361.000.000) 2023-06-17 11:36:43 Finished transferring input files
...
001 (361.000.000) 2023-06-17 11:36:44 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
...
 

 

Config dumps (condor_config_val -dump) of all node types are attached.
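The relevant values can also be queried live from a running startd, for example (pssproto10 is just an example hostname):

```shell
condor_config_val -name pssproto10 -startd \
    PREEMPT CLAIM_WORKLIFE MAXJOBRETIREMENTTIME ALLOW_PSLOT_PREEMPTION
```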

If you need more information, I am happy to provide it.

 

Any help is greatly appreciated!

 

Cheers Jan

 

 

-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn                                  
Tel: +49 (228) 525 248
https://www.mpifr-bonn.mpg.de