
Re: [HTCondor-users] Unexpected job preemption on pslots



What does the StartLog and StarterLog.slot* say during and right before that job preemption? 

 

I strongly suspect that the reason will show up there.
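One way to pull those logs for inspection — hostname, slot suffix, and the Debian default log path below are assumptions, so adjust them to match the slot named in the job log:

```shell
# Fetch the StartLog from one execute node (requires ADMINISTRATOR authorization)
condor_fetchlog pssproto10 STARTD > StartLog.pssproto10

# Or inspect the logs in place and search around the eviction time
ssh pssproto10 'grep -iE "preempt|vacate|kill" /var/log/condor/StartLog'
ssh pssproto10 'grep -iE "preempt|vacate|kill" /var/log/condor/StarterLog.slot1_2'
```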

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jan Behrend
Sent: Saturday, June 17, 2023 4:53 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Unexpected job preemption on pslots

 

Hi everyone,

 

I have 20 nodes with partitioned slots only (no static slots).

When I start a job that runs for more than an hour, it gets preempted after one hour and then restarted, as shown here:

[graph of the job's run time; inline image not preserved in the archive]

This problem was described in [1] and solved there by an upgrade, which I cannot do because we are stuck with Debian Stretch on the compute nodes for various reasons.

So I am running HTCondor 8.9.13-1 on the compute nodes (STARTD) and 10.0.3-1 on everything else (SCHEDD, COLLECTOR, NEGOTIATOR, ...).

The job shown in the graph is a test 'sleep' job.  The submit file looks like this:

 

jbehrend@msched:~/condor_verification$ cat test.sub
executable              = test.sh
arguments               = 4500
 
log                     = logs/log.$(Cluster).$(Process)
output                  = logs/out.$(Cluster).$(Process)
error                   = logs/err.$(Cluster).$(Process)
 
when_to_transfer_output = ON_EXIT
should_transfer_files   = YES
 
request_disk            = 1G
request_memory          = 1G
 
requirements = regexp("pssproto[0-9]{2}", Machine)
 
queue 1
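test.sh itself is not included in the mail; given arguments = 4500, it is presumably just a sleep wrapper, along these lines (hypothetical reconstruction):

```shell
#!/bin/sh
# Hypothetical test.sh: sleep for the number of seconds given as $1
# (defaults to 0 when called without arguments, e.g. for a quick check)
sleep "${1:-0}"
```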
 

 

There is no HTCondor defragmentation process running anywhere. 

I suspected that automatic pslot preemption was causing this effect, but even after explicitly disabling the feature (ALLOW_PSLOT_PREEMPTION = False) the behavior did not change.
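For debugging, the standard knobs to rule out any policy-driven eviction would be something like the sketch below (debugging values only, not a production recommendation). Notably, CLAIM_WORKLIFE defaults to 3600 seconds, which matches the one-hour pattern — although an expired claim should only stop accepting new jobs, not evict a running one:

```
# Execute-node (startd) side -- debugging values only
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = -1

# Central-manager (negotiator) side
PREEMPTION_REQUIREMENTS = False
```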

 

The job log shows this over and over again:

 

040 (361.000.000) 2023-06-17 10:32:41 Started transferring input files
        Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_2_15194_4c23_69>
...
040 (361.000.000) 2023-06-17 10:32:41 Finished transferring input files
...
001 (361.000.000) 2023-06-17 10:32:42 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
...
040 (361.000.000) 2023-06-17 11:36:43 Started transferring input files
        Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_3_15194_4c23_71>
...
040 (361.000.000) 2023-06-17 11:36:43 Finished transferring input files
...
001 (361.000.000) 2023-06-17 11:36:44 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
...
 

 

Config dumps (condor_config_val -dump) of all node types are attached.
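The relevant values can also be queried live from a running startd, for example (pssproto10 is just an example hostname):

```shell
condor_config_val -name pssproto10 -startd \
    PREEMPT CLAIM_WORKLIFE MAXJOBRETIREMENTTIME ALLOW_PSLOT_PREEMPTION
```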

If you need more information, I am happy to provide it.

 

Any help is greatly appreciated!

 

Cheers Jan

 

 

-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn                                  
Tel: +49 (228) 525 248
https://www.mpifr-bonn.mpg.de