
Re: [HTCondor-users] Unexpected job preemption on pslots



Hi everyone,

I have downgraded the central manager and the submit node to the version
running on the compute nodes, but nothing has changed.
Can somebody shed some light on this? I am out of ideas ...

Cheers, Jan

On Sat, 2023-06-17 at 11:52 +0200, Jan Behrend wrote:
> I have 20 nodes with partitioned slots only (no static slots).
> When I start a job that runs for more than an hour, it gets preempted
> after 1 hour and is then restarted:
> 
> 
> 
> This problem was described in [1] and solved there by an upgrade,
> which I cannot do because we are stuck with Debian Stretch on the
> compute nodes for various reasons.
> So I am running HTCondor 8.9.13-1 on the compute nodes (STARTD) and
> 10.0.3-1 on everything else (SCHEDD, COLLECTOR, NEGOTIATOR, ...).
> The job shown in the graph is a test 'sleep' job. The submit file
> looks like this:
> 
> jbehrend@msched:~/condor_verification$ cat test.sub
> executable              = test.sh
> arguments               = 4500
> 
> log                     = logs/log.$(Cluster).$(Process)
> output                  = logs/out.$(Cluster).$(Process)
> error                   = logs/err.$(Cluster).$(Process)
> 
> when_to_transfer_output = ON_EXIT
> should_transfer_files   = YES
> 
> request_disk            = 1G
> request_memory          = 1G
> 
> requirements = regexp("pssproto[0-9]{2}", Machine)
> 
> queue 1
> 
> 
> There is no HTCondor defragmentation process running anywhere.
> I suspected that automatic pslot preemption was causing this, but
> even after explicitly disabling the feature
> (ALLOW_PSLOT_PREEMPTION = False) the behavior did not change.
> 
> The job log shows this over and over again:
> 
> 040 (361.000.000) 2023-06-17 10:32:41 Started transferring input files
>     Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_2_15194_4c23_69>
> ...
> 040 (361.000.000) 2023-06-17 10:32:41 Finished transferring input files
> ...
> 001 (361.000.000) 2023-06-17 10:32:42 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
> ...
> 040 (361.000.000) 2023-06-17 11:36:43 Started transferring input files
>     Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_3_15194_4c23_71>
> ...
> 040 (361.000.000) 2023-06-17 11:36:43 Finished transferring input files
> ...
> 001 (361.000.000) 2023-06-17 11:36:44 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
> ...
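
To put a number on the restart interval, a small script can pull the
"Job executing on host" events (event code 001) out of a job log like
the one quoted above and report the time between consecutive starts.
This is only a sketch: the function names are made up for illustration,
it assumes the event header format shown in the excerpt, and the sample
text elides the host address as `<...>`.

```python
import re
from datetime import datetime

# Header line of an execute event (code 001) as it appears in the excerpt.
EXECUTE_RE = re.compile(
    r"^001 \((\d+\.\d+\.\d+)\) "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Job executing on host:"
)


def execution_starts(log_text):
    """Return {job_id: [datetime, ...]} for every execute event found."""
    starts = {}
    for line in log_text.splitlines():
        # Drop mail quoting ("> ") so a log pasted from an email also works.
        line = re.sub(r"^(>\s?)+", "", line)
        match = EXECUTE_RE.match(line)
        if match:
            job_id, stamp = match.groups()
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            starts.setdefault(job_id, []).append(when)
    return starts


def restart_gaps(log_text):
    """Return {job_id: [seconds between consecutive execute events]}."""
    gaps = {}
    for job_id, times in execution_starts(log_text).items():
        gaps[job_id] = [
            (later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])
        ]
    return gaps


# The two execute events from the excerpt, host address elided.
SAMPLE = """\
001 (361.000.000) 2023-06-17 10:32:42 Job executing on host: <...>
...
001 (361.000.000) 2023-06-17 11:36:44 Job executing on host: <...>
...
"""

if __name__ == "__main__":
    print(restart_gaps(SAMPLE))
```

For the two events in the excerpt this gives a single gap of 3842
seconds, i.e. about 64 minutes between starts. With the submit file
above, the real log for this job would be read from logs/log.361.0 and
passed to restart_gaps() in the same way.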
> 
> 
> Find config dumps (condor_config_val -dump) of all node types
> attached.
> If you need more information I am happy to help.
> 
> Any help is greatly appreciated!
> 
> Cheers Jan
> 
> [1]
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-January/msg00013.shtml
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to
> htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn
Tel: +49 (228) 525 248
https://www.mpifr-bonn.mpg.de
