Hi everyone,

I have downgraded the central manager and the submit node to the compute
node version, but there is still no change.

Can somebody shed light on this? I am out of ideas ...

Cheers,
Jan

On Sat, 2023-06-17 at 11:52 +0200, Jan Behrend wrote:
> I have 20 nodes with partitioned slots only (no static slots).
> When I start a job that runs over an hour, it gets preempted after 1
> hour, and restarted after this:
>
>
>
> This problem was described in [1] and solved by an upgrade which I
> cannot do because we are stuck with Debian Stretch on the compute
> nodes for various reasons.
> So I am running HTCondor version 8.9.13-1 on the compute nodes
> (STARTD) and 10.0.3-1 on the rest (SCHEDD, COLLECTOR,
> NEGOTIATOR, ...).
> The job shown in the graph is a test 'sleep' job. The submit file
> looks like this:
>
> jbehrend@msched:~/condor_verification$ cat test.sub
> executable = test.sh
> arguments = 4500
>
> log = logs/log.$(Cluster).$(Process)
> output = logs/out.$(Cluster).$(Process)
> error = logs/err.$(Cluster).$(Process)
>
> when_to_transfer_output = ON_EXIT
> should_transfer_files = YES
>
> request_disk = 1G
> request_memory = 1G
>
> requirements = regexp("pssproto[0-9]{2}", Machine)
>
> queue 1
>
>
> There is no HTCondor defragmentation process running anywhere.
> I suspected that automatic pslot preemption was causing this
> effect, but even after explicitly disabling the feature
> (ALLOW_PSLOT_PREEMPTION = False) the behavior did not change.
>
> The job log shows this over and over again:
>
> 040 (361.000.000) 2023-06-17 10:32:41 Started transferring input files
>         Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_2_15194_4c23_69>
> ...
> 040 (361.000.000) 2023-06-17 10:32:41 Finished transferring input files
> ...
> 001 (361.000.000) 2023-06-17 10:32:42 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
> ...
> 040 (361.000.000) 2023-06-17 11:36:43 Started transferring input files
>         Transferring to host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=slot1_3_15194_4c23_71>
> ...
> 040 (361.000.000) 2023-06-17 11:36:43 Finished transferring input files
> ...
> 001 (361.000.000) 2023-06-17 11:36:44 Job executing on host: <10.98.68.10:9618?addrs=10.98.68.10-9618&alias=pssproto10.protonip.mkat.karoo.kat.ac.za&noUDP&sock=startd_15026_2a27>
> ...
>
>
> Find config dumps (condor_config_val -dump) of all node types
> attached.
> If you need more information I am happy to help.
>
> Any help is greatly appreciated!
>
> Cheers Jan
>
> [1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2021-January/msg00013.shtml

-- 
MAX-PLANCK-INSTITUT fuer Radioastronomie
Jan Behrend - Backend Development Group
----------------------------------------
Auf dem Huegel 69, D-53121 Bonn
Tel: +49 (228) 525 248
https://www.mpifr-bonn.mpg.de