Hi again,

we just had another job behave like this. It was submitted requesting 32 nodes, which were free at that point. Watching

  condor_status -const 'PartitionableSlot isnt true' -af ClientMachine RemoteUser Cpus JobId

one could see a rising number of slots with an undefined JobId until it reached 30. At that point condor_q showed the job as running, but within seconds it was back to 'idle', and 12 nodes were left in condor_status' view without a defined JobId.

Looking further through the logs, not much is seen, e.g. in the SchedLog:

10/02/20 14:43:44 (pid:1398) Starting add_shadow_birthdate(969.0)
10/02/20 14:43:44 (pid:1398) Started shadow for job 969.0 on slot1@xxxxxxxxxxxxxxxxx <10.10.82.1:9618?addrs=10.10.82.1-9618&noUDP&sock=2209_c4cc_3> for DedicatedScheduler, (shadow pid = 1864058)
10/02/20 14:43:45 (pid:1398) Received a superuser command
10/02/20 14:43:45 (pid:1398) Number of Active Workers 0
10/02/20 14:43:46 (pid:1398) In DedicatedScheduler::reaper pid 1864058 has status 27648
10/02/20 14:43:46 (pid:1398) Shadow pid 1864058 exited with status 108
10/02/20 14:43:46 (pid:1398) Dedicated job abnormally ended, releasing claim
10/02/20 14:43:46 (pid:1398) Dedicated job abnormally ended, releasing claim
[..]

So we are still puzzled. Does anyone have an idea where to dig for more information about what may have gone wrong?

Cheers

Carsten
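P.S. For completeness, a sketch of how the claims were being watched; the watch wrapper and the 2-second interval are only illustrative, the condor_status invocation is the one quoted above:

  watch -n 2 "condor_status -const 'PartitionableSlot isnt true' -af ClientMachine RemoteUser Cpus JobId"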