Hi Todd,

I'm resurrecting this thread because I think we're still seeing related problems. One of our users has a parallel universe job that has been idle for almost a day. The StartLog on the available nodes seems to indicate that the nodes are held for a while and then released without ever having enough nodes to start the job:

03/06/18 15:26:04 slot1_1: State change: received RELEASE_CLAIM command
03/06/18 15:26:04 slot1_1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/06/18 15:26:04 slot1_1: State change: No preempting claim, returning to owner
03/06/18 15:26:04 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/06/18 15:26:04 slot1_1: Changing state: Owner -> Delete
03/06/18 15:26:04 slot1_1: Resource no longer needed, deleting
03/06/18 15:26:21 slot1_1: New machine resource of type -1 allocated
03/06/18 15:26:21 slot1: Changing state: Owner -> Unclaimed
03/06/18 15:26:21 slot1: State change: IS_OWNER is TRUE
03/06/18 15:26:21 slot1: Changing state: Unclaimed -> Owner
03/06/18 15:26:21 Setting up slot pairings
03/06/18 15:26:21 slot1_1: Request accepted.
03/06/18 15:26:21 slot1_1: Remote owner is soumi.de@xxxxxxxxxxxxxxxxxxxxxxxxx
03/06/18 15:26:21 slot1_1: State change: claiming protocol successful
03/06/18 15:26:21 slot1_1: Changing state: Owner -> Claimed

Restarting condor on the submit node, which is configured as the DedicatedScheduler, results in the following in the SchedLog:

03/09/18 12:01:15 (pid:3405053) Delaying scheduling of parallel jobs because startd query time is long (1) seconds
03/09/18 12:01:29 (pid:3405053) Using negotiation protocol: NEGOTIATE
03/09/18 12:01:29 (pid:3405053) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
03/09/18 12:01:32 (pid:3405053) Finished negotiating for DedicatedScheduler in local pool: 0 matched, 17 rejected

The DedicatedScheduler is running

[root@sugwg-condor ~]# condor_version
$CondorVersion: 8.7.7 Jan 24 2018 BuildID: 429803 PRE-RELEASE-UWCS $
$CondorPlatform: x86_64_RedHat7 $

The worker nodes have

[root@CRUSH-SUGWG-OSG-10-5-187-144 ~]# condor_version
$CondorVersion: 8.6.7 Oct 29 2017 BuildID: 422776 $
$CondorPlatform: x86_64_RedHat7 $

Any suggestions? If you need any additional information, please let me know.

Cheers,
- Larne