[HTCondor-users] slots stay claimed/idle even after UNUSED_CLAIM_TIMEOUT expired
- Date: Fri, 06 Aug 2021 18:18:18 +0300 (MSK)
- From: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
- Subject: [HTCondor-users] slots stay claimed/idle even after UNUSED_CLAIM_TIMEOUT expired
Hi,
I set UNUSED_CLAIM_TIMEOUT to 180, but some (dynamic) slots stay in the Claimed/Idle state indefinitely (see the last column):
condor_status -af:h Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 191 107 undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 5
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 5
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 5
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 73153
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 73153
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 73153
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84669
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84669
slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84669
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 191 107 undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 18
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 18
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 65209
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 65209
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 83370
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 65209
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 65209
slot1_8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 65209
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 16593
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 191 107 undefined
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 23
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 512 73171
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 23
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
slot1_9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
slot1_10@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
slot1_11@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
slot1_12@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 1 128 84608
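To pick the stuck slots out of output like the above, a small filter can help. This is only a sketch: it assumes the seven-column layout produced by the condor_status command shown earlier, it treats TotalTimeClaimedIdle as a rough proxy for how long the claim has been unused, and the function name find_stale_claims and the host name "host" in the sample are made up for illustration.

```python
def find_stale_claims(status_text, timeout=180):
    """Return (name, idle_seconds) for Claimed/Idle slots idle longer than timeout.

    Assumes whitespace-separated columns in this order:
    Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
    """
    stale = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything with an unexpected shape.
        if len(fields) != 7 or fields[0] == "Name":
            continue
        name, _opsys, state, activity, _cpus, _memory, idle = fields
        # Partitionable parent slots report "undefined" here and are skipped.
        if state == "Claimed" and activity == "Idle" and idle.isdigit():
            if int(idle) > timeout:
                stale.append((name, int(idle)))
    return stale


# Hypothetical sample in the same layout as the output above.
sample = """\
Name OpSys State Activity Cpus Memory TotalTimeClaimedIdle
slot1@host LINUX Unclaimed Idle 191 107 undefined
slot1_1@host LINUX Claimed Idle 1 512 5
slot1_4@host LINUX Claimed Idle 1 128 73153
"""

for name, idle in find_stale_claims(sample):
    print(f"{name}: Claimed/Idle for {idle}s")
```

The default of 180 mirrors the UNUSED_CLAIM_TIMEOUT value above; the script only reads text and does not talk to HTCondor itself.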
Normally, when a slot exceeds UNUSED_CLAIM_TIMEOUT, there is a record in the log saying that the slot is being released:
2021-08-06T14:45:50.956165411Z condor_schedd[3032]: Resource slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx has been unused for 182 seconds, limit is 180, releasing
But for the problematic slots, the last records in the log are from hours ago (~24h):
2021-08-05T16:34:12.960897270Z condor_startd[859]: slot1_12: State change: starter exited
2021-08-05T16:34:12.960904225Z condor_startd[859]: slot1_12: Changing activity: Busy -> Idle
2021-08-05T16:34:12.960968125Z condor_startd[859]: slot1_12: State change: idle claim shutting down due to CLAIM_WORKLIFE
2021-08-05T16:34:12.960974666Z condor_startd[859]: slot1_12: Changing state and activity: Claimed/Idle -> Preempting/Vacating
2021-08-05T16:34:12.962018643Z condor_startd[859]: slot1_12: State change: No preempting claim, returning to owner
2021-08-05T16:34:12.962359058Z condor_startd[859]: slot1_12: Changing state and activity: Preempting/Vacating -> Owner/Idle
2021-08-05T16:34:12.962697322Z condor_startd[859]: slot1_12: State change: IS_OWNER is false
2021-08-05T16:34:12.962706591Z condor_startd[859]: slot1_12: Changing state: Owner -> Unclaimed
2021-08-05T16:34:12.962748296Z condor_startd[859]: slot1_12: Changing state: Unclaimed -> Delete
2021-08-05T16:34:12.962880429Z condor_startd[859]: slot1_12: Resource no longer needed, deleting
and then nothing. The slots are still there and claimed.
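To list which slots the startd last logged as deleted (so they can be compared against what condor_status still shows), the log can be scanned for each slot's final message. Again, just a sketch: it assumes the line layout seen in the excerpts above (timestamp, condor_startd[pid], slot name, message), and the helper names are made up for illustration.

```python
import re

# Matches lines like:
#   <timestamp> condor_startd[859]: slot1_12: <message>
STARTD_LINE = re.compile(r"^(\S+)\s+condor_startd\[\d+\]:\s+(slot\S+?):\s+(.*)$")


def last_startd_message(log_text):
    """Map each slot name to the (timestamp, message) of its last startd log line."""
    last = {}
    for line in log_text.splitlines():
        match = STARTD_LINE.match(line.strip())
        if match:
            timestamp, slot, message = match.groups()
            last[slot] = (timestamp, message)  # later lines overwrite earlier ones
    return last


def slots_logged_as_deleted(log_text):
    """Return slots whose final startd message says the resource was deleted."""
    return sorted(
        slot
        for slot, (_timestamp, message) in last_startd_message(log_text).items()
        if message.startswith("Resource no longer needed")
    )


# Two lines taken from the log excerpt above.
sample_log = """\
2021-08-05T16:34:12.962748296Z condor_startd[859]: slot1_12: Changing state: Unclaimed -> Delete
2021-08-05T16:34:12.962880429Z condor_startd[859]: slot1_12: Resource no longer needed, deleting
"""

print(slots_logged_as_deleted(sample_log))
```

Any slot reported here that still appears as Claimed/Idle in condor_status matches the mismatch described in this message.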
Is this a bug? Is there a way to release these slots forcefully?
Best regards,
Stanislav V. Markevich