Fellow Condor users,

I have a medium-sized (~150 CPUs) Windows cluster running Condor version 7.6.1. Recently, I have noticed that I cannot utilize all of the resources: a number of the CPUs remain in a persistent “unclaimed” state. The most relevant log entry I can find relating to this is in StartLog:

01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed

This sequence repeats indefinitely for the resource in question. My guess is that the RELEASE_CLAIM is the culprit, but what is the origin of the RELEASE_CLAIM? What’s truly mysterious is that it affects only a few CPUs in a multi-core system; the rest are behaving normally.

I spent a few hours combing the log files and past forum posts, but have not been able to find a suitable solution to this problem. Has anyone encountered this before? Any solutions?

Thanks,
Eric
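
For anyone wanting to check their own StartLog for the same pattern, here is a minimal sketch that tallies RELEASE_CLAIM events per slot. It assumes every log line has the "MM/DD/YY HH:MM:SS slotN: message" layout seen in the excerpt above; the function name count_release_claims is made up for illustration and is not part of Condor.

```python
import re
from collections import Counter

# Assumed line layout, taken from the StartLog excerpt:
#   01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
LINE_RE = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (slot[\d_]+): (.*)$"
)


def count_release_claims(lines):
    """Return a Counter mapping slot name -> number of RELEASE_CLAIM events.

    `lines` is any iterable of log lines (for example an open file).
    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        slot, message = match.groups()
        if "received RELEASE_CLAIM command" in message:
            counts[slot] += 1
    return counts
```

Passing an open StartLog file to count_release_claims and printing the result of most_common() should make it obvious which slots are cycling and which are not.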