Hi Stephen,
We ran into this too. In our case, the condor_starter process handling each of those jobs didn't exit properly and was still running. Connecting to the host and killing the stuck condor_starter process fixed the issue.
Alternatively, restarting Condor on the hosts will also get rid of anything still running and update the collector.
Hope that helps,
Collin
More Information for the curious:
Here's the end of the StarterLog for one of the affected slots:
08/28/18 18:51:52 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <IP removed> (try 1 of 3): SECMAN:2003:TCP connection to daemon at <IP removed> failed.
08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
08/28/18 18:53:34 Process exited, pid=43481, status=0
The pid listed belonged to the job that had been running on that slot; that job exited successfully and finished elsewhere.
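If you want to check other slots for the same signature, here is a rough Python sketch that scans StarterLog lines for the "giving up" message followed by a "Process exited" line. It is only a heuristic built from the excerpt above, not an HTCondor tool; the function name find_orphaned_job_pids and the two regular expressions are my own assumptions about what to match.

```python
import re

# Patterns taken from the StarterLog excerpt above (assumed to be stable wording).
GIVE_UP = re.compile(r"ChildAliveMsg: giving up because deadline expired")
EXITED = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")


def find_orphaned_job_pids(log_lines):
    """Return pids of jobs that exited after the starter gave up on DC_CHILDALIVE.

    Hypothetical helper: a match only suggests the starter may be stuck,
    it does not prove it.
    """
    gave_up = False
    pids = []
    for line in log_lines:
        if GIVE_UP.search(line):
            gave_up = True
            continue
        match = EXITED.search(line)
        if match and gave_up:
            pids.append(int(match.group(1)))
    return pids


sample = [
    "08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired "
    "for sending DC_CHILDALIVE to parent.",
    "08/28/18 18:53:34 Process exited, pid=43481, status=0",
]
print(find_orphaned_job_pids(sample))
```

A slot whose log ends this way while a condor_starter is still alive on the host is a candidate for the cleanup described above.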
We noticed this because it was affecting group accounting during negotiation. The negotiator would allocate the correct number of slots based on the number of jobs reported by the Schedd, but would then skip negotiation for that group because it used the incorrect number of jobs from the collector, which still counted the stuck slots, when determining current resource usage.
Here's an example where the submitter had only one pending 1-core job and no running jobs, but there were two stuck slots with 32-core jobs from that submitter:
11/26/18 12:40:59 group quotas: group= prod.<group removed>  quota= 511.931  requested= 1  allocated= 1  unallocated= 0
<...>
11/26/18 12:40:59 subtree_usage at prod.<group removed> is 64
<...>
11/26/18 12:41:01 Group prod.<group removed> - skipping, at or over quota (quota=511.931) (usage=64) (allocation=1)
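
To spell out the arithmetic in those log lines, here is a small Python sketch. It assumes the negotiator skips a group when its reported usage is at or above its allocation for the cycle, which is what the "at or over quota" message suggests; the function should_skip_group is a hypothetical illustration, not the negotiator's actual code.

```python
def should_skip_group(usage, allocation):
    # Assumption: a group is skipped once its reported usage already
    # meets or exceeds what it was allocated in this negotiation cycle.
    return usage >= allocation


allocation = 1                # one pending 1-core job, per the Schedd
real_running_cores = 0        # the submitter had no running jobs
stuck_slot_cores = 2 * 32     # two stuck slots with 32-core jobs

# Usage as seen through the collector, including the stuck slots.
reported_usage = real_running_cores + stuck_slot_cores

print(reported_usage)                                     # 64
print(should_skip_group(reported_usage, allocation))      # True
print(should_skip_group(real_running_cores, allocation))  # False
```

With the stuck slots cleaned up, the reported usage drops back to 0 and the group would be negotiated normally under this assumption.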