
[HTCondor-users] Consistency problems between schedd(s) view and CM?



Hi,

We have ported our “VOViews” package from Torque to HTCondor.  I’ve now noticed three times that there are consistency issues.  Basically it goes like this: the number of dynamic slots that the CM reports, and the CPUs in them, should agree with what a survey of the schedds says about the number of running jobs and the cores they occupy.  If we then add the offline slots, the free slots (hanging around in the partitionable slot waiting to be occupied) and the slots held back from running by the defrag daemon, the result should equal the total capacity.

This “sum rule” is what has been violated a few times.  All three violations happened at moments when there was a lot of activity on the cluster; in the most recent one, for example, the cluster was opened back up after a downtime, so thousands of queued jobs suddenly found thousands of empty cores to occupy.
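
For concreteness, the sum rule amounts to something like the check below. This is only a sketch, not the actual VOViews code: the function name, the argument names and the numbers are made up for illustration, and everything is counted in cores.

```python
def sum_rule_deficit(total_cores, running_cores, free_cores,
                     offline_cores, draining_cores):
    """Return total capacity minus all accounted-for cores.

    0 means the sum rule holds. A negative value means more cores are
    accounted for than exist, e.g. because the schedds report more
    running cores than the pool can actually be hosting.
    """
    accounted = running_cores + free_cores + offline_cores + draining_cores
    return total_cores - accounted


# Hypothetical numbers, purely for illustration.
consistent = sum_rule_deficit(1000, 700, 200, 60, 40)
violated = sum_rule_deficit(1000, 760, 200, 60, 40)
print(consistent, violated)
```

A nonzero result does not say which view is wrong, only that the views disagree at the moment they were sampled.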

I realized I could get a different “running job” view by collecting information on all the dynamic slots via condor_status.  I did so, and compared that view to a simultaneous collection from the schedds.  In the best case they agreed; otherwise it has always (always meaning the ten times I checked) been the case that the schedds have jobs that the CM does not, never the other way around.
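
The comparison itself is just a set difference in both directions. A minimal sketch, assuming each running job can be reduced to one identifier that is unique across all access points; the identifiers below are invented and `compare_views` is a hypothetical helper:

```python
def compare_views(schedd_jobs, cm_jobs):
    """Compare the running-job sets seen by the schedds and by the CM.

    Both arguments are iterables of job identifiers that are assumed to
    be unique across all access points.
    Returns (jobs only the schedds know, jobs only the CM knows).
    """
    schedd_set = set(schedd_jobs)
    cm_set = set(cm_jobs)
    return schedd_set - cm_set, cm_set - schedd_set


# Invented identifiers, purely for illustration.
schedd_view = ["ap1#101.0", "ap1#102.0", "ap2#55.0", "ap3#7.0"]
cm_view = ["ap1#101.0", "ap2#55.0", "ap3#7.0"]

schedd_only, cm_only = compare_views(schedd_view, cm_view)
print(sorted(schedd_only), sorted(cm_only))
```

What I observe corresponds to `schedd_only` sometimes being non-empty while `cm_only` is always empty.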

Hypothesis: this must mean that some slots are “doubly occupied”.  I checked, and indeed, in a few cases one of our CEs (access points) had a job from ATLAS in a particular dynamic slot, while a different access point claimed to have a job from LHCb in that exact same slot.
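
Finding such slots comes down to grouping the schedd-side records by slot name and keeping the slots with more than one claimant. A sketch under the assumption that each running job has already been reduced to an (access point, slot name, job identifier) triple; the host names, slot names and job identifiers below are invented:

```python
from collections import defaultdict


def doubly_occupied(claims):
    """Return the slots that more than one job claims to be running in.

    claims: iterable of (access_point, slot_name, job_id) tuples as
    collected from the schedds. The result maps each contested slot
    name to the sorted list of (access_point, job_id) pairs claiming it.
    """
    by_slot = defaultdict(set)
    for access_point, slot_name, job_id in claims:
        by_slot[slot_name].add((access_point, job_id))
    return {slot: sorted(owners)
            for slot, owners in by_slot.items()
            if len(owners) > 1}


# Invented records, purely for illustration.
claims = [
    ("ap1", "slot1_3@wn-001", "atlas-101.0"),
    ("ap2", "slot1_3@wn-001", "lhcb-55.0"),
    ("ap3", "slot1_1@wn-002", "atlas-7.0"),
]
contested = doubly_occupied(claims)
print(contested)
```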

Question: how fresh is the information supposed to be / guaranteed to be in the schedds and in the CM?  I tried to find that information, as well as a description of whether one of those views is “holier” than the other, but came up empty-handed.  Does anybody here know the answers?

We have four access points btw.

JT