
[Condor-users] condor_status -schedd and running shadow processes mismatch



Hello,

We are running one submit-only host (P4, 2.6 GHz, 512 MB RAM) as well as a Condor master (dual Xeon 2.8 GHz, 2 GB RAM), which also submits jobs.
When I run condor_status -schedd, it tells me that the master is running 762 jobs. When I run ps uax | grep shadow | wc -l (to count the shadow processes), only around 200 are running.

The submission machine shows about the same mismatch (600 versus 200).
condor_status -total also shows different numbers (I think the information for condor_status -total and condor_status -schedd comes from different sources?).
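
As a side note on the counting itself: a ps | grep pipeline can include the grep process in its own count, so the shadow figure may be off by one. That does not explain a gap of several hundred, but it is easy to rule out. A minimal sketch in Python, assuming the ps output has already been captured as text; the function name and the sample lines are made up for illustration:

```python
def count_shadow_processes(ps_output: str, name: str = "shadow") -> int:
    """Count lines of ps output that mention `name`, skipping any grep line."""
    count = 0
    for line in ps_output.splitlines():
        # Skip the line for the grep command itself, which also contains `name`.
        if name in line and "grep" not in line:
            count += 1
    return count


# Made-up sample in the rough shape of `ps uax` output (not real data).
sample = "\n".join([
    "condor 2101 0.0 0.2 condor_shadow",
    "condor 2102 0.0 0.2 condor_shadow",
    "mhess  2200 0.0 0.0 grep shadow",
    "condor 1900 0.1 0.5 condor_schedd",
])

print(count_shadow_processes(sample))  # prints 2
```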

We are running about 1400 clients at the moment, with jobs taking about 15 minutes each (so we need to start around 100 jobs a minute). I have lowered SHADOW_DELAY to 0.1 so that the shadow processes start faster. Is there any reason why the schedd should wait 2 s between shadows?
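
The arithmetic behind these numbers can be sketched as follows. This is only a back-of-the-envelope model: it assumes that each schedd starts exactly one shadow per delay interval and that nothing else throttles job starts, and the function names are made up for illustration.

```python
def required_starts_per_minute(machines: int, job_minutes: float) -> float:
    """Job starts per minute needed to keep `machines` busy in steady state."""
    return machines / job_minutes


def max_start_delay_seconds(machines: int, job_minutes: float,
                            submitters: int = 1) -> float:
    """Longest delay between starts, per submitter, that still keeps up."""
    per_submitter = required_starts_per_minute(machines, job_minutes) / submitters
    return 60.0 / per_submitter


def sustained_machines(delay_seconds: float, job_minutes: float) -> float:
    """Machines one schedd can keep busy when it starts one job per delay."""
    return (60.0 / delay_seconds) * job_minutes


# Figures from this post: 1400 clients, 15-minute jobs, two submitting hosts.
print(round(required_starts_per_minute(1400, 15), 1))  # 93.3
print(round(max_start_delay_seconds(1400, 15, submitters=2), 2))
print(sustained_machines(2.0, 15))  # 450.0
```

Under these assumptions, a 2 s delay lets a single schedd keep only about 450 machines busy with 15-minute jobs, which is why the delay matters at this pool size.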

ps now reports 220 shadow processes on the master and 200 on the submitter.
condor_status -schedd tells me 1071, and condor_status -total tells me 1020 claimed.

Is anyone facing the same problems? What hardware would one need to scale this up to 4000 machines? Can we still make this one pool, or are many smaller pools better? What is the best size for a pool?
Should I create many smaller pools (of what size?) and flock them, or should I create one big pool with many submitters?

One advantage of several smaller pools would definitely be that matchmaking no longer takes as long. How well does flocking work if you flock 10 pools of 400 machines each? I would like to hide the technical details from the user as much as possible: the user should see a web interface for submitting jobs and not have to care about the technical details at all.

Best regards,

Michael Hess