On 23/10/2015 22:42, Michael Paterson wrote:
> I'm trying to run 4 single-core jobs on a 4-CPU box with partitionable slots and ~7500m memory; the jobs have 1500m set for the memory requirement. Sometimes all 4 slots will start up on a machine, but others only get 3.
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
>  7804 slot03    30  10 1381m 787m  26m R 100.0 10.5  63:25.24 basf2
> 10279 slot02    30  10 1347m 793m  64m R 100.0 10.6  41:47.69 basf2
>  6322 slot01    30  10 1546m 891m  21m R  98.4 11.9 386:24.91 basf2
>
> # free -m
>              total       used       free     shared    buffers     cached
> Mem:          7514       6824        689          0         69       3320
> -/+ buffers/cache:       3434       4080
> Swap:        16383         10      16373
>
> Is the ~3G in cache preventing the 4th job from getting a slot?

No, the VFS cache is nothing to do with this. The OS will always eject stuff from the cache when it needs more RAM.
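As a rough check that memory is not the limit here, the arithmetic can be sketched with the figures from the free output. This assumes condor's idea of the machine's memory matches the 7514 MB total that free reports, and that each dynamic slot takes exactly the 1500 MB requested; the real values are whatever condor_status shows, and the function names below are purely illustrative.

```python
# Rough memory budget for a partitionable slot, using figures from the post.
# Assumptions (not confirmed by any condor_status output):
#   - condor sees the same total memory that `free -m` reports (7514 MB)
#   - each dynamic slot is carved out with exactly the requested 1500 MB
TOTAL_MEMORY_MB = 7514
REQUEST_MEMORY_MB = 1500


def remaining_after(running_jobs, total_mb=TOTAL_MEMORY_MB,
                    request_mb=REQUEST_MEMORY_MB):
    """Memory left in the partitionable slot once `running_jobs` slots exist."""
    return total_mb - running_jobs * request_mb


def fits_another(running_jobs, total_mb=TOTAL_MEMORY_MB,
                 request_mb=REQUEST_MEMORY_MB):
    """True if one more job of `request_mb` still fits on memory grounds."""
    return remaining_after(running_jobs, total_mb, request_mb) >= request_mb


if __name__ == "__main__":
    for jobs in range(5):
        print(jobs, "running:", remaining_after(jobs), "MB left,",
              "room for another" if fits_another(jobs) else "full")
```

With three jobs running there should still be about 3000 MB unallocated under these assumptions, so on paper a fourth 1500 MB job fits. Page cache does not enter into the calculation at all.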
To work out what's happening, look at the condor_status output: this will show you all the allocated slots and also the top partitionable slot, which has all the remaining resources assigned to it. And when three jobs are running but the fourth isn't, look at condor_q -better-analyze <jobid>, where <jobid> is the ID of the job which isn't running.
I can't tell without seeing that output what's happening, but there are lots of reasons why a job might not start.
The one which has bitten me in the past is that condor thinks the machine is in "owner" state (i.e. a human is sitting in front of the machine doing real work) and therefore won't start any new jobs. This is because condor has what appears to be a very rough way of tracking how much load average is due to condor jobs and how much due to non-condor jobs; in my experience it can easily think that the non-condor jobs account for a load average of more than 0.3.
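To illustrate the idea (this is not condor's actual code, only a simplified model of the behaviour described above): the non-condor load is whatever is left of the load average after subtracting the part attributed to condor jobs, and the machine looks "busy with other work" once that remainder passes the threshold. The 0.3 figure comes from the paragraph above; the function names and sample load values are made up for the example, and the real policy expressions may take other things into account as well, such as keyboard activity.

```python
# Simplified model of the "owner" check described above.
# Hypothetical names; the 0.3 threshold is the figure quoted in the post.
BACKGROUND_LOAD = 0.3


def non_condor_load(total_load_avg, condor_load_avg):
    """Load average not attributed to condor jobs (never negative)."""
    return max(0.0, total_load_avg - condor_load_avg)


def looks_owner_busy(total_load_avg, condor_load_avg,
                     threshold=BACKGROUND_LOAD):
    """True if the leftover load is high enough to look like a human at work."""
    return non_condor_load(total_load_avg, condor_load_avg) > threshold


if __name__ == "__main__":
    # Three jobs each using a full core, but the accounting credits condor
    # with only 3.0 of a 3.45 load average: the remainder exceeds 0.3.
    print(looks_owner_busy(3.45, 3.0))
    # With better attribution the remainder stays under the threshold.
    print(looks_owner_busy(3.1, 3.0))
```

The point is that a small error in the attribution is enough: on a dedicated machine with three busy jobs, being off by less than half a core already makes the box look occupied.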
Because these machines were dedicated to htcondor jobs, I fixed this problem by setting

    START = True

in /etc/condor/condor_config.local. Recently, after reading some more, I think a better way may be to set

    IS_OWNER = False

but I've not tested that. (The benefit is that you can still use the START expression to decide whether to run particular jobs, based on the attributes of the job, but you will never end up going into the 'owner' state.)
Regards, Brian.