------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group
Leader.
On Tue, 2 Sep 2008, Espen Braastad wrote:
Hello,
I'm having the (apparently common) problem that jobs won't run on
certain cluster nodes. I've analyzed the requirements and I've found the
cause. The cause is that one of the job requirements is that Disk >=
DiskUsage.
condor_q -l <jobid> says:
Disk = 10000
Do you mean Requirements = (Disk >= 10000)?
condor_status -l <nodeid> says:
Disk = 8128
TotalDisk = 32512
8128 >= 10000 is false, and the job is rejected.
Now to my question;
I am not able to find -where- condor gets the value 8128 from. On the
node itself, in the $HOME directory of <jobid> there is 16GB available.
I didn't find any explanation of the Disk attribute of the nodes in the
documentation either.
It's the amount of free disk in the EXECUTE directory divided by
the number of CPUs on the node.
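To illustrate with the numbers above: with TotalDisk = 32512 and Disk = 8128,
the machine would appear to be advertising 4 slots (an assumption on my part;
condor_status on that node would confirm the actual count), since 32512 / 4 = 8128.
A quick sketch of the arithmetic:

```python
# Per-slot Disk = (free disk in EXECUTE) / (number of slots).
# The slot count of 4 is assumed for illustration; check condor_status.
total_disk = 32512   # KB free in EXECUTE, from condor_status TotalDisk
num_slots = 4        # assumed number of slots/CPUs on the node
per_slot = total_disk // num_slots
print(per_slot)      # matches the advertised Disk value of 8128
```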
It is possible to partition it differently if you want, allowing
one slot to have more disk and another to have less, via the SLOTx_
macros.
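A sketch of what such an uneven split might look like in the condor_config,
using the slot-type mechanism (the percentages are invented for illustration;
adjust to your machine):

```
# Two slot types with unequal shares of the execute disk.
# Illustrative values only.
SLOT_TYPE_1 = cpus=1, disk=75%
SLOT_TYPE_2 = cpus=1, disk=25%
NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 1
```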
This is a good way to feed a START expression or a Requirements
expression, but it is not a way to kill a job that is using too
much disk, because it is not fast enough on the draw: the value
can take 45 minutes to get back to the PERIODIC_HOLD or
PERIODIC_REMOVE expressions.
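For the record, the two kinds of expressions being contrasted might look
something like this (the thresholds are invented for illustration):

```
# Machine config: only start jobs on slots advertising > 10000 KB.
# Evaluated at match time, so the lag does not matter here.
START = (Disk > 10000)

# Submit file: hold the job if its disk usage exceeds 20000 KB.
# Unreliable for enforcement, since DiskUsage can lag by ~45 minutes.
periodic_hold = (DiskUsage > 20000)
```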