Ever since we upgraded Condor from 7.4 to 7.6, the new negotiation
algorithm used when groups are enabled has been biting us. I have this
ticket open, which I'm hoping to get a response to:
http://www.cs.wisc.edu/condor/fermi-tickets/22715.html
That ticket deals only with us not running the jobs we expect to run; at
least there the slots stay full.
Now I see we also aren't running our monitoring jobs, which use the
glideinwms monitoring slot. Here is the monitoring job, which is targeted
at the monitoring slot for a specific glidein (full classad below).
Output from condor_q:
1543992.0 willis 11/9 11:00 0+00:00:00 I 0 0.0 mon.sh
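For context, the job is pinned to the slot by name on the submit side. The
real submit file is generated inside the glidein, but reconstructed from
the classad below it amounts to roughly this (a sketch, not the actual
file):

  universe            = vanilla
  executable          = mon.sh
  # pin the job to the one monitoring slot for this glidein
  requirements        = (Name =?= "monitor_30769@xxxxxxxxxxxxxxxxxxxx") && (Arch =!= "Absurd")
  +GLIDEIN_Is_Monitor = True
  queue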
From the negotiator log, with full debug (D_FULLDEBUG) enabled:
11/09/11 11:03:00 ---------- Started Negotiation Cycle ----------
11/09/11 11:03:00 Phase 1: Obtaining ads from collector ...
11/09/11 11:03:00 Getting all public ads ...
11/09/11 11:03:00 Trying to query collector <131.225.240.215:9618>
11/09/11 11:03:08 Sorting 8584 ads ...
<snip>
11/09/11 11:03:08 Ignoring submitter willis@xxxxxxxx with no requested jobs
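Note that last line. To see what the negotiator thinks willis has
requested, the submitter ads in the collector can be dumped; IdleJobs and
RunningJobs per accounting group are what should matter here (this assumes
the submitter ad's Name matches the name in the log line):

  condor_status -submitters -long -constraint 'Name == "willis@xxxxxxxx"'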
The classad of the job:
[cdfcaf@fcdfhead10 /export/condor_local/log] condor_q -name
schedd_3@xxxxxxxxxxxxxxxxxxx -l 1543992.0
-- Schedd: schedd_3@xxxxxxxxxxxxxxxxxxx : <131.225.240.215:50394>
PeriodicRemove = ( CurrentTime > 1320858524 )
CommittedSlotTime = 0
Out = "_condor_stdout"
ImageSize_RAW = 1
NumCkpts_RAW = 0
AutoClusterAttrs =
"CAFGroup,CAFAcctGroup,CAF_DEFAULT_START,GLIDEIN_Is_Monitor,CAFDH"
EnteredCurrentStatus = 1320858014
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 1
Cmd = "/tmp/glidein_intmon_HzIdSU/mon.sh"
x509UserProxyVOName = "cdf"
CurrentHosts = 0
Iwd = "/tmp/glidein_intmon_HzIdSU"
CumulativeSlotTime = 0
ExecutableSize_RAW = 1
CondorVersion = "$CondorVersion: 7.6.2 Jul 14 2011 BuildID: 351672 $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
Arguments = ""
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 1543992
In = "/dev/null"
LocalUserCpu = 0.0
x509UserProxyFQAN =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis,/cdf/Role=NULL/Capability=NULL"
MinHosts = 1
Environment = ""
JobUniverse = 5
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!=
undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "schedd_3@xxxxxxxxxxxxxxxxxxx#1543992.0#1320858014"
x509UserProxyFirstFQAN = "/cdf/Role=NULL/Capability=NULL"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 1
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
UserLog = "/tmp/glidein_intmon_HzIdSU/mon.log"
GLIDEIN_Is_Monitor = true
ExecutableSize = 1
MaxHosts = 1
ServerTime = 1320858260
CoreSize = 0
DiskUsage_RAW = 1
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "_condor_stderr"
x509userproxysubject =
"/DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=glidecaf/CN=cdf/CN=Willis K.
Sakumoto/CN=UID:willis"
AutoClusterId = 496
RequestCpus = 1
StreamErr = false
x509UserProxyExpiration = 1321256898
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
TransferOutputRemaps =
"_condor_stdout=/tmp/glidein_intmon_HzIdSU/mon.out;_condor_stderr=/tmp/glidein_intmon_HzIdSU/mon.err"
PeriodicHold = false
QDate = 1320858014
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: x86_64_rhap_5 $"
JobPrio = 0
LastSuspensionTime = 0
CurrentTime = time()
User = "willis@xxxxxxxx"
x509userproxy = "/export/CafCondor/tickets/x509cc_willis"
JobNotification = 0
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( ( Name =?= "monitor_30769@xxxxxxxxxxxxxxxxxxxx" ) && ( Arch =!=
"Absurd" ) ) && ( ( Memory >= 1 ) ) && ( TARGET.OpSys == "LINUX" ) && (
TARGET.Disk >= DiskUsage ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && (
TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "willis"
LastJobStatus = 0
TransferIn = false
The slot it wants is there:
[cdfcaf@fcdfhead10 /export/condor_local/log] condor_status -constraint 'name ==
"monitor_30769@xxxxxxxxxxxxxxxxxxxx"'
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
monitor_30769@fcdf LINUX X86_64 Owner Idle 5.870 393 0+23:01:13
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 1 1 0 0 0 0 0
Total 1 1 0 0 0 0 0
The slot is free and not usable by anything else, but this job won't run
within the 8 minutes we allow. Under 7.4 it would run on the next
negotiation cycle, since there is a slot sitting there free for it. Why
does the negotiator say "with no requested jobs" for the user "willis"
when there is clearly one in the queue?
I believe it has to do with the way that all the slots are now parcelled
out to groups (even jobs that are not in any group are counted, because
they get added to a <none> group), combined with the fact that we have
this set:
GROUP_ACCEPT_SURPLUS = True
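For reference, the shape of a group setup like ours is roughly the
following; the group names and quotas here are made up for illustration,
and only the GROUP_ACCEPT_SURPLUS line is quoted from our actual config:

  GROUP_NAMES = group_physics, group_mc
  # dynamic quotas as fractions of the pool
  GROUP_QUOTA_DYNAMIC_group_physics = 0.6
  GROUP_QUOTA_DYNAMIC_group_mc = 0.3
  # let unused quota from one group (including <none>) go to the others
  GROUP_ACCEPT_SURPLUS = True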
I'll keep digging, but I'm hoping someone has advice.
Thanks,
joe