
[Condor-users] classAd mismatch




Dear all,

I have set up a Condor pool spanning nodes from two clusters;
each cluster has its own filesystem domain ('bmi.oar.net' and
'cse.oar.net'). The head-node of one of the clusters serves as the central
manager, dedicated scheduler and the submit node for the entire pool.

Now, I wish to execute certain jobs only on a specific cluster. I tried to
achieve this by having these jobs require a specific FileSystemDomain in
their classAd. However, this request is never matched even though
unclaimed candidate resources exist in the pool.

A specific example: one of my jobs must be executed only on the cluster with
filesystem domain 'cse.oar.net'. (The submit node and this target cluster do
not share a common filesystem.)

The job's specific requirements are as follows:

[vijayskumar@bm-login ~]$ condor_q -long | grep Requirements
Requirements = (regexp("*.cse.oar.net", FileSystemDomain, "i")) && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)

The resources requested by the job are available:

[vijayskumar@bm-login ~]$ condor_status -const "regexp(\".cse.oar.net\", FileSystemDomain)"

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000  2009  0+00:50:04
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000  2009  0+00:50:04
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000  2009  0+00:50:04
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000  2009  0+00:50:04

[vijayskumar@bm-login ~]$ condor_status -long | grep FileSys | grep cse
FileSystemDomain = "cs41.cse.oar.net"
FileSystemDomain = "cs42.cse.oar.net"
FileSystemDomain = "cs43.cse.oar.net"
FileSystemDomain = "cs44.cse.oar.net"

However, no match is ever made, and the job's requests keep getting
rejected. (Here, 84544 is the ID of the job that never runs.)

[vijayskumar@bm-login ~]$ condor_q -better-analyze
------------------------------------------------------------------
84544.000:  Run analysis summary.  Of 20 machines,
     20 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

( regexp("*.cse.oar.net", FileSystemDomain, "i") ) && ( target.Arch == "X86_64" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( ( target.Memory * 1024 ) >= ImageSize )

Job ClassAd Requirements expression evaluates to false
----------------------------------------------------------------------

Why does a match not occur? Is there something wrong with the regular
expression in the job classAd? Any help is appreciated.
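
One visible difference between the two commands above is the pattern itself:
the job's Requirements uses "*.cse.oar.net" (a leading '*', as in a shell
glob), while the condor_status constraint that does find the machines uses
".cse.oar.net". The sketch below compares the two using Python's re module as
a stand-in. That Condor's regexp() treats a leading '*' the same way is an
assumption, not something confirmed here, and the third pattern is only a
hypothetical stricter alternative.

```python
import re

def try_match(pattern, domain):
    """Return True/False for a match, or None if the pattern does not compile."""
    try:
        return re.search(pattern, domain, re.IGNORECASE) is not None
    except re.error:
        # A quantifier such as '*' with nothing before it is not a valid regex.
        return None

# Domains taken from the condor_status output and GlobalJobId above.
domains = ["cs41.cse.oar.net", "cs42.cse.oar.net", "bm-login.bmi.oar.net"]

patterns = (
    "*.cse.oar.net",       # pattern from the job's Requirements
    ".cse.oar.net",        # pattern from the condor_status constraint
    r"\.cse\.oar\.net$",   # hypothetical stricter form: literal dots, anchored
)

for pattern in patterns:
    print(pattern, [try_match(pattern, d) for d in domains])
```

If the matchmaker's regex engine rejects the leading '*' in the same way, the
regexp() call would not evaluate to true for any machine, which would be
consistent with all 20 machines being rejected by the job's requirements.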

Thanks for your time,

-Vijay

PS: here is the complete classAd for the job that just refuses to get
executed:

MyType = "Job"
TargetType = "Machine"
ClusterId = 84544
QDate = 1227117554
CompletionDate = 0
Owner = "vijayskumar"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumJobStarts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 7.0.1 Feb 26 2008 BuildID: 76180 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/home/vijayskumar/pegasusrun/vijayskumar/pegasus/Template_P10runC1/run0001"
JobUniverse = 5
TransferExecutable = FALSE
Cmd = "/home/vijayskumar/installed/pegasus/default/bin/kickstart"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobStatus = 1
EnteredCurrentStatus = 1227117554
JobPrio = 0
User = "vijayskumar@.oar.net"
NiceUser = FALSE
EnvDelim = ";"
JobNotification = 0
WantRemoteIO = TRUE
UserLog = "/tmp/Template_P10runC1-053166.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "/home/vijayskumar/pegasusrun/vijayskumar/pegasus/Template_P10runC1/run0001/Template_P10runC1_0_cseri_cdir.out"
StreamOut = FALSE
Err = "/home/vijayskumar/pegasusrun/vijayskumar/pegasus/Template_P10runC1/run0001/Template_P10runC1_0_cseri_cdir.err"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 172
ImageSize = 175
ExecutableSize_RAW = 172
ExecutableSize = 175
DiskUsage_RAW = 172
DiskUsage = 175
Requirements = (regexp("*.cse.oar.net", FileSystemDomain, "i")) && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = ".bmi.oar.net"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = (NumSystemHolds <= 3)
PeriodicRemove = (NumSystemHolds > 3)
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Arguments = "-n pegasus::dirmanager -N pegasus::dirmanager:1.0 -R cseri -w /home/vijayskumar/pegasusrun/work /home/vijayskumar/installed/pegasus/default/bin/dirmanager --create --dir /home/vijayskumar/pegasusrun/work/pegasusexec/vijayskumar/pegasus/Template_P10runC1/run0001"
DAGNodeName = "Template_P10runC1_0_cseri_cdir"
pegasus_job_id = "Template_P10runC1_0_cseri_cdir"
pegasus_wf_xformation = "pegasus::dirmanager"
pegasus_site = "cseri"
pegasus_generator = "Pegasus"
pegasus_version = "2.2.0cvs"
DAGManJobId = 84542
pegasus_wf_time = "20081119T125815-0500"
pegasus_wf_name = "Template_P10runC1-0"
DAGParentNodeNames = ""
pegasus_job_class = 6
GlobalJobId = "bm-login.bmi.oar.net#1227117554#84544.0"
ProcId = 0
AutoClusterId = 0
AutoClusterAttrs = "Scheduler,JobUniverse,LastCheckpointPlatform,NumCkpts,FileSystemDomain,DiskUsage,ImageSize,Requirements,NiceUser"
WantMatchDiagnostics = TRUE
LastRejMatchReason = "no match found"
LastRejMatchTime = 1227118530
ServerTime = 1227118532