[Condor-users] Jobs stay Idle ... been looking for 24 hours....
- Date: Thu, 26 Apr 2007 14:46:33 -0400
- From: "Askar Zaidi" <askar.zaidi@xxxxxxxxx>
- Subject: [Condor-users] Jobs stay Idle ... been looking for 24 hours....
Hi,
My jobs stay idle forever. Here are the stats:
1) condor_status:
Name          OpSys  Arch    State      Activity  LoadAv  Mem  ActvtyTime
vm1@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   378  0+00:10:09
vm2@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   378  0+00:10:10
vm3@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   378  0+00:10:11
vm4@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   378  0+00:10:12
comparch.bing LINUX  INTEL   Owner      Idle      0.000   241  0+00:10:04
vm1@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.030   504  0+00:10:09
vm2@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   504  0+00:10:10
vm3@xxxxxxxxx LINUX  INTEL   Owner      Idle      0.000   504  0+00:10:11
vm1@xxxxxxxxx LINUX  X86_64  Owner      Idle      0.890   250  0+00:35:10
vm2@xxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:00:05
vm3@xxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:00:06
vm4@xxxxxxxxx LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:00:07
vm1@clouseau. LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:10:04
vm2@clouseau. LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:10:05
vm3@clouseau. LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:10:06
vm4@clouseau. LINUX  X86_64  Unclaimed  Idle      0.000   250  0+00:10:07
vm1@dogmatix. LINUX  X86_64  Owner      Idle      0.110   501  0+00:10:10
vm2@dogmatix. LINUX  X86_64  Owner      Idle      0.000   501  0+00:10:11
vm3@dogmatix. LINUX  X86_64  Owner      Idle      0.000   501  0+00:10:12
vm4@dogmatix. LINUX  X86_64  Owner      Idle      0.000   501  0+00:10:13
vm1@xxxxxxxxx LINUX  X86_64  Owner      Idle      0.000   250  0+00:10:10
vm2@xxxxxxxxx LINUX  X86_64  Owner      Idle      0.000   250  0+00:10:11
vm3@xxxxxxxxx LINUX  X86_64  Owner      Idle      0.000   250  0+00:10:12
vm4@xxxxxxxxx LINUX  X86_64  Owner      Idle      0.000   250  0+00:10:13

              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX       8      8        0          0        0           0         0
X86_64/LINUX     16      9        0          7        0           0         0
Total            24     17        0          7        0           0         0
NOTE: No problem here; all machines are recognized by the central manager.
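One way to read the listing is to cross-tabulate Arch against State, since the totals table only shows each dimension per architecture row. The Python sketch below is a rough illustration, not a Condor tool: it assumes whitespace-separated rows with the eight columns shown in the listing, and the function name tally_by_arch_and_state is made up for this example. Only a handful of rows are copied in as sample input.

```python
from collections import Counter

# A few rows copied from the condor_status listing (host names are masked
# in the archive). Paste the full listing here to tally the whole pool.
SAMPLE = """\
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
vm1@xxxxxxxxx LINUX INTEL Owner Idle 0.000 378 0+00:10:09
comparch.bing LINUX INTEL Owner Idle 0.000 241 0+00:10:04
vm1@xxxxxxxxx LINUX X86_64 Owner Idle 0.890 250 0+00:35:10
vm2@xxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 250 0+00:00:05
vm1@clouseau. LINUX X86_64 Unclaimed Idle 0.000 250 0+00:10:04
"""


def tally_by_arch_and_state(text):
    """Count slots per (Arch, State) pair in condor_status-style rows."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a
        # full eight-column slot row.
        if len(fields) != 8 or fields[0] == "Name":
            continue
        arch, state = fields[2], fields[3]
        counts[(arch, state)] += 1
    return counts


counts = tally_by_arch_and_state(SAMPLE)
for (arch, state), n in sorted(counts.items()):
    print(f"{arch:8} {state:10} {n}")
```

Run against the full listing, this kind of tally makes it easy to see that every INTEL slot is in the Owner state and every Unclaimed slot is X86_64.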
2) condor_q -analyze 2.0
-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
002.000: Run analysis summary. Of 24 machines,
16 are rejected by your job's requirements
8 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
No successful match recorded.
Last failed match: Thu Apr 26 14:32:52 2007
Reason for last match failure: no match found
---------------------------------------------------------------------------------------------------------------------------------------------------------
NOTE: 8 reject your job because of their own requirements
3) condor_q -better 2.0
-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
---
002.000: Run analysis summary. Of 24 machines,
16 are rejected by your job's requirements
8 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
No successful match recorded.
Last failed match: Thu Apr 26 14:32:52 2007
Reason for last match failure: no match found
The Requirements expression for your job is:
( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) ) &&
( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize )
    Condition                                                                       Machines Matched  Suggestion
    ---------                                                                       ----------------  ----------
1   ( target.Arch == "INTEL" )                                                      8
2   ( target.OpSys == "LINUX" )                                                     24
3   ( ( target.CkptArch == target.Arch ) || ( target.CkptArch is undefined ) )      24
4   ( ( target.CkptOpSys == target.OpSys ) || ( target.CkptOpSys is undefined ) )   24
5   ( target.Disk >= 10000 )                                                        24
6   ( ( 1024 * target.Memory ) >= 10000 )                                           24
----------------------------------------------------------------------------------------------------------------------------------
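The per-condition counts in the table can be reproduced with a small Python sketch. This is only an approximation of ClassAd matching, not how condor_q evaluates it: machine ads are modelled as plain dicts, the Arch and Mem values come from the condor_status listing in 1), CkptArch and CkptOpSys are assumed to be undefined on every machine, and the Disk value is a made-up placeholder because the listing does not show it.

```python
# Job-side values from the job ad (DiskUsage and ImageSize, both 10000).
JOB = {"DiskUsage": 10000, "ImageSize": 10000}


def machine(arch, memory_mb, disk_kb=1_000_000):
    # disk_kb is a hypothetical placeholder; Disk is not in the listing.
    return {"Arch": arch, "OpSys": "LINUX", "Memory": memory_mb, "Disk": disk_kb}


# Pool composition taken from the condor_status listing: 8 INTEL slots
# and 16 X86_64 slots, with the Mem column as Memory.
POOL = (
    [machine("INTEL", 378)] * 4
    + [machine("INTEL", 241)]
    + [machine("INTEL", 504)] * 3
    + [machine("X86_64", 250)] * 12
    + [machine("X86_64", 501)] * 4
)

# Each clause of the job's Requirements, in the order the table lists them.
# dict.get() returning None stands in for an undefined attribute.
CLAUSES = [
    ('Arch == "INTEL"', lambda m: m["Arch"] == "INTEL"),
    ('OpSys == "LINUX"', lambda m: m["OpSys"] == "LINUX"),
    ("CkptArch matches or undefined",
     lambda m: m.get("CkptArch") is None or m["CkptArch"] == m["Arch"]),
    ("CkptOpSys matches or undefined",
     lambda m: m.get("CkptOpSys") is None or m["CkptOpSys"] == m["OpSys"]),
    ("Disk >= DiskUsage", lambda m: m["Disk"] >= JOB["DiskUsage"]),
    ("Memory * 1024 >= ImageSize",
     lambda m: m["Memory"] * 1024 >= JOB["ImageSize"]),
]

per_clause = [sum(1 for m in POOL if test(m)) for _, test in CLAUSES]
all_clauses = sum(1 for m in POOL if all(test(m) for _, test in CLAUSES))

for (label, _), n in zip(CLAUSES, per_clause):
    print(f"{label:32} {n}")
print(f"{'all clauses together':32} {all_clauses}")
```

Under these assumptions only the Arch clause narrows the pool, which is consistent with the 16 machines reported as rejected by the job's requirements.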
4) SchedLog on central manager:
4/26 14:24:50 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) Called reschedule_negotiator()
4/26 14:24:50 (pid:1888) DaemonCore: Command received via TCP from host <128.226.128.31:42297>
4/26 14:24:50 (pid:1888) DaemonCore: received command 493 (NEGOTIATE_WITH_SIGATTRS), calling handler (doNegotiate)
4/26 14:24:50 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:24:50 (pid:1888) AutoCluster:config() significant atttributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
4/26 14:24:50 (pid:1888) Checking consistency running and runnable jobs
4/26 14:24:50 (pid:1888) Tables are consistent
4/26 14:24:50 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
4/26 14:29:50 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Activity on stashed negotiator socket
4/26 14:29:50 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:29:50 (pid:1888) Checking consistency running and runnable jobs
4/26 14:29:50 (pid:1888) Tables are consistent
4/26 14:29:50 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
4/26 14:32:48 (pid:1888) DaemonCore: Command received via TCP from host <128.226.128.31:54711>
4/26 14:32:48 (pid:1888) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
4/26 14:32:52 (pid:1888) DaemonCore: Command received via UDP from host <128.226.128.31:35612>
4/26 14:32:52 (pid:1888) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
4/26 14:32:52 (pid:1888) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Called reschedule_negotiator()
4/26 14:32:52 (pid:1888) Activity on stashed negotiator socket
4/26 14:32:52 (pid:1888) Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxxxxxx
4/26 14:32:52 (pid:1888) Checking consistency running and runnable jobs
4/26 14:32:52 (pid:1888) Tables are consistent
4/26 14:32:52 (pid:1888) Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
5) StartLog on central manager:
4/26 13:57:49 ******************************************************
4/26 13:57:49 ** condor_startd (CONDOR_STARTD) STARTING UP
4/26 13:57:49 ** /home/condor/condor/sbin/condor_startd
4/26 13:57:49 ** $CondorVersion: 6.8.4 Feb 1 2007 $
4/26 13:57:49 ** $CondorPlatform: I386-LINUX_RHEL3 $
4/26 13:57:49 ** PID = 1887
4/26 13:57:49 ** Log last touched 4/26 13:57:43
4/26 13:57:49 ******************************************************
4/26 13:57:49 Using config source: /home/condor/condor/etc/condor_config
4/26 13:57:49 Using local config sources:
4/26 13:57:49 /home/condor/hosts/comparch/condor_config.local
4/26 13:57:49 DaemonCore: Command Socket at <128.226.128.31:34245>
4/26 13:57:56 New machine resource allocated
4/26 13:57:56 About to run initial benchmarks.
4/26 13:58:00 Completed initial benchmarks.
4/26 14:13:00 State change: IS_OWNER is false
4/26 14:13:00 Changing state: Owner -> Unclaimed
4/26 14:23:00 State change: IS_OWNER is TRUE
4/26 14:23:00 Changing state: Unclaimed -> Owner
6) condor_q -l 2.0
-- Submitter: comparch.binghamton.edu : <128.226.128.31:39183> : comparch.binghamton.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 2
QDate = 1177612372
CompletionDate = 0
Owner = "condor"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.4 Feb 1 2007 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/home/condor"
JobUniverse = 1
Cmd = "/home/condor/leftouts"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = TRUE
WantCheckpoint = TRUE
JobStatus = 1
EnteredCurrentStatus = 1177612372
JobPrio = 0
User = "condor@xxxxxxxxxxxxxxxxxxxxxxx"
NiceUser = FALSE
MaxJobRetirementTime = 0
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/home/condor/leftouts.log"
CoreSize = 0
KillSig = "SIGTSTP"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "leftouts.out"
StreamOut = FALSE
Err = "/dev/null"
TransferErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 4
ImageSize = 10000
ExecutableSize_RAW = 4
ExecutableSize = 10000
DiskUsage_RAW = 4
DiskUsage = 10000
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED)) && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
FileSystemDomain = "comparch.binghamton.edu"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
LeaveJobInQueue = FALSE
Arguments = ""
GlobalJobId = "comparch.binghamton.edu#1177612372#2.0"
ProcId = 0
AutoClusterId = 1
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,DiskUsage,ImageSize,Requirements,NiceUser"
WantMatchDiagnostics = TRUE
LastRejMatchReason = "no match found"
LastRejMatchTime = 1177612672
ServerTime = 1177612764
---------------------------------------------------------------
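The time fields in the job ad are Unix epoch seconds, which makes them hard to line up with the log timestamps by eye. A minimal Python sketch for converting them follows. It assumes the logs are in the same UTC-4 offset as the Date header of this message (-0400); the helper name to_local is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the submit host is UTC-4, matching the -0400
# offset in the Date header of this message.
EDT = timezone(timedelta(hours=-4))

# Epoch-second fields copied from the condor_q -l output.
STAMPS = {
    "QDate": 1177612372,
    "EnteredCurrentStatus": 1177612372,
    "LastRejMatchTime": 1177612672,
    "ServerTime": 1177612764,
}


def to_local(epoch_seconds, tz=EDT):
    """Convert Unix epoch seconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz)


for name, value in STAMPS.items():
    print(f"{name:22} {to_local(value):%Y-%m-%d %H:%M:%S}")
```

With that offset, QDate falls at 14:32:52 on 4/26, the same second as the RESCHEDULE entry in the SchedLog.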
I think these are all the stats needed to debug this.
I haven't specified any Requirements in the job submit file.
I don't have any PERMISSION_DENIED errors either.
My condor_config file is correct; it's all set.
I have been trying to debug this for 24 hours now.
Any help would be appreciated.
thanks,
Askar