[Condor-users] condor 6.8.2 + RHEL 4 - jobs stay idle, never run
- Date: Fri, 17 Nov 2006 10:07:00 -0800
- From: Lee Damon <nomad@xxxxxxxxxxxxxxxxxxxxxxxxx>
I've got a working 6.6.10 pool but there doesn't seem to be a 6.6.x
release for RHEL 4-amd64 so I'm trying to get 6.8.2 working on those
hosts.
I'm thinking maybe my problem is caused by a new host Requirement for
Checkpoint stuff:
6.6.10:
Requirements = START
6.8.2:
Requirements = (START) && (IsValidCheckpointPlatform)
We're using the vanilla universe here - or trying to, anyway. I have
it set as the default in condor_config and the job control file says
to use it as well. I have not, however, found whatever is adding
"&& (IsValidCheckpointPlatform)" to the host's Requirements. It isn't
in condor_config or condor_config.local.
When I look in the logs for what is going on I see the following on
the submit host:
11/17 09:55:02 (pid:15196) Activity on stashed negotiator socket
11/17 09:55:02 (pid:15196) Negotiating for owner: nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Checking consistency running and runnable jobs
11/17 09:55:02 (pid:15196) Tables are consistent
11/17 09:55:02 (pid:15196) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
11/17 09:55:02 (pid:15196) Sent ad to central manager for nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Sent ad to 1 collectors for nomad@ee.(domain obscured)
11/17 09:55:02 (pid:15196) Sent RELEASE_CLAIM to startd on <128.208.233.100:41261>
11/17 09:55:02 (pid:15196) Match record (<128.208.233.100:41261>, 2, 0) deleted
On the host it is negotiating with I see:
11/17 09:55:02 DaemonCore: Command received via UDP from condor from host <128.208.232.24:33683>
11/17 09:55:02 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
11/17 09:55:02 vm1: match_info called
11/17 09:55:02 vm1: Received match <128.208.233.100:41261>#1163786034#5
11/17 09:55:02 vm1: State change: match notification protocol successful
11/17 09:55:02 vm1: Changing state: Unclaimed -> Matched
11/17 09:55:02 DaemonCore: Command received via TCP from condor from host <128.208.232.90:37122>
11/17 09:55:02 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
11/17 09:55:02 vm1: Request to claim resource refused.
11/17 09:55:02 vm1: Job requirements not satisfied.
11/17 09:55:02 vm1: State change: claiming protocol failed
11/17 09:55:02 vm1: Changing state: Matched -> Owner
11/17 09:55:02 vm1: State change: IS_OWNER is false
11/17 09:55:02 vm1: Changing state: Owner -> Unclaimed
11/17 09:55:02 DaemonCore: Command received via UDP from condor from host <128.208.232.90:35833>
11/17 09:55:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
11/17 09:55:02 Warning: can't find resource with ClaimId (<128.208.233.100:41261>#1163786034#5)
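To follow the slot's path through a log like the one above without reading every line, the "Changing state" entries can be pulled out on their own. This is only a rough Python sketch: the regular expression is an assumption based on the line format shown in this excerpt, and the names TRANSITION, state_transitions and SAMPLE are made up for illustration.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "11/17 09:55:02 vm1: Changing state: Unclaimed -> Matched"
TRANSITION = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) (\w+): Changing state: (\w+) -> (\w+)$"
)


def state_transitions(lines):
    """Return (timestamp, slot, old_state, new_state) for each transition line."""
    found = []
    for line in lines:
        match = TRANSITION.match(line.strip())
        if match:
            found.append(match.groups())
    return found


# A few lines copied from the startd log in this message.
SAMPLE = """\
11/17 09:55:02 vm1: Changing state: Unclaimed -> Matched
11/17 09:55:02 vm1: Request to claim resource refused.
11/17 09:55:02 vm1: Job requirements not satisfied.
11/17 09:55:02 vm1: Changing state: Matched -> Owner
11/17 09:55:02 vm1: Changing state: Owner -> Unclaimed
""".splitlines()

if __name__ == "__main__":
    for stamp, slot, old, new in state_transitions(SAMPLE):
        print(stamp, slot, old, "->", new)
```

On the sample lines this lists the three transitions Unclaimed -> Matched, Matched -> Owner and Owner -> Unclaimed, which makes it easier to see that the claim is refused right after the match.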
I turned on D_ALL debugging levels and still don't see what is causing
the rejection. It just says it is rejecting the job.
condor_q -analyze says:
-- Submitter: stefen.ee.washington.edu : <128.208.232.90:37109> : stefen.ee.washington.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
002.000: Run analysis summary. Of 385 machines,
385 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
Last successful match: Fri Nov 17 09:56:18 2006
WARNING: Be advised:
No resources matched request's constraints
Check the Requirements expression below:
Requirements = ((MY.RESOURCE_GROUP == TARGET.JOB_GROUP)) && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
1 jobs; 1 idle, 0 running, 0 held
When I check the Requirements listed here they all match. I can't find
anything that doesn't match.
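One way to make that clause-by-clause check less error-prone is to split the Requirements expression at its top-level "&&" operators and compare each piece against the machine's attributes (for example, from condor_status -l on the target host). The following is a minimal Python sketch, not a ClassAd evaluator: it only splits the string, and the names split_top_level_and and REQUIREMENTS are illustrative.

```python
def split_top_level_and(expr):
    """Split an expression on '&&' operators outside parentheses and quotes."""
    clauses = []
    depth = 0          # current parenthesis nesting level
    start = 0          # index where the current clause begins
    in_string = False  # True while inside a double-quoted literal
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith("&&", i):
                clauses.append(expr[start:i].strip())
                i += 2
                start = i
                continue
        i += 1
    clauses.append(expr[start:].strip())
    return clauses


# The Requirements expression reported by condor_q -analyze in this message.
REQUIREMENTS = (
    '((MY.RESOURCE_GROUP == TARGET.JOB_GROUP)) && (Arch == "X86_64") && '
    '(OpSys == "LINUX") && (Disk >= DiskUsage) && '
    '((Memory * 1024) >= ImageSize) && '
    '(TARGET.FileSystemDomain == MY.FileSystemDomain)'
)

if __name__ == "__main__":
    for number, clause in enumerate(split_top_level_and(REQUIREMENTS), 1):
        print(number, clause)
```

Each printed clause can then be checked on its own against the job and machine ads, so that a single mismatched or undefined attribute is easier to spot.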
I've tried this with our production condor_master (6.6.10) as well as
with a 6.8.2 master.
Can anyone offer any advice|guidance? Please?
nomad
Sr. System Admin, UWEE SSLI Lab