[HTCondor-users] Another jobs stuck in idle issue
- Date: Thu, 18 Jun 2015 12:43:24 +0100
- From: Roderick Johnstone <rmj@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] Another jobs stuck in idle issue
Hi
I just updated my condor cluster from Fedora 20 to Fedora 22 (x86_64).
Condor is installed from the Fedora repos and went from 8.1.1 to 8.3.1.
I kept the same configuration files.
Now my jobs sit in the Idle state on the submit node.
SchedLog on the submit host has:
06/18/15 12:14:57 Matched 6.0 aaa@xxxxxxx <0.0.0.0:57923> preempting none <0.0.0.0:47591> slot1@xxxxxxxxxxx

On zzz.xxx.xxx:
06/18/15 12:16:58 slot1_1: Request to claim resource refused.
06/18/15 12:16:58 slot1_1: Claiming protocol failed
06/18/15 12:16:58 slot1_1: Changing state: Owner -> Delete
06/18/15 12:16:58 Trying to update collector <yyy.xxx.xxx:9618>
06/18/15 12:16:58 Attempting to send update via UDP to collector yyy.xxx.xxx <0.0.0.0:9618>
06/18/15 12:16:58 slot1_1: Resource no longer needed, deleting
06/18/15 12:16:58 slot1: Total execute space: 11476772
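A minimal sketch for pulling the claim failures out of log lines like the ones above. It assumes the "MM/DD/YY HH:MM:SS" prefix and the two failure phrases exactly as they appear in this excerpt; the function name claim_failures is made up for illustration, and other HTCondor versions may word these messages differently.

```python
import re

# Timestamp prefix as seen in the excerpt, optionally followed by "slotN_M: ".
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:(slot[\w.]+): )?(.*)$"
)

# Phrases copied from the excerpt above (an assumption about log wording).
FAILURE_PHRASES = (
    "Request to claim resource refused",
    "Claiming protocol failed",
)


def claim_failures(lines):
    """Return (timestamp, slot, message) for each line reporting a failed claim."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(phrase in m.group(3) for phrase in FAILURE_PHRASES):
            hits.append((m.group(1), m.group(2), m.group(3)))
    return hits


sample = [
    "06/18/15 12:16:58 slot1_1: Request to claim resource refused.",
    "06/18/15 12:16:58 slot1_1: Claiming protocol failed",
    "06/18/15 12:16:58 slot1_1: Changing state: Owner -> Delete",
    "06/18/15 12:16:58 Trying to update collector <yyy.xxx.xxx:9618>",
]

for timestamp, slot, message in claim_failures(sample):
    print(timestamp, slot, message)
```

Running the same filter over the full log and looking at the lines just before each hit may show why the claim request was refused.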
Output from
$ condor_q -better-analyse
is below.
I'd appreciate any thoughts on how to diagnose this further.
I have D_FULLDEBUG set for STARTD and STARTER on the execute node, and
for SCHEDD, COLLECTOR and NEGOTIATOR on the submit node, so in principle
there is plenty of information, but I couldn't see anything obviously
relevant. StartLog on the execute node does, however, have entries like:
06/18/15 12:38:35 /proc format unknown for kernel version 4.0.4
Thanks
Roderick Johnstone
-- Submitter: yyy.xxx.xxx : <x.x.x.x:57923> : yyy.xxx.xxx
---
006.000: Request has not yet been considered by the matchmaker.
User priority for aaa@xxxxxxx is not available, attempting to analyze
without it.
---
006.000: Run analysis summary. Of 12 machines,
     11 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      1 are available to run your job
The Requirements expression for your job is:
( Machine == "zzz.xxx.xxx" ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) ||
( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
Your job defines the following attributes:
FileSystemDomain = "xxx.xxx"
DiskUsage = 1
RequestDisk = 1
RequestMemory = 10
The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  Machine == "zzz.xxx.xxx"
[9]          12  TARGET.HasFileTransfer

Suggestions:

    Condition                       Machines Matched  Suggestion
    ---------                       ----------------  ----------
1   ( Machine == "zzz.xxx.xxx" )    0                 REMOVE
2   ( TARGET.Arch == "X86_64" )     12
3   ( TARGET.OpSys == "LINUX" )     12
4   ( TARGET.Disk >= 1 )            12
5   ( TARGET.Memory >= 10 )         12
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "xxx.xxx" ) )
                                    12
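
To make the clause-by-clause analysis concrete, here is a rough sketch that re-evaluates the Requirements expression above against a single machine ad. The job attributes are copied from the output; the machine ad values are hypothetical placeholders rather than values read from the pool, and the helper names (eq, failing_clauses) are made up for illustration. It mimics only the case-insensitive string comparison of the ClassAd == operator and ignores UNDEFINED handling, so it is not a substitute for condor_q itself.

```python
# Job attributes copied from the condor_q output above.
job = {"FileSystemDomain": "xxx.xxx", "RequestDisk": 1, "RequestMemory": 10}

# Hypothetical machine ad: illustrative values only, not taken from the pool.
machine = {
    "Machine": "zzz.xxx.xxx",
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "Disk": 1000000,
    "Memory": 2048,
    "HasFileTransfer": True,
    "FileSystemDomain": "xxx.xxx",
}


def eq(a, b):
    """String equality ignoring case, as the ClassAd == operator does."""
    return str(a).lower() == str(b).lower()


# One entry per clause of the Requirements expression, in the same order.
CLAUSES = [
    ('Machine == "zzz.xxx.xxx"',
     lambda m, j: eq(m["Machine"], "zzz.xxx.xxx")),
    ('TARGET.Arch == "X86_64"',
     lambda m, j: eq(m["Arch"], "X86_64")),
    ('TARGET.OpSys == "LINUX"',
     lambda m, j: eq(m["OpSys"], "LINUX")),
    ("TARGET.Disk >= RequestDisk",
     lambda m, j: m["Disk"] >= j["RequestDisk"]),
    ("TARGET.Memory >= RequestMemory",
     lambda m, j: m["Memory"] >= j["RequestMemory"]),
    ("TARGET.HasFileTransfer || TARGET.FileSystemDomain == MY.FileSystemDomain",
     lambda m, j: m["HasFileTransfer"]
     or eq(m["FileSystemDomain"], j["FileSystemDomain"])),
]


def failing_clauses(machine_ad, job_ad):
    """Return the text of every clause the machine ad does not satisfy."""
    return [text for text, test in CLAUSES if not test(machine_ad, job_ad)]


print(failing_clauses(machine, job))
```

With a machine ad like this one every clause passes, which is consistent with the summary reporting one machine as available to run the job.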