Re: [Condor-users] Job requirements not satisfied even when Requirements = TRUE
- Date: Wed, 31 Aug 2011 20:58:18 -0700
- From: Mark Cafaro <cafarm@xxxxxx>
- Subject: Re: [Condor-users] Job requirements not satisfied even when Requirements = TRUE
And here is the job dump:
Out = "sh_loop.out"
LastJobStatus = 0
BufferBlockSize = 32768
JobNotification = 2
JobLeaseDuration = 1200
TransferFiles = "ONEXIT"
ImageSize_RAW = 1
FileSystemDomain = "IP_replaced.washington.edu"
StreamOut = false
NumRestarts = 0
ImageSize = 1
Cmd = "/sh_loop"
LeaveJobInQueue = false
PeriodicRemove = false
Iwd = "/"
PeriodicHold = false
CondorPlatform = "$CondorPlatform: x86_macos_10.4 $"
NumCkpts = 0
JobStatus = 1
ExitBySignal = false
EnteredCurrentStatus = 1314845476
ClusterId = 27
In = "/dev/null"
CondorVersion = "$CondorVersion: 7.6.3 Aug 17 2011 BuildID: 361356 $"
RemoteUserCpu = 0.0
WantRemoteSyscalls = false
MinHosts = 1
NumSystemHolds = 0
Environment = ""
JobUniverse = 5
PeriodicRelease = false
RequestDisk = DiskUsage
CumulativeSuspensionTime = 0
ExecutableSize = 1
Requirements = ( Arch == "X86_64" ) && ( TARGET.OpSys == "OSX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
RootDir = "/"
ShouldTransferFiles = "IF_NEEDED"
CommittedSlotTime = 0
GlobalJobId = "IP_replaced.washington.edu#27.0#1314845476"
LocalSysCpu = 0.0
DiskUsage = 1
WhenToTransferOutput = "ON_EXIT"
UserLog = "/sh_loop.log"
RequestMemory = ceiling(ifThenElse(JobVMMemory =!= undefined,JobVMMemory,ImageSize / 1024.000000))
NumCkpts_RAW = 0
ExecutableSize_RAW = 1
MaxHosts = 1
ServerTime = 1314849288
CoreSize = 0
WantCheckpoint = false
ProcId = 0
Err = "sh_loop.err"
CurrentHosts = 0
DiskUsage_RAW = 1
CommittedTime = 0
RemoteSysCpu = 0.0
OnExitRemove = true
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,FileSystemDomain,DiskUsage,ImageSize,RequestMemory,Requirements,NiceUser,ConcurrencyLimits"
TotalSuspensions = 0
RequestCpus = 1
LocalUserCpu = 0.0
StreamErr = false
NiceUser = false
AutoClusterId = 3
TargetType = "Machine"
QDate = 1314845476
CompletionDate = 0
OnExitHold = false
Rank = 0.0
JobPrio = 0
RemoteWallClockTime = 0.0
Args = "60"
NumJobStarts = 0
WantRemoteIO = true
CumulativeSlotTime = 0
CurrentTime = time()
User = "user@xxxxxxxxxxxxxxxxxxxxxxxxxx"
BufferSize = 524288
ExitStatus = 0
MyType = "Job"
CommittedSuspensionTime = 0
LastSuspensionTime = 0
Owner = "user"
TransferIn = false
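
For anyone following along, here is a rough sketch of how the two Requirements expressions could be checked by hand against the values in these dumps. It is not Condor code and does not reproduce what the startd evaluates at claim time. The helper names (ad_or, ad_and, ad_eq, ad_ge, job_side, machine_side) are made up for illustration, and None is assumed to stand in for the ClassAd UNDEFINED value. The machine values come from the slot2 dump quoted below, which contains no HasFileTransfer attribute, so that attribute is treated as undefined here.

```python
import math

# None stands in for the ClassAd UNDEFINED value in this sketch.
def ad_or(a, b):
    if a is True or b is True:
        return True
    return None if (a is None or b is None) else False

def ad_and(a, b):
    if a is False or b is False:
        return False
    return None if (a is None or b is None) else True

def ad_eq(a, b):
    # "==" yields UNDEFINED if either side is; string compares ignore case.
    if a is None or b is None:
        return None
    if isinstance(a, str) and isinstance(b, str):
        return a.lower() == b.lower()
    return a == b

def ad_ge(a, b):
    return None if (a is None or b is None) else a >= b

# Values copied from the job dump and the slot2 machine dump in this thread.
job = {"ImageSize": 1, "DiskUsage": 1, "JobUniverse": 5, "NumCkpts": 0,
       "FileSystemDomain": "IP_replaced.washington.edu"}
machine = {"Arch": "X86_64", "OpSys": "OSX", "Disk": 149575144,
           "Memory": 4096, "Start": True,
           "FileSystemDomain": "IP_replaced.washington.edu",
           "CheckpointPlatform": "OSX X86_64 10.8.0 normal N/A"}

def request_memory(job):
    # ceiling(ifThenElse(JobVMMemory =!= undefined, JobVMMemory, ImageSize / 1024.0))
    vm = job.get("JobVMMemory")
    return math.ceil(vm if vm is not None else job["ImageSize"] / 1024.0)

def job_side(job, machine):
    # One entry per clause of the job's Requirements expression.
    mem = machine.get("Memory")
    mem_kb = None if mem is None else mem * 1024
    return {
        "Arch": ad_eq(machine.get("Arch"), "X86_64"),
        "OpSys": ad_eq(machine.get("OpSys"), "OSX"),
        "Disk": ad_ge(machine.get("Disk"), job["DiskUsage"]),
        "Memory": ad_ge(mem_kb, job["ImageSize"]),
        "RequestMemory": ad_ge(request_memory(job) * 1024, job["ImageSize"]),
        "FileTransfer": ad_or(
            machine.get("HasFileTransfer"),
            ad_eq(machine.get("FileSystemDomain"), job.get("FileSystemDomain"))),
    }

def is_valid_checkpoint_platform(machine, job):
    not_standard = ad_eq(ad_eq(job.get("JobUniverse"), 1), False)
    has_platform = machine.get("CheckpointPlatform") is not None  # =!= undefined
    # =?= never yields UNDEFINED, so a plain Python comparison is used.
    same_platform = job.get("LastCheckpointPlatform") == machine.get("CheckpointPlatform")
    no_ckpts = ad_eq(job.get("NumCkpts"), 0)
    return ad_or(not_standard, ad_and(has_platform, ad_or(same_platform, no_ckpts)))

def machine_side(machine, job):
    # Requirements = ( START ) && ( IsValidCheckpointPlatform )
    return ad_and(machine.get("Start"), is_valid_checkpoint_platform(machine, job))

for clause, value in job_side(job, machine).items():
    print(f"{clause}: {value}")
print("machine Requirements:", machine_side(machine, job))
```

With the values copied above, every clause on both sides comes out True, which is at least consistent with the central manager making the match. Note that the StartLog rejections are for slot1, while the dump is for slot2, so this only checks what is visible in the thread.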
On Aug 31, 2011, at 8:42 PM, Mark Cafaro wrote:
> Here is a status dump of one of the nodes. Start = TRUE in the local config.
>
> Machine = "IP_replaced.washington.edu"
> LastHeardFrom = 1314847744
> UpdateSequenceNumber = 15
> CpuIsBusy = false
> HasVM = false
> FileSystemDomain = "IP_replaced.washington.edu"
> Name = "slot2@xxxxxxxxxxxxxxxxxxxxxxxxxx"
> NumPids = 0
> MonitorSelfTime = 1314847619
> KeyboardIdle = 3142
> TimeToLive = 2147483647
> LastBenchmark = 1314843560
> TotalDisk = 299150288
> MaxJobRetirementTime = 0
> Unhibernate = MY.MachineLastMatchTime =!= undefined
> CondorPlatform = "$CondorPlatform: x86_macos_10.4 $"
> LastUpdate = 1314843560
> UpdatesTotal = 32
> Cpus = 1
> IsValidCheckpointPlatform = ( ( ( TARGET.JobUniverse == 1 ) == false ) || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )
> MonitorSelfCPUUsage = 0.0
> ClockDay = 3
> StarterAbilityList = ""
> TotalTimeUnclaimedIdle = 4205
> CondorVersion = "$CondorVersion: 7.6.3 Aug 17 2011 BuildID: 361356 $"
> HasIOProxy = true
> MonitorSelfImageSize = 605636.000000
> HibernationSupportedStates = ""
> LastFetchWorkSpawned = 0
> Requirements = ( START ) && ( IsValidCheckpointPlatform )
> TotalMemory = 8192
> DaemonStartTime = 1314843539
> EnteredCurrentActivity = 1314843539
> MyAddress = "<IP_replaced:Port_replaced>"
> EnteredCurrentState = 1314843539
> CpuBusyTime = 0
> CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
> COLLECTOR_HOST_STRING = "host.washington.edu"
> Memory = 4096
> MyCurrentTime = 1314847744
> MonitorSelfRegisteredSocketCount = 1
> TotalCpus = 2
> ClockMin = 1228
> CurrentRank = 0.0
> NextFetchWorkDelay = -1
> AuthenticatedIdentity = "unauthenticated@unmapped"
> OpSys = "OSX"
> State = "Unclaimed"
> KFlops = 1369753
> UpdatesSequenced = 30
> UpdatesHistory = "0x00000000000000000000000000000000"
> Start = true
> MonitorSelfResidentSetSize = 3064
> Arch = "X86_64"
> Mips = 4232
> Activity = "Idle"
> ConsoleIdle = 3142
> LastFetchWorkCompleted = 0
> UpdatesLost = 0
> StartdIpAddr = "<IP_replaced:Port_replaced>"
> TargetType = "Job"
> TotalLoadAvg = 0.010000
> HibernationLevel = 0
> Rank = 0.0
> HibernationState = "NONE"
> MonitorSelfSecuritySessions = 3
> MonitorSelfAge = 4080
> LoadAvg = 0.0
> CheckpointPlatform = "OSX X86_64 10.8.0 normal N/A"
> CurrentTime = time()
> Disk = 149575144
> VirtualMemory = 799598
> TotalVirtualMemory = 1599196
> TotalSlots = 2
> UidDomain = "IP_replaced.washington.edu"
> SlotWeight = Cpus
> SlotID = 2
> MyType = "Machine"
> CanHibernate = false
> CondorLoadAvg = 0.0
> TotalCondorLoadAvg = 0.0
>
> On Aug 31, 2011, at 8:17 PM, Steven Timm wrote:
>
>>
>>
>> There must be something in the machine classad about the requirements
>> of what jobs it will start. Can you give a dump of condor_status -l
>> for one of the machines?
>>
>> Steve Timm
>>
>>
>> On Wed, 31 Aug 2011, Mark Cafaro wrote:
>>
>>> Hi Garrett,
>>
>> The job was successfully matched in the central manager's MatchLog (edited to remove ip and port):
>>
>> 08/31/11 20:05:16 Matched 27.0 user@...washington.edu <ip:port> preempting none <ip:port> slot1@...washington.edu
>>
>> On the node's StartLog is where I see it being rejected:
>>
>> 08/31/11 20:05:16 slot1: match_info called
>> 08/31/11 20:05:16 slot1: Received match <ip:port>#1314844613#28#...
>> 08/31/11 20:05:16 slot1: State change: match notification protocol successful
>> 08/31/11 20:05:16 slot1: Changing state: Unclaimed -> Matched
>> 08/31/11 20:05:16 slot1: Job requirements not satisfied.
>> 08/31/11 20:05:16 slot1: Request to claim resource refused.
>> 08/31/11 20:05:16 slot1: State change: claiming protocol failed
>> 08/31/11 20:05:16 slot1: Changing state: Matched -> Owner
>> 08/31/11 20:05:16 slot1: State change: IS_OWNER is false
>> 08/31/11 20:05:16 slot1: Changing state: Owner -> Unclaimed
>>
>>
>> condor_q -better-analyze returns:
>>
>> 027.000: Request has not yet been considered by the matchmaker.
>>
>> because it was successfully matched.
>>
>> Unfortunately I have been through all of the logs and there is no indication of a problem anywhere except for the line "Job requirements not satisfied."
>>
>>
>>
>> On Aug 31, 2011, at 7:45 PM, Koller, Garrett wrote:
>>
>>> Mr. Cafaro,
>>> I'm confused. I thought the problem was that the job kept being rejected with the error "Job requirements not satisfied." If that is so, how could it be matched in the MatchLog? Was it just considered in the MatchLog or was it actually assigned to a specific slot on a specific computer? If the MatchLog says it found a proper match and actually assigned it to that computer, check out http://servo.cs.wlu.edu/dokuwiki/doku.php/condor/submit/troubleshoot for a possible reason and solution to this problem.
>>> Also run 'condor_q -better-analyze' for a more in-depth look at why your job is being rejected. If the job is being rejected because of its requirements, this should tell you specifically which requirement is failing.
>>> Either way, let me know if this helps and what you find out.
>>> Best Regards,
>>> ~ Garrett Heath Koller
>>> kollerg14@xxxxxxxxxxxx
>>> Computer Science Major
>>> Member of the [...] Fraternity
>>> Washington and Lee University
>>> Undergraduate Class of 2014
>>> P.O. Box 970
>>> Lexington, VA 24450
>>> Cell: (918) 246-6374
>>> On Aug 31, 2011, at 10:17 PM, Mark Cafaro wrote:
>>>> No luck there either. That should certainly evaluate to true.
>>>> I am just about out of ideas. The only thing I can gather from the logs is "Job requirements not satisfied." and condor_q -analyze says "Request has not yet been considered by the matchmaker." apparently because the match was made (I can see it in the MatchLog).
>>>> I am desperately hoping this is not a platform specific bug. We're on the often forgotten Macintosh.
>>>> On Aug 31, 2011, at 7:00 PM, Koller, Garrett wrote:
>>>>> Mr. Cafaro,
>>>>> Sure, that's easy. Just run 'condor_status -long | grep ^IsValidCheckpointPlatform' to see the expression that defines the value for "IsValidCheckpointPlatform". The expression depends a lot on the job being submitted. Because of this, note that in this expression "MY.*" refers to a variable in the machine's ClassAd (will be listed in 'condor_status -long') and "TARGET.*" refers to a variable in the job's ClassAd (will be listed in 'condor_q -long').
>>>>> Best Regards,
>>>>> ~ Garrett K.
>>>>> Washington and Lee University
>>>>> condor.cs.wlu.edu
>>>>> On Aug 31, 2011, at 9:51 PM, Mark Cafaro wrote:
>>>>>> Hi Garrett,
>>>>>> I have investigated this possibility and found it is likely not causing our problem. Requirements is appended, but I can overwrite the appended requirements with condor_qedit. In either case, I would not expect a match to be made if the manager wasn't able to match the requirements with the node. The manager matches, but the node refuses.
>>>>>> I am wondering if this doesn't have to do with the fact that the node has:
>>>>>> Requirements = ( START ) && ( IsValidCheckpointPlatform )
>>>>>> I can't be sure that IsValidCheckpointPlatform evaluates to true on my platform. Is there any way to determine this?
>>>>>> On Aug 31, 2011, at 6:37 PM, Koller, Garrett wrote:
>>>>>>> Mr. Cafaro,
>>>>>>> The job's requirements expression is probably being appended to after it is submitted. Usually, the requirements in the submit file are logically ANDed (&&) with an expression that describes what the job needs from its execution machine in terms of file transfer. While the job is in the queue, run something like 'condor_q -long <Job_Cluster_ID> | grep -i ^Requirements', where <Job_Cluster_ID> is the ID of the job you just submitted. There you will see the Requirements expression in its entirety. Most likely, you are asking Condor to use a file transfer mechanism that isn't supported by your environment. See Section 2.5.4, "Submitting Jobs Without a Shared File System: Condor's File Transfer Mechanism," in the Condor manual (7.6.1 for me) for more information, and note where it discusses "FileSystemDomain" and the like, as this is one of the things appended to the job's Requirements expression depending on the type of file transfer desired.
>>>>>>> Best Regards,
>>>>>>> ~ Garrett K.
>>>>>>> Washington and Lee University
>>>>>>> condor.cs.wlu.edu
>>>>>>> On Aug 31, 2011, at 9:18 PM, Mark Cafaro wrote:
>>>>>>>> I am submitting sh_loop.cmd (from the Condor examples) to my manager. It matches with a node and sends the job off. The node, however, refuses to accept the job, claiming "Job requirements not satisfied." The job is set with Requirements = TRUE. How can requirements not be satisfied, and how can a match be made if the requirements were not satisfied?
>>>>>>>> _______________________________________________
>>>>>>>> Condor-users mailing list
>>>>>>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>>>>>>> subject: Unsubscribe
>>>>>>>> You can also unsubscribe by visiting
>>>>>>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>>>>>> The archives can be found at:
>>>>>>>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>
>> --
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> timm@xxxxxxxx http://home.fnal.gov/~timm/
>> Fermilab Computing Division, Scientific Computing Facilities,
>> Grid Facilities Department, FermiGrid Services Group, Group Leader.
>> Lead of FermiCloud project.
>