Re: [HTCondor-users] condor will not use preferred host at all
- Date: Mon, 19 Feb 2018 18:33:01 -0500
- From: Larry Martell <larry.martell@xxxxxxxxx>
- Subject: Re: [HTCondor-users] condor will not use preferred host at all
Here is the output of condor_q -better-analyze.
-- Schedd: bach.elucid.local : <192.168.10.2:20734>
The Requirements expression for job 21283.000 is
( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && (
TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
( TARGET.HasFileTransfer )
Job 21283.000 defines the following attributes:
DiskUsage = 0
ImageSize = 0
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(
ImageSize + 1023 ) / 1024)
The Requirements expression for job 21283.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         132  TARGET.Arch == "X86_64"
[1]         132  TARGET.OpSys == "LINUX"
[3]         132  TARGET.Disk >= RequestDisk
[5]         132  TARGET.Memory >= RequestMemory
[7]         132  TARGET.HasFileTransfer
No successful match recorded.
Last failed match: Mon Feb 19 18:30:13 2018
Reason for last match failure: no match found
21283.000: Run analysis summary ignoring user priority. Of 132 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
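
The odd part of that summary is that every bucket is 0 even though the analysis covers 132 machines. As a rough sanity check, here is a minimal Python sketch that parses the summary text pasted above and reports how many slots are not accounted for by any bucket. The function name parse_summary and the regular expressions are my own assumptions based only on the output shown here, not part of any HTCondor tool or API.

```python
import re

# Summary text copied from the condor_q -better-analyze output above.
SUMMARY = """\
21283.000: Run analysis summary ignoring user priority. Of 132 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {bucket description: count}).

    Assumes the layout shown above: one "Of N machines" header line
    followed by lines of the form "<count> <description>".
    """
    total = int(re.search(r"Of (\d+) machines", text).group(1))
    buckets = {}
    for line in text.splitlines():
        # Bucket lines start with an integer followed by a space; the
        # header line starts with "21283.000:" and so does not match.
        m = re.match(r"\s*(\d+) (.+)", line)
        if m:
            buckets[m.group(2).strip()] = int(m.group(1))
    return total, buckets


total, buckets = parse_summary(SUMMARY)
unaccounted = total - sum(buckets.values())
print(f"{unaccounted} of {total} slots fall into no bucket")
```

If the buckets were exhaustive, the unaccounted count would be 0. Here none of the 132 slots lands in any bucket, so the analyzer neither rejects chopin's slots nor lists them as available.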
On Mon, Feb 19, 2018 at 6:09 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> If I grep in all the logs for one of the ids for a job in the queue I
> see this in the MatchLog:
>
> Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx
> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
> preempting none
> <192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3>
> slot1@chopin
>
> This message repeats 132 times (once for each slot on chopin) and then
> I see this:
>
> Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx
> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
> no match found
>
> That sequence of 133 messages repeats over and over.
>
> Is this a clue to anything?
>
>
> On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> If I run ps -efal | grep condor on the 2 execute hosts the only
>> difference is that on chopin (the one I cannot get condor to use) it
>> has this:
>>
>> condor_shared_port
>>
>> That is not on liszt. Is that an issue?
>>
>> On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>> On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>>> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>>>>
>>>>> As a test I removed liszt from the config and it still will not use
>>>>> chopin. Even though a status shows all slots as 'Unclaimed Idle'.
>>>>>
>>>>
>>>> What do you mean when you say "removed liszt from the config" ?
>>>
>>> I set NUM_SLOTS to 0
>>>
>>>> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on
>>>> liszt) ?
>>>>
>>>> So now when you do a "condor_status", the only thing you see is slots on
>>>> chopin?
>>>
>>> Correct.
>>>
>>>> And yet your job remains idle and refuses to run on chopin?
>>>
>>> Correct.
>>>
>>>> Maybe your job does not match any slots on chopin
>>>
>>> What makes a match? Before the reboot the same jobs I am trying to run
>>> today were running on chopin and when that was full, ran on liszt.
>>>
>>>> -- perhaps because chopin
>>>> has less memory than liszt, for instance.
>>>
>>> The 2 machines are identical in every way - CPU, memory, etc. The
>>> reason they were rebooted was that a 10G point to point connection was
>>> installed between them.
>>>
>>>> See the manual section "Why is
>>>> the job not running?" at http://tinyurl.com/ycbut82r
>>>
>>> That tiny url does not work, but I've been looking at this page:
>>>
>>> http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
>>>
>>> and here is the output of condor_q -analyze:verbose when run on the master:
>>>
>>> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
>>> No successful match recorded.
>>> Last failed match: Mon Feb 19 17:19:52 2018
>>>
>>> Reason for last match failure: no match found
>>>
>>> 21283.000: Run analysis summary ignoring user priority. Of 132 machines,
>>> 0 are rejected by your job's requirements
>>> 0 reject your job because of their own requirements
>>> 0 match and are already running your jobs
>>> 0 match but are serving other users
>>> 0 are available to run your job
>>>
>>>>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> I have a master and 2 execute hosts (chopin and liszt) and I have one
>>>>>> host (chopin) preferred over the other with these settings:
>>>>>>
>>>>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>>>> (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>>>> (100 * Machine =?= "chopin")
>>>>>> NEGOTIATOR_DEPTH_FIRST = True
>>>>>>
>>>>>> The preferred host is chopin. This has been working fine until Friday
>>>>>> when both execute hosts were rebooted. Since then condor will only run
>>>>>> jobs on liszt. Even if there are more jobs in the queue than slots on
>>>>>> liszt it will not use chopin. A condor_status shows all the slots on
>>>>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
>>>>>> errors in the logs.
>>>>>>
>>>>>> How can I debug and/or fix this?
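
For reference, the intended effect of the NEGOTIATOR_PRE_JOB_RANK setting quoted in the original question can be sketched in Python. This is an illustrative model only, not HTCondor's ClassAd evaluator: the function name pre_job_rank and the example slot tuples are hypothetical. It reads the last term as 100 * (Machine =?= "chopin"), which is an assumption about the intent; how the ClassAd parser groups the unparenthesized form is not modeled. It also assumes the Machine attribute holds the short host name, as written in the config.

```python
def pre_job_rank(machine_rank, remote_owner, machine, preferred="chopin"):
    """Model of the quoted rank expression under the assumptions above.

    machine_rank : value of the slot's own Rank (My.Rank in the config)
    remote_owner : None stands in for an UNDEFINED RemoteOwner
    machine      : value of the slot's Machine attribute
    """
    unclaimed = remote_owner is None      # RemoteOwner =?= UNDEFINED
    on_preferred = machine == preferred   # Machine =?= "chopin"
    return (10_000_000 * machine_rank
            + 1_000_000 * int(unclaimed)
            + 100 * int(on_preferred))


# Hypothetical slots: (name, machine_rank, remote_owner, machine).
slots = [
    ("slot1@liszt", 0, None, "liszt"),
    ("slot1@chopin", 0, None, "chopin"),
    ("slot2@chopin", 0, "other_user", "chopin"),
]

# Higher score is preferred among slots that already match the job.
best = max(slots, key=lambda s: pre_job_rank(s[1], s[2], s[3]))
print(best[0])
```

In this model an unclaimed chopin slot scores 100 higher than an unclaimed liszt slot. The expression only orders slots that already match the job, so on its own it would not make a matching slot unusable.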