Hello Larry,

Did you change any network settings when you added the 10Gb link? What happens if you disconnect the 10Gb link?
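If it helps, one quick way to see what the new link changed is to compare the interfaces and routes on each execute node with the 10Gb link connected and then disconnected (the schedd address below is taken from your output; interface names will of course differ on your hosts):

    ip -4 addr                 # which interfaces carry 192.168.10.x and 192.168.11.x
    ip route                   # which interface the routes towards the schedd/collector use
    ping -c 3 192.168.10.2     # from chopin: is the schedd still reachable without the 10Gb link?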
You seem to be using two distinct networks: 192.168.10.x (scheduler) and 192.168.11.x (execute nodes). Is there any reason for not having all the nodes on the same network?
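As a minimal check (assuming the standard knob names; adjust if your configuration differs), you could look at which addresses the daemons actually advertise:

    condor_status -af Name MyAddress      # the address each startd reports to the collector
    condor_config_val NETWORK_INTERFACE   # run on chopin and on liszt; which NIC condor binds to
    condor_config_val CONDOR_HOST         # should point at the central manager from both hosts

The idea is just to confirm that the reboot (or the new point-to-point link) did not change which address condor picked up on each host.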
I'm not an expert at all, but I hope this input helps. I have my own questions coming soon on the list.
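Also, since you see a condor_shared_port process on chopin but not on liszt, it may be worth confirming that the two execute nodes really run with the same configuration and advertise comparable slots. A couple of standard commands (the slot names below are just examples; adjust them, and liszt of course has to be back in the pool for the second query):

    # run on each execute node and compare the answers:
    condor_config_val USE_SHARED_PORT

    # from anywhere in the pool, compare what the slots advertise:
    condor_status -long slot1@chopin | egrep 'Memory|Disk|State|Activity|Start'
    condor_status -long slot1@liszt  | egrep 'Memory|Disk|State|Activity|Start'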
Cheers,
Christophe.

On 20/02/2018 00:33, Larry Martell wrote:
Here is the output of condor_q -better-analyze:

-- Schedd: bach.elucid.local : <192.168.10.2:20734>
The Requirements expression for job 21283.000 is

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )

Job 21283.000 defines the following attributes:

    DiskUsage = 0
    ImageSize = 0
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)

The Requirements expression for job 21283.000 reduces to these conditions:

             Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]         132  TARGET.Arch == "X86_64"
    [1]         132  TARGET.OpSys == "LINUX"
    [3]         132  TARGET.Disk >= RequestDisk
    [5]         132  TARGET.Memory >= RequestMemory
    [7]         132  TARGET.HasFileTransfer

No successful match recorded.
Last failed match: Mon Feb 19 18:30:13 2018
Reason for last match failure: no match found

21283.000: Run analysis summary ignoring user priority. Of 132 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

On Mon, Feb 19, 2018 at 6:09 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:

If I grep in all the logs for one of the ids for a job in the queue I see this in the MatchLog:

    Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4> preempting none <192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3> slot1@chopin

This message repeats 132 times (once for each slot on chopin) and then I see this:

    Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>: no match found

That sequence of 133 messages repeats over and over. Is this a clue to anything?

On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:

If I run ps -efal | grep condor on the 2 execute hosts, the only difference is that chopin (the one I cannot get condor to use) has this:

    condor_shared_port

That is not on liszt. Is that an issue?

On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:

On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

> On 2/19/2018 4:10 PM, Larry Martell wrote:
> > As a test I removed liszt from the config and it still will not use chopin,
> > even though a status shows all slots as 'Unclaimed Idle'.
>
> What do you mean when you say "removed liszt from the config"?

I set NUM_SLOTS to 0.

> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on liszt)? So now when you do a "condor_status", the only thing you see is slots on chopin?

Correct.

> And yet your job remains idle and refuses to run on chopin?

Correct.

> Maybe your job does not match any slots on chopin

What makes a match? Before the reboot the same jobs I am trying to run today were running on chopin and, when that was full, ran on liszt.

> -- perhaps because chopin has less memory than liszt, for instance.

The 2 machines are identical in every way - CPU, memory, etc. The reason they were rebooted was that a 10G point to point connection was installed between them.

> See the manual section "Why is the job not running?"
> at http://tinyurl.com/ycbut82r

That tiny url does not work, but I've been looking at this page:
http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
and here is the output of condor_q -analyze:verbose when run on the master:

-- Schedd: bach.elucid.local : <192.168.10.2:20734>
No successful match recorded.
Last failed match: Mon Feb 19 17:19:52 2018
Reason for last match failure: no match found

21283.000: Run analysis summary ignoring user priority. Of 132 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:

I have a master and 2 execute hosts (chopin and liszt), and I have one host (chopin) preferred over the other with these settings:

    NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
                              (1000000 * (RemoteOwner =?= UNDEFINED)) + \
                              (100 * Machine =?= "chopin")
    NEGOTIATOR_DEPTH_FIRST = True

The preferred host is chopin. This had been working fine until Friday, when both execute hosts were rebooted. Since then condor will only run jobs on liszt. Even if there are more jobs in the queue than slots on liszt, it will not use chopin. A condor_status shows all the slots on chopin as 'Unclaimed Idle'. I see all the proper daemons running and no errors in the logs. How can I debug and/or fix this?

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/
--
Christophe DIARRA
Institut de Physique Nucleaire
15 Rue Georges Clemenceau
S2I/D2I - Bat 100A - Piece A108
F91406 ORSAY Cedex
Tel: +33 (0)1 69 15 65 60 / +33 (0)6 31 26 23 69
Fax: +33 (0)1 69 15 64 70
E-mail: diarra@xxxxxxxxxxxxx