Mailing List Archives
Re: [HTCondor-users] HTCondor doesn't run a job after some time
- Date: Tue, 15 Jun 2021 19:16:20 +0300 (MSK)
- From: "Dmitry A. Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor doesn't run a job after some time
Dear all,
Sorry, this was my mistake. The job was not started because the condition NumJobStarts == 0 was not met.
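For anyone who hits the same symptom, here is a minimal sketch of why such a guard keeps a job idle even though the slots are free. It models the condition over plain Python dictionaries rather than real ClassAds. The job records and the helper name passes_first_start_guard are hypothetical, and this message does not say whether the condition lived in the job requirements or in the machine policy.

```python
# Hypothetical job records: only the attribute name NumJobStarts comes from
# the thread; the values below are made up for illustration.
jobs = [
    {"JobId": "91.0", "NumJobStarts": 0},  # never started yet
    {"JobId": "91.4", "NumJobStarts": 1},  # has already been started once
]


def passes_first_start_guard(job):
    """Return True only when the job has never been started.

    A missing attribute is treated as "does not match" (an assumption of
    this sketch, not a statement about ClassAd evaluation).
    """
    return job.get("NumJobStarts") == 0


# Jobs that a "NumJobStarts == 0" guard would still allow to start.
eligible = [job["JobId"] for job in jobs if passes_first_start_guard(job)]
print(eligible)
```

Once a job has been started even once, a guard like this never lets it start again, so it can sit idle next to unclaimed slots.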
----- Original Message -----
From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Sent: Tuesday, June 15, 2021 6:38:42 PM
Subject: [HTCondor-users] HTCondor doesn't run a job after some time
Dear all,
After several successive runs, HTCondor ended up in a strange state:
---
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:38
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:39
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:39
             Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX     3     0       0         3       0          0        0     0
       Total     3     0       0         3       0          0        0     0
-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxxxxxx : <10.42.0.171:46693?... @ 06/15/21 15:24:40
OWNER     BATCH_NAME SUBMITTED  DONE RUN IDLE HOLD TOTAL JOB_IDS
user20001 ID: 91     6/15 14:53    _   _    5    _     5 91.0-4
Total for query: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
Total for all users: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
----
As you can see, I have three partitionable slots that are completely free and one job in the idle state that could be started, but nothing has happened for a long time (I have waited for 30 minutes). In the condor_q details I found:
---
         Slots
Step    Matched  Condition
-----  --------  ---------
[2]           3  OpSys == "LINUX"
[5]           3  Arch == "X86_64"
[7]           3  DA__P7__RUNENV_PYTHON3 >= 13
[9]           3  DA__P7__CLUSTER_NODE == "True"
[11]          3  TARGET.Disk >= RequestDisk
[13]          3  TARGET.Memory >= RequestMemory
[15]          3  TARGET.FileSystemDomain == MY.FileSystemDomain
No successful match recorded.
Last failed match: Tue Jun 15 14:55:24 2021
Reason for last match failure: PREEMPTION_REQUIREMENTS == False
091.004: Run analysis summary ignoring user priority. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
3 are able to run your job
---
All previous jobs finished successfully, and I use dynamic slots to run the jobs. Any ideas?
Thanks in advance,
Dmitry.