Re: [HTCondor-users] job has all necessary slots claimed but still is idle
Hi David,
Thank you for your reply!
> You have slots that are claimed and stay idle and you have enough resources available for that job?
Yes. Sometimes a job runs without any problems, sometimes it claims all necessary slots and then tries to claim more. Sometimes it tries to start and fails, then tries a second time and runs OK.
My first thought was that there is some problem with resource quantization; I have indeed changed MODIFY_REQUEST_EXPR_REQUESTMEMORY and CONSUMPTION_MEMORY:
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {64})
CONSUMPTION_MEMORY = quantize(target.RequestMemory,{64})
but resetting them to default values doesn't make any difference.
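(For reference, the values the daemons actually pick up can be checked with condor_config_val on the relevant host, e.g.:)
condor_config_val MODIFY_REQUEST_EXPR_REQUESTMEMORY
condor_config_val CONSUMPTION_MEMORY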
There are some errors in the logs. This time I ran the same job twice. The first time it ran without any problem. The second time it tried to run but failed; after some time it made a second attempt and then ran.
2021-06-15T05:29:18.888907541Z condor_shadow[1175]: DaemonCore: command socket at <10.244.0.161:35215?addrs=10.244.0.161-35215&alias=submit.p7-htcondor&noUDP&sock=shadow_153_8d5f_2>
2021-06-15T05:29:18.889071531Z condor_shadow[1175]: DaemonCore: private command socket at <10.244.0.161:35215?addrs=10.244.0.161-35215&alias=submit.p7-htcondor&noUDP&sock=shadow_153_8d5f_2>
2021-06-15T05:29:18.899472741Z condor_shadow[1175]: Initializing a PARALLEL shadow for job 2.0
2021-06-15T05:29:18.901317447Z condor_shadow[1175]: LIMIT_DIRECTORY_ACCESS = <unset>
2021-06-15T05:29:18.909500761Z condor_collector[51]: QueryWorker: forked new worker with id 798 (max 4 active 1 pending 0)
2021-06-15T05:29:18.911321126Z condor_collector[51]: Got QUERY_SCHEDD_ADS
2021-06-15T05:29:18.911348173Z condor_collector[51]: (Sending 1 ads in response to query)
2021-06-15T05:29:18.911354100Z condor_collector[51]: Query info: matched=1; skipped=0; query_time=0.000145; send_time=0.000089; type=Scheduler; requirements={((TotalRunningJobs > 0 || TotalIdleJobs > 0 || TotalHeldJobs > 0 || TotalRemovedJobs > 0 || TotalJobAds > 0))}; locate=0; limit=0; from=TOOL; peer=<10.244.0.161:39473>; projection={ScheddIpAddr CondorVersion Name Machine}; filter_private_ads=0
2021-06-15T05:29:18.911358189Z condor_collector[798]: (Sending 4 ads in response to query)
2021-06-15T05:29:18.911362278Z condor_collector[798]: Query info: matched=4; skipped=0; query_time=0.000409; send_time=0.000212; type=Machine; requirements={true}; locate=0; limit=0; from=TOOL; peer=<10.244.0.162:33279>; projection={Activity Cpus JobId Machine Memory Name OpSys State}; filter_private_ads=0
2021-06-15T05:29:18.946981574Z condor_schedd[153]: ZKM: In Condor_Auth_Passwd::wrap.
2021-06-15T05:29:18.947296808Z condor_schedd[153]: Number of Active Workers 0
2021-06-15T05:29:19.916226397Z condor_startd[167]: Error: can't find resource with ClaimId (<10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5>#1623734651#1#...) for 444 (ACTIVATE_CLAIM)
2021-06-15T05:29:19.916588962Z condor_shadow[1175]: Request to run on slot1_1@p7-7s4bb.p7-htcondor <10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5> was REFUSED
2021-06-15T05:29:19.917372796Z condor_shadow[1175]: Job 2.0 is being evicted
2021-06-15T05:29:19.931164599Z condor_startd[167]: Error: can't find resource with ClaimId (<10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5>#1623734651#1#...) -- perhaps this claim was already removed?
2021-06-15T05:29:19.931239776Z condor_startd[167]: Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
2021-06-15T05:29:19.931358337Z condor_startd[122]: Failed to write ToE tag to .job.ad file (2): No such file or directory
2021-06-15T05:29:19.931404129Z condor_startd[122]: slot1_1: Called deactivate_claim()
2021-06-15T05:29:19.934665824Z condor_shadow[1175]: RemoteResource::killStarter(): Could not send command to startd
2021-06-15T05:29:19.934711905Z condor_shadow[1175]: logEvictEvent with unknown reason (108), not logging.
2021-06-15T05:29:19.939935680Z condor_shadow[1175]: **** condor_shadow (condor_SHADOW) pid 1175 EXITING WITH STATUS 108
2021-06-15T05:29:19.940043649Z condor_schedd[153]: In DedicatedScheduler::reaper pid 1175 has status 27648
2021-06-15T05:29:19.940129064Z condor_schedd[153]: Shadow pid 1175 exited with status 108
2021-06-15T05:29:19.940137774Z condor_schedd[153]: Dedicated job abnormally ended, releasing claim
2021-06-15T05:29:19.940159496Z condor_schedd[153]: Dedicated job abnormally ended, releasing claim
2021-06-15T05:29:19.941444269Z condor_startd[167]: Can't read ClaimId
2021-06-15T05:29:19.944025188Z condor_startd[122]: ZKM: In Condor_Auth_Passwd::wrap.
2021-06-15T05:29:19.944574506Z condor_schedd[153]: ZKM: In Condor_Auth_Passwd::unwrap.
2021-06-15T05:29:19.945556145Z condor_startd[122]: Can't read ClaimId
2021-06-15T05:29:19.942080482Z condor_startd[167]: condor_write(): Socket closed when trying to write 29 bytes to <10.244.0.161:37487>, fd is 10
2021-06-15T05:29:19.942098016Z condor_startd[167]: Buf::write(): condor_write() failed
It seems the problem is in CLAIM_WORKLIFE = 0 (and maybe in UNUSED_CLAIM_TIMEOUT = 120 as well): the claim was released because of CLAIM_WORKLIFE (probably too quickly) and the schedd didn't know about it.
I want unused claims to be released quickly so that other jobs with different requirements can start sooner. Is there any recommendation on how to choose a value for CLAIM_WORKLIFE?
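For context, this is what I have now, plus the kind of change I am considering (the 300 is just a guess on my part, which is exactly what I am asking about):
CLAIM_WORKLIFE = 0
UNUSED_CLAIM_TIMEOUT = 120
# e.g. CLAIM_WORKLIFE = 300 to keep claims reusable for a few minutes?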
Best regards,
Stanislav Markevich
----- Original Message -----
From: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
Sent: Monday, 14 June, 2021 21:33:30
Subject: Re: [HTCondor-users] job has all necessary slots claimed but still is idle
Stanislav.
You have slots that are claimed and stay idle and you have enough resources available for that job?
If this is the case I would recommend turning on full debug on the startd daemon.
I have seen misconfigurations in the past that can cause this.
For example, if you have quantize set for memory or disk.
If I understand correctly there is a match: the partitionable slot on the execute node has enough resources for the job, so it creates a dynamic slot, but the dynamic slot does not have enough resources because of something. You need to find that something.
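For example, to get full startd debugging something like this should do (set it in the execute node's config and reconfig; adjust to your setup):
STARTD_DEBUG = D_FULLDEBUG
# then run condor_reconfig on that node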
Good luck
David
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, June 14, 2021 7:43:40 PM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Cc: Stanislav V. Markevich <stanislav.markevich@xxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] job has all necessary slots claimed but still is idle
Hi,
I performed additional tests: I tried HTCondor 8.9.11 and 9.0.1 and different combinations of CLAIM_PARTITIONABLE_LEFTOVERS and CONSUMPTION_POLICY (False-True, True-False, False-False), and the problem was always there.
For some jobs the schedd requests and claims additional slots that it won't use. If there are resources for such additional slots, the job will run; if not, the job stays IDLE forever.
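The extra claimed-but-idle slots can be seen while the job sits there, with something like this (the attribute list is just what I happen to look at):
condor_status -claimed
condor_status -af:h Name State Activity Cpus Memory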
Any ideas?
Best regards,
Stanislav Markevich
----- Original Message -----
From: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
Sent: Thursday, 10 June, 2021 23:15:54
Subject: [HTCondor-users] job has all necessary slots claimed but still is idle
Hi all,
My job is stuck in the IDLE state and I cannot understand why.
It's the only job:
-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... @ 06/10/21 18:26:37
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
user20002 ID: 49 6/10 17:06 _ _ 2 2 49.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
There are suitable slots (dynamic slots created for this job):
Name OpSys State Activity Cpus Memory
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 75 1544
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 25 256
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 0 8
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 100 1792
But "condor_q -better-analyze" says there is no successful match (and last failed match was an hour ago):
-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?...
The Requirements expression for job 49.000 is
...
Job 49.000 defines the following attributes:
DiskUsage = 0
FileSystemDomain = "p7-htcondor"
NumJobStarts = 0
RequestCpus = 25
RequestDisk = DiskUsage
RequestMemory = 256
The Requirements expression for job 49.000 reduces to these conditions:
...
No successful match recorded.
Last failed match: Thu Jun 10 17:26:33 2021
Reason for last match failure: PREEMPTION_REQUIREMENTS == False
049.000: Run analysis summary ignoring user priority. Of 2 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
The Requirements expression for job 49.001 is
...
Job 49.001 defines the following attributes:
DiskUsage = 42
FileSystemDomain = "p7-htcondor"
NumJobStarts = 0
RequestCpus = 100
RequestDisk = DiskUsage
RequestMemory = 1792
The Requirements expression for job 49.001 reduces to these conditions:
...
No successful match recorded.
Last failed match: Thu Jun 10 17:26:33 2021
Reason for last match failure: no match found
049.001: Run analysis summary ignoring user priority. Of 2 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
And there are successful matches in logs:
17:57:50.766 condor_negotiator[51]: ---------- Started Negotiation Cycle ----------
17:57:50.766 condor_negotiator[51]: Phase 1: Obtaining ads from collector ...
17:57:50.766 condor_negotiator[51]: Getting startd private ads ...
17:57:50.771 condor_negotiator[51]: Getting Scheduler, Submitter and Machine ads ...
17:57:50.777 condor_negotiator[51]: Sorting 4 ads ...
17:57:50.777 condor_negotiator[51]: Got ads: 4 public and 2 private
17:57:50.777 condor_negotiator[51]: Public ads include 1 submitter, 2 startd
17:57:50.777 condor_negotiator[51]: Phase 2: Performing accounting ...
17:57:50.789 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
17:57:50.789 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
17:57:50.789 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
17:57:50.789 condor_negotiator[51]: Phase 3: Sorting submitter ads by priority ...
17:57:50.789 condor_negotiator[51]: Starting prefetch round; 1 potential prefetches to do.
17:57:50.789 condor_negotiator[51]: Starting prefetch negotiation for DedicatedScheduler@parallel_schedd@submit.p7-htcondor.
17:57:50.791 condor_schedd[143]: Finished sending rrls to negotiator
17:57:50.791 condor_schedd[143]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 0 matched, 0 rejected
17:57:50.791 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
17:57:50.791 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
17:57:50.791 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
17:57:50.791 condor_negotiator[51]: Got NO_MORE_JOBS; schedd has no more requests
17:57:50.791 condor_negotiator[51]: Prefetch summary: 1 attempted, 1 successful.
17:57:50.791 condor_negotiator[51]: Phase 4.1: Negotiating with schedds ...
17:57:50.791 condor_negotiator[51]: Negotiating with DedicatedScheduler@parallel_schedd@submit.p7-htcondor at <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6>
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this submitter
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this schedd
17:57:50.791 condor_negotiator[51]: Request 00049.00000: autocluster -1 (request count 1 of 0)
17:57:50.791 condor_negotiator[51]: Matched 49.0 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.17:33779?addrs=10.42.0.17-33779&alias=p7-htcondorexecute-gb8z4.p7-htcondor&noUDP&sock=startd_99_cc5f> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.800 condor_startd[135]: slot1_1: New machine resource of type -1 allocated
17:57:50.801 condor_startd[135]: slot1_1: Request accepted.
17:57:50.802 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.804 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.805 condor_schedd[143]: Found 2 potential dedicated resources in 0 seconds
17:57:50.805 condor_schedd[143]: Adding submitter DedicatedScheduler@parallel_schedd@submit.p7-htcondor to the submitter map for default pool.
17:57:50.807 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.807 condor_negotiator[51]: Request 00049.00001: autocluster -1 (request count 1 of 0)
17:57:50.807 condor_negotiator[51]: Matched 49.1 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.19:37885?addrs=10.42.0.19-37885&alias=p7-htcondorexecute-vxbqp.p7-htcondor&noUDP&sock=startd_87_913c> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.811 condor_startd[123]: slot1_1: New machine resource of type -1 allocated
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.818 condor_startd[135]: slot1_1: Remote owner is user20002@p7-htcondor
17:57:50.819 condor_startd[135]: slot1_1: State change: claiming protocol successful
17:57:50.819 condor_startd[135]: slot1_1: Changing state: Owner -> Claimed
17:57:50.819 condor_startd[123]: slot1_1: Request accepted.
17:57:50.826 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.826 condor_negotiator[51]: negotiateWithGroup resources used submitterAds length 0
17:57:50.827 condor_negotiator[51]: ---------- Finished Negotiation Cycle ----------
Maybe this is important: in the configuration files I have:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
CONSUMPTION_POLICY = TRUE
Am I missing something? Is this a bug?
PS: after two hours of waiting I submitted another job, and surprisingly both jobs started running pretty soon.
Best regards,
Stanislav V. Markevich
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/