[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] job has all necessary slots claimed but still is idle



Stanislav. 

You have slots that are claimed and stay idle and you have enough resources avaliable for that job? 
If this is the case i would recommend  turn on full debug on the startd daemon. 

I have seen misconfiguration in the past that might affect. 
For example if you have quantize for memory or disk. 

If i understand correctly there is match the partitionable slot on the executer has enough resources for the job, it creates a dynamic slot but the dynamic slot does not have enough resources beacuse of something. You need to find that something

Good luck 
David



Get Outlook for Android


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, June 14, 2021 7:43:40 PM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Cc: Stanislav V. Markevich <stanislav.markevich@xxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] job has all necessary slots claimed but still is idle
 
Hi,

I performed additional tests, tried HTCondor 8.9.11 and 9.0.1, tried different combinations of setting CLAIM_PARTITIONABLE_LEFTOVERS and CONSUMPTION_POLICY (False-True, True-False, False-False) and the problem was always there.
For some jobs schedd request and claims additional slots it won't use. If there are resources for such additional slots - the job will run, if no - the job will be IDLE forever.

Any ideas?


Best regards,
Stanislav Markevich


----- Original Message -----
From: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
Sent: Thursday, 10 June, 2021 23:15:54
Subject: [HTCondor-users] job has all necessary slots claimed but still is      idle

Hi all,

My job stuck in IDLE state and I cannot understand why.


It's the only job:

    -- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... @ 06/10/21 18:26:37
    OWNER     BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
    user20002 ID: 49       6/10 17:06      _      _      2      2 49.0-1

    Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
    Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended


There are suitable slots (dynamic slots created for this job):

    Name                         OpSys State     Activity Cpus Memory
    slot1@xxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     75   1544 
    slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     25   256  
    slot1@xxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     0    8    
    slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     100  1792


But "condor_q -better-analyze" says there is no successful match (and last failed match was an hour ago):

    -- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?...
    The Requirements _expression_ for job 49.000 is
    ...

    Job 49.000 defines the following attributes:

        DiskUsage = 0
        FileSystemDomain = "p7-htcondor"
        NumJobStarts = 0
        RequestCpus = 25
        RequestDisk = DiskUsage
        RequestMemory = 256

    The Requirements _expression_ for job 49.000 reduces to these conditions:
    ...

    No successful match recorded.
    Last failed match: Thu Jun 10 17:26:33 2021

    Reason for last match failure: PREEMPTION_REQUIREMENTS == False

    049.000:  Run analysis summary ignoring user priority.  Of 2 machines,
          0 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match and are already running your jobs
          0 match but are serving other users
          2 are able to run your job

    The Requirements _expression_ for job 49.001 is
    ...

    Job 49.001 defines the following attributes:

        DiskUsage = 42
        FileSystemDomain = "p7-htcondor"
        NumJobStarts = 0
        RequestCpus = 100
        RequestDisk = DiskUsage
        RequestMemory = 1792

    The Requirements _expression_ for job 49.001 reduces to these conditions:
    ...

    No successful match recorded.
    Last failed match: Thu Jun 10 17:26:33 2021

    Reason for last match failure: no match found

    049.001:  Run analysis summary ignoring user priority.  Of 2 machines,
          0 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match and are already running your jobs
          0 match but are serving other users
          2 are able to run your job


And there are successful matches in logs:

    17:57:50.766 condor_negotiator[51]: ---------- Started Negotiation Cycle ----------
    17:57:50.766 condor_negotiator[51]: Phase 1:  Obtaining ads from collector ...
    17:57:50.766 condor_negotiator[51]:   Getting startd private ads ...
    17:57:50.771 condor_negotiator[51]:   Getting Scheduler, Submitter and Machine ads ...
    17:57:50.777 condor_negotiator[51]:   Sorting 4 ads ...
    17:57:50.777 condor_negotiator[51]: Got ads: 4 public and 2 private
    17:57:50.777 condor_negotiator[51]: Public ads include 1 submitter, 2 startd
    17:57:50.777 condor_negotiator[51]: Phase 2:  Performing accounting ...
    17:57:50.789 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
    17:57:50.789 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
    17:57:50.789 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
    17:57:50.789 condor_negotiator[51]: Phase 3:  Sorting submitter ads by priority ...
    17:57:50.789 condor_negotiator[51]: Starting prefetch round; 1 potential prefetches to do.
    17:57:50.789 condor_negotiator[51]: Starting prefetch negotiation for DedicatedScheduler@parallel_schedd@submit.p7-htcondor.
    17:57:50.791 condor_schedd[143]: Finished sending rrls to negotiator
    17:57:50.791 condor_schedd[143]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 0 matched, 0 rejected
    17:57:50.791 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
    17:57:50.791 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
    17:57:50.791 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
    17:57:50.791 condor_negotiator[51]:     Got NO_MORE_JOBS;  schedd has no more requests
    17:57:50.791 condor_negotiator[51]: Prefetch summary: 1 attempted, 1 successful.
    17:57:50.791 condor_negotiator[51]: Phase 4.1:  Negotiating with schedds ...
    17:57:50.791 condor_negotiator[51]:   Negotiating with DedicatedScheduler@parallel_schedd@submit.p7-htcondor at <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6>
    17:57:50.791 condor_negotiator[51]: 0 seconds so far for this submitter
    17:57:50.791 condor_negotiator[51]: 0 seconds so far for this schedd
    17:57:50.791 condor_negotiator[51]:     Request 00049.00000: autocluster -1 (request count 1 of 0)
    17:57:50.791 condor_negotiator[51]:       Matched 49.0 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.17:33779?addrs=10.42.0.17-33779&alias=p7-htcondorexecute-gb8z4.p7-htcondor&noUDP&sock=startd_99_cc5f> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.800 condor_startd[135]: slot1_1: New machine resource of type -1 allocated
    17:57:50.801 condor_startd[135]: slot1_1: Request accepted.
    17:57:50.802 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.804 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.805 condor_schedd[143]: Found 2 potential dedicated resources in 0 seconds
    17:57:50.805 condor_schedd[143]: Adding submitter DedicatedScheduler@parallel_schedd@submit.p7-htcondor to the submitter map for default pool.
    17:57:50.807 condor_negotiator[51]:       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.807 condor_negotiator[51]:     Request 00049.00001: autocluster -1 (request count 1 of 0)
    17:57:50.807 condor_negotiator[51]:       Matched 49.1 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.19:37885?addrs=10.42.0.19-37885&alias=p7-htcondorexecute-vxbqp.p7-htcondor&noUDP&sock=startd_87_913c> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.811 condor_startd[123]: slot1_1: New machine resource of type -1 allocated
    17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.818 condor_startd[135]: slot1_1: Remote owner is user20002@p7-htcondor
    17:57:50.819 condor_startd[135]: slot1_1: State change: claiming protocol successful
    17:57:50.819 condor_startd[135]: slot1_1: Changing state: Owner -> Claimed
    17:57:50.819 condor_startd[123]: slot1_1: Request accepted.
    17:57:50.826 condor_negotiator[51]:       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.826 condor_negotiator[51]:  negotiateWithGroup resources used submitterAds length 0
    17:57:50.827 condor_negotiator[51]: ---------- Finished Negotiation Cycle ----------

Maybe this is important, in the configuration files I have:
    CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
    CONSUMPTION_POLICY = TRUE

Am I missing something? Is this a bug?


PS: after two hours of waiting I submitted another job and surprisingly both jobs ran pretty soon.


Best regards,
Stanislav V. Markevich
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/