
Re: [HTCondor-users] Trouble getting jobs to run even though there is a valid machine.



Hi Cole,

    Thank you so much for your suggestion. Unfortunately, I get undefined when I try that command:
PS C:\Users\pat> condor_q 5.0 -af NumShadowStarts
undefined

Looking at the "condor_q --help" output, I didn't see an option for "NumShadowStarts". I'm running Condor 24.5.1.

Best regards,
Patrick


Patrick Claflin
System Administrator/Developer
Clemson Center for Geospatial Technologies (www.clemsongis.org)
Clemson University
(864) 656-7462
pat@xxxxxxxxxxx
On 3/7/2025 10:24 AM, Cole Bollig via HTCondor-users wrote:
Hi Patrick,

Since there is a matching slot, your job(s) may be cycling between startup and failure. What does condor_q <cluster id> -af NumShadowStarts say? If this is non-zero, the jobs are matching and starting but failing quickly (especially if you see a large number).
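As a rough illustration of how the printed value could be read, here is a minimal Python sketch. The function name and the wording of the messages are hypothetical. It assumes that condor_q -af prints the literal text "undefined" when the attribute is not present in the job ad, and it makes no claim about why the attribute might be missing.

```python
def interpret_num_shadow_starts(raw: str) -> str:
    """Classify the text printed by `condor_q <id> -af NumShadowStarts`.

    Hypothetical helper for illustration only; the returned messages are
    an informal reading of the value, not HTCondor output.
    """
    value = raw.strip()
    if value.lower() == "undefined":
        # The attribute is not present in the job ad at all.
        return "attribute not set in the job ad"
    count = int(value)
    if count == 0:
        return "no shadow starts recorded"
    # A non-zero count means the job matched and started at least once.
    return f"{count} shadow start(s) recorded"


print(interpret_num_shadow_starts("undefined"))
```

A large count would point toward the start-and-fail cycle described above, while "undefined" or 0 would suggest looking earlier in the matchmaking and claiming steps.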

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Patrick Claflin <pat@xxxxxxxxxxx>
Sent: Friday, March 7, 2025 8:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Beyer, Christoph <christoph.beyer@xxxxxxx>
Subject: Re: [HTCondor-users] Trouble getting jobs to run even though there is a valid machine.
 
Hi Christoph,

    Thanks for responding! I did try the -better-analyze option and it shows me that I have an eligible machine. I tried your suggestion with the "reverse" option and it looks like it's ready to run the job but will not. Have you ever encountered an issue like this? Thanks again for your suggestions.

Best regards,
Patrick

PS C:\Users\pat> condor_q -better-analyze:reverse -machine ASG-PAT-7080.CAMPUS.CU.CLEMSON.EDU


-- Schedd: ASG-PAT-7080.CAMPUS.CU.CLEMSON.EDU : <130.127.55.243:9618?...

-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    Start &&
    (WithinResourceLimits)

  START is
    true

  WithinResourceLimits is
    (MY.Cpus > 0 &&
      TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
      TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
      TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
        MY.GPUs >= TARGET.RequestGPUs))

    [0]    : Start
    [1]    : MY.Cpus > 0
    [2]    : TARGET.RequestCpus <= MY.Cpus
    [3]    : [1] && [2]
    [4]    : MY.Memory > 0
    [5]    : [3] && [4]
    [6]    : TARGET.RequestMemory <= MY.Memory
    [7]    : [5] && [6]
    [8]    : MY.Disk > 0
    [9]    : [7] && [8]
    [10]   : TARGET.RequestDisk <= MY.Disk
    [11]   : [9] && [10]
    [12]   : TARGET.RequestGPUs is undefined
    [13]   : MY.GPUs >= TARGET.RequestGPUs
    [14]   : [12] || [13]
    [15]   : [11] && [14]
    [16]   : [0] && [15]

This slot defines the following attributes:

    Cpus = 16
    Disk = 21436520
    GPUs = 1
    Memory = 32480

Job 4.0 has the following attributes:

    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 10240
    TARGET.RequestMemory = 32

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step   Matched  Condition
----- --------- ---------
[2]           1  TARGET.RequestCpus <= MY.Cpus
[6]           1  TARGET.RequestMemory <= MY.Memory
[10]          1  TARGET.RequestDisk <= MY.Disk
[12]          1  TARGET.RequestGPUs is undefined

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.
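The WithinResourceLimits check in the output above can be reproduced by hand with the listed attribute values. The following is a minimal Python sketch, not HTCondor's own evaluator: the function name is hypothetical, an undefined RequestGPUs is modeled as a missing dictionary key, and plain Python booleans stand in for ClassAd evaluation, so the ClassAd handling of undefined values in other terms is not modeled.

```python
def within_resource_limits(slot: dict, job: dict) -> bool:
    """Mirror the slot's WithinResourceLimits expression with plain booleans."""
    request_gpus = job.get("RequestGPUs")  # None models "undefined"
    return (
        slot["Cpus"] > 0
        and job["RequestCpus"] <= slot["Cpus"]
        and slot["Memory"] > 0
        and job["RequestMemory"] <= slot["Memory"]
        and slot["Disk"] > 0
        and job["RequestDisk"] <= slot["Disk"]
        and (request_gpus is None or slot["GPUs"] >= request_gpus)
    )


# Values copied from the analysis output above.
slot = {"Cpus": 16, "Disk": 21436520, "GPUs": 1, "Memory": 32480}
job = {"RequestCpus": 1, "RequestDisk": 10240, "RequestMemory": 32}

print(within_resource_limits(slot, job))
```

With these values every term holds, which agrees with the summary line reporting that the job matches the slot's requirements.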



Patrick Claflin
System Administrator/Developer
Clemson Center for Geospatial Technologies (www.clemsongis.org)
Clemson University
(864) 656-7462
pat@xxxxxxxxxxx
On 3/7/2025 2:01 AM, Beyer, Christoph wrote:
Hi,

It sounds as if a requirement of the job or the machine is not fulfilled - have you tried

'condor_q -better-analyze <job-id>'

with one of the idle jobs?

Also, 'condor_q -better-analyze <job-id> -reverse -machine <FQDN of a possible workernode>' can be enlightening in similar situations ...

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Pat Claflin" <pat@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, March 6, 2025 23:28:45
Subject: [HTCondor-users] Trouble getting jobs to run even though there is a valid machine.

Hi,

    I wanted to email my issue here in hopes that someone might have an idea on how to solve our problem.

    We are trying to get a working HTCondor solution in place with a Linux VM as the Central Manager (CM) and our lab machines, which run Windows 11. We have done this in the past, but with a very old version of HTCondor (before 9.0.x) and with insecure authentication settings.

    We have downloaded and installed one of the latest HTCondor software versions (23.9.6). Initially we had issues trying to authenticate from our Windows machines to our Linux CM with IDTOKENS, so we turned to SSL as the primary authentication method and were able to access the CM from our lab machines.

    Our latest (and hopefully final) issue is trying to get jobs to run on available machines instead of going immediately to "Idle". For some reason, even though there is an eligible machine that can run the job, the job stays in the "Idle" status until it is forcibly removed with "condor_rm". This had been happening consistently for weeks while I was working on it at the end of 2024.

    I finally came back to the problem this month and reinstalled Condor with a newer version (the CM is still running the same 23.9.6). When I submitted a job it actually worked and ran to completion! But then it resumed its old habit of going idle and not running, even though it was the exact same job and no changes to any configurations were made. I haven't been able to recreate a successful submission again.

    I tried uninstalling and reinstalling Condor thinking that might be the link, but it still didn't run any jobs. I only have one machine (besides the CM) in the pool as of now, but when I enter the "condor_q -better-analyze" command it shows that that machine is eligible.

I have all the logs and config files that will show this. I am hoping someone can help us solve this problem. Any help or advice on what we can try next would be greatly appreciated. I am attaching logs and command output that I think are relevant for this issue. If you'd like any more documentation please let me know and I'll be happy to generate it.

Thank you so much for taking the time to read this.

Best regards,
Patrick


Here is a list of file attachments and their significance to the issue (these are all client-side logs):
--------------------------------------------------------
File)) log_directory_of_idle_submission.png >>> Screenshot of the log directory where I can't get a job to run. No "StarterLog.slotx_x" or "startd_history" files are ever created.

File)) log_directory_of_successful_submission.png >>> Screenshot of the log directory where it successfully ran a submitted job.

File)) commands_showing_idle_job.txt >>> Commands I issued before and after the job was submitted. "condor_q -better-analyze" shows there is an eligible machine.

File)) SchedLog_with_idle_jobs.txt >>> SchedLog showing all jobs going idle.

File)) SchedLog_with_successful_job.txt >>> SchedLog showing a successful job running. Timestamp at 03/06/25 14:41 is when it was submitted.

File)) MasterLog_with_idle_jobs.txt >>> MasterLog showing all jobs going idle.

File)) MasterLog_with_successful_job.txt >>> MasterLog showing a successful job running. Timestamp at 03/06/25 14:41 is when it was submitted.



--


Patrick Claflin
System Administrator/Developer
Clemson Center for Geospatial Technologies (www.clemsongis.org)
Clemson University
(864) 656-7462
pat@xxxxxxxxxxx

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

Join us in June at Throughput Computing 25: https://osg-htc.org/htc25

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
