
Re: [HTCondor-users] [External] scheduler universe job won't start on the submission machine



Ah, OK, I think I have found the problem!
In the documentation for START_SCHEDULER_UNIVERSE:
"A boolean value that defaults to TotalSchedulerJobsRunning < 500. The condor_schedd uses this macro to determine whether to start a scheduler universe job. At intervals determined by SCHEDD_INTERVAL, the condor_schedd daemon evaluates this macro for each idle scheduler universe job that it has. For each job, if the START_SCHEDULER_UNIVERSE macro is True, then the job's Requirements expression is evaluated. If both conditions are met, then the job is allowed to begin execution."


And I see this in the SchedLog:
Failed expression 'Requirements = (TARGET.TotalDisk >= 21000000) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)'
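
(For reference, a line like that can be found by searching the schedd's log; the path below is the usual default and may differ on your install:)

bash-4.2$ grep -i 'Failed expression' /var/log/condor/SchedLog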

After I fixed the Requirements expression, the job started to run.
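
(In case it helps: instead of resubmitting, the expression on an already-queued DAGMan job can also be relaxed in place with condor_qedit; a sketch, using the job id from [3] below, and TRUE only because the SchedLog shows the original expression failing when evaluated by the schedd:)

bash-4.2$ condor_qedit 543.0 Requirements TRUE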

Cheers!


Wendy/Wenjing
wuwj@xxxxxxxxx
 
From: wuwj@xxxxxxxxx
Date: 2023-01-12 16:56
To: htcondor-users
CC: Pelletier, Michael V. RTX; guhitj
Subject: Re: Re: [HTCondor-users] [External] scheduler universe job won't start on the submission machine
Hi Michael
Thanks for your response! And go blue!
Indeed, these jobs are DAG jobs, so we have to use the scheduler universe instead of local. 

I run condor_submit_dag on the DAG file [1], HTCondor generates the submission file [2], and then the job sits idle, never gets executed, and I do not see any matching job slot [3].



[1]
JOB training /xxx/training.sub
JOB evaluation /xxx/evaluation.sub
PARENT training CHILD evaluation

[2]
# Generated by condor_submit_dag 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag 
universe = scheduler
executable = /usr/bin/condor_dagman
getenv = True
output = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lib.out
error = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lib.err
log = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.dagman.log
remove_kill_sig = SIGUSR1
+OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False
arguments = "-p 0 -f -l . -Lockfile 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag -Suppress_notification -CsdVersion $CondorVersion:' '9.0.17' 'Oct' '04' '2022' 'PackageID:' '9.0.17-1.1' '$ -Dagman /usr/bin/condor_dagman"
environment = _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address;_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad;_CONDOR_DAGMAN_LOG=7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.dagman.out
queue

[3]
bash-4.2$ condor_q -better 543


-- Schedd: gc-7-31.aglt2.org : <10.10.1.114:9618?...
The Requirements expression for job 543.000 is

    ((TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true && TARGET.AGLT2_SITE == "UM")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)

Job 543.000 defines the following attributes:

    DiskUsage = 325
    RequestDisk = DiskUsage
    RequestMemory = 3200

The Requirements expression for job 543.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]       12210  TARGET.TotalDisk >= 21000000
[1]       12166  TARGET.IsSL7WN is true
[2]       12154  [0] && [1]
[3]        6478  TARGET.AGLT2_SITE == "UM"
[4]        6412  [2] && [3]

WARNING: Analysis is meaningless for Scheduler universe jobs.


543.000:  This schedd's StartSchedulerUniverse evalutes to true for this job.



543.000:  Run analysis summary ignoring user priority.  Of 1 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are able to run your job
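
(For completeness, the DAG itself was submitted with a plain condor_submit_dag call; illustrative, using the file name from [2]:)

bash-4.2$ condor_submit_dag 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag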


Wendy/Wenjing
wuwj@xxxxxxxxx
 
Date: 2023-01-12 16:23
Subject: Re: [HTCondor-users] [External] scheduler universe job won't start on the submission machine

Hi Wendy! Go blue! Have you tried running it in the "local" universe instead of the scheduler universe? This is in the docs:

However, unlike the local universe, the scheduler universe does not use a condor_starter daemon to manage the job, and thus offers limited features and policy support. The local universe is a better choice for most jobs which must run on the submit host, as it offers a richer set of job management features, and is more consistent with other universes such as the vanilla universe. The scheduler universe may be retired in the future, in favor of the newer local universe.
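
For a job that has to run on the submit host, the switch is just the universe line; a minimal sketch with placeholder names:

# local-universe submit file (placeholder names)
universe   = local
executable = my_job.sh
output     = my_job.out
error      = my_job.err
log        = my_job.log
queue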

 

However, are you using condor_submit_dag? DAGMan is what is intended to run in the scheduler universe, and that's handled internally, but the job nodes within the DAG would typically run in the vanilla universe. Where in the submission or DAG are you specifying the scheduler universe?
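
For comparison, condor_submit_dag puts only the DAGMan process itself in the scheduler universe; a node submit file such as training.sub would usually look something like this (a sketch, with placeholder paths):

# training.sub -- DAG node jobs normally run in the vanilla universe
universe   = vanilla
executable = /xxx/train.sh
output     = training.out
error      = training.err
log        = training.log
queue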

 

Michael V. Pelletier
Digital Technology
HPC Support Team
Raytheon Missiles and Defense

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of wuwj@xxxxxxxxx
Sent: Thursday, January 12, 2023 12:33 PM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] scheduler universe job won't start on the submission machine

 

Hi All, 

Our HTCondor version is 9.0.17. I have a DAG job whose universe is set to scheduler; after submitting, the job matches zero slots and sits idle forever. (I would expect it to be executed on the submission node.)

Here [1] is the condor_q output; I wonder if we are missing some configuration needed to support the scheduler universe? [2] is some configuration which might be relevant.

Cheers!

[1]

The Requirements expression for job 1283.000 is

    ((TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true && TARGET.AGLT2_SITE == "UM")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)

Job 1283.000 defines the following attributes:

    DiskUsage = 1000
    RequestDisk = DiskUsage
    RequestMemory = 3200

The Requirements expression for job 1283.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]       12044  TARGET.TotalDisk >= 21000000
[1]       11993  TARGET.IsSL7WN is true
[2]       11981  [0] && [1]
[3]        6427  TARGET.AGLT2_SITE == "UM"
[4]        6354  [2] && [3]

WARNING: Analysis is meaningless for Scheduler universe jobs.

1283.000:  This schedd's StartSchedulerUniverse evalutes to true for this job.

1283.000:  Run analysis summary ignoring user priority.  Of 1 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are able to run your job
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are able to run your job



[2]

ALWAYS_VM_UNIV_USE_NOBODY = false
DEFAULT_UNIVERSE =
ENABLE_KERNEL_TUNING = true
IsMPI = (TARGET.JobUniverse == $(MPI))
IsStandard = (TARGET.JobUniverse == $(STANDARD))
IsVanilla = (TARGET.JobUniverse == $(VANILLA))
IsVM = (TARGET.JobUniverse == $(VM))
KERNEL_TUNING_LOG = $(LOG)/KernelTuning.log
LINUX_KERNEL_TUNING_SCRIPT = $(LIBEXEC)/linux_kernel_tuning
LOCAL_UNIV_EXECUTE = $(SPOOL)/local_univ_execute
LOCAL_UNIVERSE_JOB_CLEANUP_RETRY_DELAY = 30
LOCAL_UNIVERSE_MAX_JOB_CLEANUP_RETRIES = 5
SCHED_UNIV_RENICE_INCREMENT = 0
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 200
START_SCHEDULER_UNIVERSE = 500
SYSTEM_STARTD_JOB_ATTRS = ImageSize, ExecutableSize, JobUniverse, NiceUser, CPUsUsage, ResidentSetSize, ProportionalSetSizeKb, MemoryUsage, DiskUsage, ScratchDirFileCount
SYSTEM_VALID_SPOOL_FILES = job_queue.log, job_queue.log.tmp, history, Accountant.log, Accountantnew.log, local_univ_execute, .pgpass, .schedd_address, .schedd_address.super, .schedd_classad, OfflineLog
UNICORE_GAHP = $(SBIN)/unicore_gahp
VM_UNIV_NOBODY_USER =

 


Wendy/Wenjing