Hi Wendy,
The use of condor_q -better is not going to be useful in this case, because it analyzes the job against execute machines' requirements, while a scheduler universe job runs locally on the access point. Has DAGMan spewed out all of the extra DAGMan files, specifically the *.dagman.out file (in your case 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.dagman.out)?
If so, could you send that over to me? You can send it to me privately if you want.
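In case it helps, a quick sketch of how to check for that file from the DAG's submit directory (the filename below is taken from your submit description; adjust if your DAG lives elsewhere):

```shell
# Run on the access point, in the directory you submitted the DAG from.
DAG=7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag

# DAGMan writes its debug log next to the DAG file as <dag>.dagman.out.
if [ -f "${DAG}.dagman.out" ]; then
    # Skim the tail for errors; DAGMan logs why node jobs are not submitted.
    tail -n 50 "${DAG}.dagman.out"
else
    echo "no ${DAG}.dagman.out found - DAGMan may never have started"
fi
```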
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of wuwj@xxxxxxxxx <wuwj@xxxxxxxxx>
Sent: Thursday, January 12, 2023 3:56 PM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Cc: guhitj <guhitj@xxxxxxxxx>
Subject: Re: [HTCondor-users] [External] schduler universe job won't start on the submisson machine

Hi Michael
Thanks for your response! And go blue!
Indeed, these jobs are DAG jobs, so we have to use the scheduler universe instead of local.
I run condor_submit_dag on the DAG file [1], and Condor generates the submission file [2]. Then the jobs sit idle and never get executed, and I do not see any matching job slot [3].
[1]
JOB training /xxx/training.sub
JOB evaluation /xxx/evaluation.sub
PARENT training CHILD evaluation
[2]
# Generated by condor_submit_dag 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag
universe = scheduler
executable = /usr/bin/condor_dagman
getenv = True
output = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lib.out
error = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lib.err
log = 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.dagman.log
remove_kill_sig = SIGUSR1
+OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False
arguments = "-p 0 -f -l . -Lockfile 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag 7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag -Suppress_notification -CsdVersion $CondorVersion:' '9.0.17' 'Oct' '04' '2022' 'PackageID:' '9.0.17-1.1' '$ -Dagman /usr/bin/condor_dagman"
environment = _CONDOR_SCHEDD_ADDRESS_FILE=/var/lib/condor/spool/.schedd_address;_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_SCHEDD_DAEMON_AD_FILE=/var/lib/condor/spool/.schedd_classad;_CONDOR_DAGMAN_LOG=7a2af431-4f48-4dec-8302-ee602b2ef6d2.dag.dagman.out
queue
[3]
bash-4.2$ condor_q -better 543
-- Schedd: gc-7-31.aglt2.org : <10.10.1.114:9618?...
The Requirements expression for job 543.000 is
((TARGET.TotalDisk >= 21000000 && TARGET.IsSL7WN is true && TARGET.AGLT2_SITE == "UM")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
(TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory)
Job 543.000 defines the following attributes:
DiskUsage = 325
RequestDisk = DiskUsage
RequestMemory = 3200
The Requirements expression for job 543.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 12210 TARGET.TotalDisk >= 21000000
[1] 12166 TARGET.IsSL7WN is true
[2] 12154 [0] && [1]
[3] 6478 TARGET.AGLT2_SITE == "UM"
[4] 6412 [2] && [3]
WARNING: Analysis is meaningless for Scheduler universe jobs.
543.000: This schedd's StartSchedulerUniverse evalutes to true for this job.
543.000: Run analysis summary ignoring user priority. Of 1 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are able to run your job
Wendy/Wenjing
wuwj@xxxxxxxxx