[Condor-users] condor-C to DAGman problem
Hi,
I am trying to use Condor-C with DAGMan, that is, to submit a DAGMan job to
a remote machine. As the Condor team suggested, I modified the generated
dag.condor.sub file and submitted it with condor_submit.
However, the DAG job still ended very quickly, without submitting or running
any node jobs. The file dag_test.dag.lib.err is empty, but
dag_test.dag.dagman.out contains the error shown below, which I do not
understand. I searched for this error on Google but found nothing helpful.
Can anyone give some suggestions? Thank you very much!
PS: I can submit a local DAGMan job and it runs well; the error occurs only
in remote mode. I can also submit local jobs from the remote machine, so the
schedd daemon on the remote machine appears to be working.
The error is as below:
9/1 14:25:15 Submitting Condor Node testA job(s)...
9/1 14:25:15 submitting: condor_submit -a dag_node_name' '=' 'testA -a +DAGManJobId' '=' '-1 -a DAGManJobId' '=' '-1 -a submit_event_notes' '=' 'DAG' 'Node:' 'testA -a +DAGParentNodeNames' '=' '"" testA.sub
9/1 14:25:16 From submit:
9/1 14:25:16 From submit: ERROR: Can't find address of local schedd
9/1 14:25:16 failed while reading from pipe.
9/1 14:25:16 Read so far: ERROR: Can't find address of local schedd
9/1 14:25:16 ERROR: submit attempt failed
9/1 14:25:16 submit command was: condor_submit -a dag_node_name' '=' 'testA -a +DAGManJobId' '=' '-1 -a DAGManJobId' '=' '-1 -a submit_event_notes' '=' 'DAG' 'Node:' 'testA -a +DAGParentNodeNames' '=' '"" testA.sub
9/1 14:25:16 Job submit try 2/6 failed, will try again in >= 2 seconds.
Here are the files I use:
DAG file:
JOB testA testA.sub
JOB testB testB.sub
JOB testC testC.sub
PARENT testA CHILD testB testC
PARENT testC CHILD testB
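As a sanity check on the dependencies, the order implied by the DAG file above can be worked out with a short script. This is only a minimal sketch in Python, assuming the file contains nothing but JOB and PARENT ... CHILD lines; the name topo_order is illustrative and is not part of Condor.

```python
from collections import defaultdict, deque

# The DAG file from this post, copied verbatim.
DAG_TEXT = """\
JOB testA testA.sub
JOB testB testB.sub
JOB testC testC.sub
PARENT testA CHILD testB testC
PARENT testC CHILD testB
"""


def topo_order(dag_text):
    """Return node names in an order that respects the PARENT/CHILD lines."""
    nodes = []                      # node names in declaration order
    children = defaultdict(list)    # parent name -> list of child names
    indegree = {}                   # node name -> number of unfinished parents

    for line in dag_text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            nodes.append(words[1])
            indegree.setdefault(words[1], 0)
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after is a child.
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                for child in words[split + 1:]:
                    children[parent].append(child)
                    indegree[child] = indegree.get(child, 0) + 1

    # Kahn's algorithm: repeatedly release nodes whose parents are all done.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(nodes):
        raise ValueError("cycle detected in DAG")
    return order


if __name__ == "__main__":
    print(topo_order(DAG_TEXT))
```

For this DAG the script reports testA first, then testC, then testB, so testA is the only node that has to be submitted at startup. That matches the log, where the failure happens on the very first submit attempt for testA.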
dag.condor.sub file:
# Filename: dag_test.dag.condor.sub
# Generated by condor_submit_dag dag_test.dag
universe = grid
grid_resource = condor L50.com L50**.com
executable = C:\condor\bin\condor_dagman.exe
getenv = True
output = dag_test.dag.lib.out
error = dag_test.dag.lib.err
log = dag_test.dag.dagman.log
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False
arguments = -f -l . -Debug 3 -Lockfile dag_test.dag.lock -Condorlog DAGmantest.log.txt -Dag dag_test.dag -Rescue dag_test.dag.rescue
environment = _CONDOR_DAGMAN_LOG=dag_test.dag.dagman.out|_CONDOR_MAX_DAGMAN_LOG=0
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = DAGmantestA.bat,testA.sub,DAGmantestB.bat,testB.sub,DAGmantestC.bat,testC.sub,dag_test.dag
queue
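Since the remote DAGMan can only see the files that are transferred with it, one thing that can be checked mechanically is that every node submit file named in the DAG appears in transfer_input_files. The following is a rough sketch in Python, not a Condor tool; the helper names submit_value and missing_submit_files are hypothetical, and it assumes each submit command sits on a single "key = value" line and that file names are compared case-sensitively.

```python
# DAG file and the relevant submit-file line from this post.
DAG_TEXT = """\
JOB testA testA.sub
JOB testB testB.sub
JOB testC testC.sub
PARENT testA CHILD testB testC
PARENT testC CHILD testB
"""

SUBMIT_TEXT = """\
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = DAGmantestA.bat,testA.sub,DAGmantestB.bat,testB.sub,DAGmantestC.bat,testC.sub,dag_test.dag
queue
"""


def submit_value(submit_text, key):
    """Return the value of the first 'key = value' line, or None if absent."""
    for line in submit_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and commands without '=' (such as 'queue').
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


def missing_submit_files(dag_text, submit_text):
    """List node submit files named in JOB lines but not transferred."""
    value = submit_value(submit_text, "transfer_input_files") or ""
    listed = {name.strip() for name in value.split(",") if name.strip()}
    needed = []
    for line in dag_text.splitlines():
        words = line.split()
        if len(words) >= 3 and words[0].upper() == "JOB":
            needed.append(words[2])  # third word is the submit file name
    return [name for name in needed if name not in listed]


if __name__ == "__main__":
    print(missing_submit_files(DAG_TEXT, SUBMIT_TEXT))
```

With the files shown in this post the list comes back empty, so the transfer list itself does not look like the cause of the submit failure.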
Job submit file (only testA is given here):
Universe = Vanilla
Executable = DAGmantestA.bat
GetEnv = True
RunAsOwner = True
Log = DAGmantest.log.txt
Error = DAGmantest.bat.error.txt
Queue