
[Condor-users] Passing condor_dagman args with condor_submit_dag?



Hi,

I am attempting to use Condor for a large distributed batch-processing project. I'm using condor_dagman as a meta-scheduler to limit the number of jobs that run at the same time. Each job is one iteration of a loop, and the loops are nested two levels deep. To put some numbers on it: my outer loop iterates 100 times and my inner loop iterates 1000 times (each loop is expressed as a DAG). I implement looping by unrolling the logical loop into a dynamically generated DAG file.
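A minimal sketch of the unrolling step in Python: it emits one JOB line per inner-loop iteration. The node names (PROC_<outer>_<inner>) and submit-file names (process_<outer>_<inner>.sub) are hypothetical placeholders, and it assumes the iterations are independent, so there are no PARENT/CHILD lines and the throttling comes only from DAGMan's -MaxJobs/-MaxIdle settings.

```python
# Hypothetical sketch: unroll one inner loop into DAG text.
# Node and submit-file names are placeholders, not the original setup's names.


def unroll_inner_loop(outer_index: int, inner_count: int) -> str:
    """Return DAG text with one JOB line per inner-loop iteration."""
    lines = []
    for inner in range(inner_count):
        node = f"PROC_{outer_index}_{inner}"
        submit_file = f"process_{outer_index}_{inner}.sub"
        lines.append(f"JOB {node} {submit_file}")
    # Assumed independent iterations: no PARENT/CHILD lines, so DAGMan's
    # -MaxJobs / -MaxIdle limits are what keep the queue small.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(unroll_inner_loop(outer_index=111, inner_count=3), end="")
```

Writing the returned text to maindag_111.dag (or similar) gives DAGMan a flat list of nodes to throttle.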
While this approach may keep Condor's scheduler from being overloaded
with jobs, it creates another problem: organizing the files on disk so
that one directory doesn't end up holding on the order of
100 * 1000 = 100,000 submit files (plus a multiple of that for output
and log files).  I'm starting with the obvious approach: make a
directory and subdirectory for each iteration of the inner loop.
However, I am running into a problem.
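The arithmetic behind the directory problem, plus a hypothetical helper that creates one subdirectory per inner DAG, named by an id like the 111/ and 222/ directories in the example below. The "100 directories of 1000 files" figure assumes one inner DAG per outer-loop iteration; the helper name and layout are illustrative only.

```python
import os
import tempfile

OUTER = 100   # outer-loop iterations, from the post
INNER = 1000  # inner-loop iterations, from the post


def make_layout(base: str, dag_ids: list[int]) -> list[str]:
    """Create base/<id>/ for each inner DAG and return the paths (hypothetical helper)."""
    paths = []
    for dag_id in dag_ids:
        path = os.path.join(base, str(dag_id))
        os.makedirs(path, exist_ok=True)  # safe to re-run
        paths.append(path)
    return paths


if __name__ == "__main__":
    print("flat:", OUTER * INNER, "submit files in one directory")
    print("nested:", OUTER, "directories of", INNER, "submit files each")
    with tempfile.TemporaryDirectory() as tmp:
        for p in make_layout(tmp, [111, 222]):
            print(os.path.relpath(p, tmp))
```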
condor_submit_dag -no_submit accepts my *.dag file and produces a
*.dag.condor.sub file, but I am having trouble referencing this
*.dag.condor.sub file from a *.dag file in the parent directory.  I
think this is because condor_submit_dag does not let me set some of
condor_dagman's arguments in the submit file it generates.  For example:
outer *.dag file:

JOB MAINDAG_111 111/maindag_111.dag.condor.sub
JOB MAINDAG_222 222/maindag_222.dag.condor.sub

111/maindag_111.dag.condor.sub:

# Filename: maindag_111.dag.condor.sub
# Generated by condor_submit_dag maindag_111.dag
universe        = scheduler
executable      = /opt/condor/bin/condor_dagman
getenv          = True
output          = maindag_111.dag.lib.out
error           = maindag_111.dag.lib.out
log             = maindag_111.dag.dagman.log
remove_kill_sig = SIGUSR1
on_exit_remove  = (ExitBySignal == false || ExitSignal =!= 9)
arguments = -f -l . -Debug 3 -Lockfile maindag_111.dag.lock -Condorlog /tmp/exp6/111/process_a_111.log -Dag maindag_111.dag -Rescue maindag_111.dag.rescue -MaxIdle 5 -MaxJobs 1 -UseDagDir
environment = _CONDOR_DAGMAN_LOG=maindag_111.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
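For reference, a Python sketch that regenerates the setup shown above: the condor_submit_dag -no_submit invocation for each subdirectory (returned as an argv list plus working directory rather than executed) and the outer DAG's JOB lines. The helper names are hypothetical, running the command from inside each subdirectory is an assumption, and this reproduces the current layout; it does not by itself fix the path problem.

```python
# Hypothetical helpers; file naming follows the example (111/maindag_111.dag).


def inner_dag_name(dir_id: int) -> str:
    return f"maindag_{dir_id}.dag"


def no_submit_command(dir_id: int) -> tuple[list[str], str]:
    """Return (argv, working directory) for generating one inner submit file.

    Assumption: running from inside the subdirectory puts
    maindag_<id>.dag.condor.sub next to its .dag file.
    """
    return ["condor_submit_dag", "-no_submit", inner_dag_name(dir_id)], str(dir_id)


def outer_dag(dir_ids: list[int]) -> str:
    """Outer DAG text: one node per inner DAG's generated submit file."""
    lines = [f"JOB MAINDAG_{d} {d}/{inner_dag_name(d)}.condor.sub" for d in dir_ids]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for d in (111, 222):
        argv, cwd = no_submit_command(d)
        # On a Condor host this could be: subprocess.run(argv, cwd=cwd, check=True)
        print(f"(cd {cwd} && {' '.join(argv)})")
    print(outer_dag([111, 222]), end="")
```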

When the outer condor_dagman tries to run the inner loop's condor_dagman, it fails because it looks for maindag_111.dag in the outer directory rather than in directory 111 (where the submit file above, and everything else related to maindag_111*, lives).
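The failure comes down to relative-path resolution: a relative name such as maindag_111.dag is resolved against the working directory of the process that opens it. A small illustration, assuming the outer DAG runs in /tmp/exp6 (inferred from the -Condorlog path in the submit file, not stated in the post).

```python
import posixpath


def resolve(cwd: str, path: str) -> str:
    """Resolve a relative path against cwd, as a process running in cwd would."""
    return posixpath.normpath(posixpath.join(cwd, path))


if __name__ == "__main__":
    outer_cwd = "/tmp/exp6"  # assumed working directory of the outer DAGMan
    print(resolve(outer_cwd, "maindag_111.dag"))      # /tmp/exp6/maindag_111.dag (where it looks)
    print(resolve(outer_cwd, "111/maindag_111.dag"))  # /tmp/exp6/111/maindag_111.dag (where it is)
```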
Is there a way to tell condor_submit_dag to pass particular arguments
(e.g. -Dag, -Rescue, output file names) into the submit file it generates?
It would be nice if there were a way to get condor_dagman to chdir()
into a directory before executing.  I looked at -UseDagDir, but it puts
output/log files in the parent directory, which is what I am trying to avoid.
I guess I could write my own condor_submit_dag too, but I'd rather not 
go to that extreme. :-)
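A lighter alternative to writing a whole condor_submit_dag replacement might be to post-process the generated file. The hypothetical Python sketch below prefixes the subdirectory onto relative values of output, error and log and onto the file names after -Lockfile, -Dag and -Rescue. The choice of fields comes from the example submit file; it leaves the environment line's _CONDOR_DAGMAN_LOG value and anything inside maindag_111.dag untouched, and it only rewrites text, so whether condor_dagman then behaves as intended is untested.

```python
# Hypothetical post-processing of a generated *.dag.condor.sub file.
# Text rewrite only; not checked against condor_dagman's behavior.

PATH_KEYS = {"output", "error", "log"}         # submit commands holding file names
PATH_FLAGS = {"-Lockfile", "-Dag", "-Rescue"}  # dagman flags followed by a file name


def prefix(subdir: str, value: str) -> str:
    """Prepend subdir to a relative path; leave absolute paths alone."""
    return value if value.startswith("/") else f"{subdir}/{value}"


def rewrite_submit(text: str, subdir: str) -> str:
    out = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        name = key.strip().lower()
        if sep and name in PATH_KEYS:
            line = f"{key}= {prefix(subdir, value.strip())}"
        elif sep and name == "arguments":
            tokens = value.split()
            # Rewrite the token that follows each path-taking flag.
            for i in range(len(tokens) - 1):
                if tokens[i] in PATH_FLAGS:
                    tokens[i + 1] = prefix(subdir, tokens[i + 1])
            line = f"{key}= {' '.join(tokens)}"
        # Every other line (comments, environment =, queue) is copied unchanged.
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "log = maindag_111.dag.dagman.log\n"
        "arguments = -f -Dag maindag_111.dag -Rescue maindag_111.dag.rescue\n"
    )
    print(rewrite_submit(sample, "111"), end="")
```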
Any insight would be great.  Thanks!

 - Armen

--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796