I ran into some strange DAGMan behaviour in a software system I maintain. The system submits a DAG that may recursively submit another DAG, and so on. The problem is that one of the DAGs eventually fails without ever having run. The first DAG always succeeds, but each following DAG seems to have about a 50-50 chance of success. Sometimes it will iterate several times, but most of the time it fails on the first or second iteration.

In the DAGMan log for the last DAG, I get this:

005 (518.000.000) 10/18 15:42:55 Job terminated.
    (0) Abnormal termination (signal 9)
    (0) No core file
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job

I've managed to reduce the system to a couple of simple Perl scripts, and I get the same behaviour with them on the machine I'm working on. The scripts are really straightforward. They need to be run on a machine that is both a Condor Submit and Execute machine. The behaviour is far from consistent, but it tends to fail earlier rather than later.

I'm running Condor 6.7.12.

I can't come up with a reason for the behaviour, but I'm by no means a Condor guru. Any suggestions are welcome.

(And if you're going to suggest that I use the Condor Perl module, I agree... but my supervisor doesn't.)

Mark

#!/usr/bin/perl

# The script needs to be run on a machine that can both Submit and
# Execute Condor jobs. I use a Requirement to ensure that they are the
# same machine, but this should work as long as all accessible Execute
# machines are also Submit machines.
my $condor_machine = "your.condor.submit.and.execute.machine";

my $max_runs = 10;

# The current iteration is passed as the first argument (default 0).
my $run = 0;
if (defined($ARGV[0])) {
    $run = $ARGV[0];
}

if ($run < $max_runs) {
    my $next_run = $run + 1;

    # Submit file for the job that does the actual work.
    open(TESTJOB, ">test.job.$run")
        or die "Couldn't open 'test.job.$run'.\n";
    print TESTJOB <<"EOF";
Universe = vanilla
Executable = testscript.sh
Requirements = Machine == "$condor_machine"
Log = test.log.$run
Output = test.out.$run
Error = test.error.$run
GetEnv = true
Notification = never
Queue
EOF
    close(TESTJOB);

    # Submit file for the job that re-runs this script for the next iteration.
    open(CHECKJOB, ">check.job.$run")
        or die "Couldn't open 'check.job.$run'.\n";
    print CHECKJOB <<"EOF";
Universe = vanilla
Executable = checkscript.pl
Arguments = $next_run
Requirements = Machine == "$condor_machine"
Log = test.log.$run
Output = check.out.$run
Error = check.error.$run
GetEnv = true
Notification = never
Queue
EOF
    close(CHECKJOB);

    # DAG for this iteration: run the test job, then the check job.
    open(DAG, ">test.dag.$run")
        or die "Couldn't open 'test.dag.$run'.\n";
    print DAG <<"EOF";
Job test test.job.$run
Job check check.job.$run
PARENT test CHILD check
EOF
    close(DAG);

    my $retval = system("condor_submit_dag -notification never test.dag.$run");

    # Did condor_submit_dag fail?
    print $retval / 256, "\n";
}

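A side note on the last check in the script: dividing the return value of system() by 256 only recovers the exit code, so a command that was killed by a signal, or that could not be started at all, shows up as a fraction or a negative number. Below is a minimal sketch of a fuller check. The helper name describe_status is hypothetical and not part of the original scripts, and the "true" command merely stands in for condor_submit_dag so the example runs on any Unix-like machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): turn the value
# returned by system() into a human-readable description.
sub describe_status {
    my ($status) = @_;

    # system() returns -1 when the command could not be started at all.
    return "failed to execute" if $status == -1;

    # The low 7 bits hold the signal number, if the command was killed.
    my $signal = $status & 127;
    return "killed by signal $signal" if $signal;

    # Otherwise the exit code lives in the high byte.
    return "exited with code " . ($status >> 8);
}

# Stand-in command; in the script above this would be the
# condor_submit_dag call.
my $status = system("true");
print describe_status($status), "\n";
```

This only reports on condor_submit_dag itself. It says nothing about the signal 9 in the DAGMan log, which concerns a job that was already submitted.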
Attachment:
testscript.sh
Description: Bourne shell script