I just watched a DAG take 63 minutes to remove the last DAG node after a
condor_rm command. The DAG nodes are running on remote resources, but
using the new glideinWMS system from FNAL/UCSD. Is there some way to
have condor_rm finish more quickly?

The full DAG had 100k nodes, but we have a configuration setting
limiting a DAG to no more than 1000 idle nodes, so there were about 1000
queued (idle) jobs in our local job pool, and maybe 100 running, when
the condor_rm command was issued. An hour seems like a long time to
remove 100 running jobs.

You can see the logs here if you like:

  http://glidein.nebiogrid.org/~ijstokes/phaser/clean/3cqg/config_old/

50 jobs from the overall DAG finished *after* the condor_rm -- the
latest one around 1 hour after the command was issued.

The problem we see when this happens is the following:

  3:00pm: condor_submit_dag job.dag
  4:00pm: discover mistake, execute condor_rm dag.jobid
  4:10pm: fix script, ClassAd, or DAG; resubmit DAG
  4:11pm: oops, the rescue DAG and log files still exist. Delete these,
          resubmit.
  A - 4:12pm: hey, the log files are still there! The DAG nodes are
          still running and writing to them (thus re-creating them).
  B - 5:00pm: discover that old jobs are still running and have now
          mixed their output with the new jobs and DAG.

Scenario B is what I've just witnessed, but I'd swear I've seen
scenario A before as well.

What advice do people have for completing a condor_rm in <15 minutes?

Ian

--
Ian Stokes-Rees, PhD               W: http://hkl.hms.harvard.edu
ijstokes@xxxxxxxxxxxxxxxxxxx       T: +1 617 432-5608 x75
NEBioGrid, Harvard Medical School  C: +1 617 331-5993