
Re: [HTCondor-users] Stuck dagman jobs after restart



On 15/12/2014 15:50, R. Kent Wenger wrote:
> On Mon, 15 Dec 2014, Brian Bockelman wrote:
>
>> Hi Brian,
>>
>> It might be worth it to look at the UserLog of these jobs - it's possible they are switching quickly between R and I?
>
> Hmm, you could look, but I'd be really surprised if that were happening.
> Could you send us your SchedLog? I think that's the most likely log to give us some useful information.
> We actually have a test for DAGs getting correctly restarted across a
> Condor restart, so I'm a little surprised this is happening.
>
> Something else I just thought of -- you might want to try doing
> condor_hold and then condor_release on one of the DAGs, to see if that
> gets it to run (just a wild guess).

I thought I had condor_rm'd these jobs, but right now I see they're still there.
condor_hold and condor_release didn't help.

It's possible that the working directory for these two jobs has been removed. Oh well, not to worry. I've condor_rm'd them (again?) and I'll let you know if they resurface :-)
Thanks,

Brian.