Re: [HTCondor-users] Queries regarding reset retries in rescue dag



Thanks a lot, Cole.

Yeah. I work with Vijay on this, as you may have suspected.

We still haven't been able to get firm evidence that the Dagman config file was read,
but after we removed `-DoRecover` from the `condor_dagman` arguments,
the retry count appears to be reset and Dagman does what we expect it to do.

It looks like at some point in the distant past the CRAB developers decided to
switch Dagman from Rescue to Recovery mode:
https://github.com/dmwm/CRABServer/commit/c812d1c1a7c5fc1e5d7a5ef9f27c247fde2c7a4f#diff-cc7fafd6621a3816cc74145abaa7220e550bf8933933ab306af23467af7119c4

We are now trying to switch to Rescue mode instead since, as discussed,
we want to remove the code that hacks the Dagman logs and status files.

I think we need to go a bit further down this path before we understand how
to use it. Then we can maybe have a discussion about whether our
strategy makes sense for our goal. IIUC, Dagman will still use Recovery mode
in case of incidents like schedd restarts, machine reboots, etc. That's fine.

Stefano