I've got a situation where the starter thinks that transferring output files has hung:
02/14/25 05:26:28 ERROR: Child pid 77106 appears hung! Killing it hard.
02/14/25 05:26:28 Starter pid 77106 died on signal 9 (signal 9 (Killed))
02/14/25 05:26:33 slot1: State change: starter exited
It looks like the same situation as https://www-auth.cs.wisc.edu/lists/htcondor-users/2019-April/msg00051.shtml. In that message, Todd refers to increasing the "I think you are dead" timeout.
Is that still the case 6 years later?
If so, what's the "I think you are dead" setting actually called so I can test it?
Is there a way to log the amount of data to be returned, i.e. a transfer_history for output files as well as input files?
I'm seeing it in 23.0.12 on Debian bookworm. I've just updated that starter to 23.0.20 in case that helps.
klint.