Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.
...
The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart.
From the perspective of someone submitting jobs, HTCondor takes the position that robustness means that the submitter doesn't have to take any particular action because of a hardware or software failure; HTCondor won't forget the job and will (eventually) run it again.
- ToddM