
Re: [HTCondor-users] Jobs getting died on Signal 9



Hi Gagan,

If you're on HTCondor 23.5 or later, another way to get the StarterLog back is to add "starter_log = path/to/file.log" to your submit file (and optionally set "starter_debug" to the debug level you want). This tells condor to keep a copy of the StarterLog for your job in the job's scratch directory and to transfer it back, including on job failure:

https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#starter_debug
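
As a rough sketch, a submit file using these two commands might look like the one below. The executable and file names are made up for illustration, and D_FULLDEBUG is only an assumed example of a debug level; check the linked man page for the values your version accepts.

```
# Hypothetical job; replace with your own executable and file names
executable    = my_job.sh
output        = job.$(Cluster).$(Process).out
error         = job.$(Cluster).$(Process).err
log           = job.$(Cluster).log

# Requires HTCondor 23.5 or later: transfer a per-job copy of the
# StarterLog back to this path, including when the job fails
starter_log   = starter.$(Cluster).$(Process).log

# Optional: debug level for the per-job starter log (assumed example value)
starter_debug = D_FULLDEBUG

queue
```

With something like this in place, the starter log for a failed job should end up next to the job's other output files on the access point.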

Jason

On Mon, Mar 3, 2025 at 9:07 AM Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Hi Gagan,

In /var/log/condor on the execution point, the StarterLog for the slot that the job ran on might have more information on what happened, particularly if it had something to do with condor. If the error is not condor related, one thing you can try when submitting the job, particularly if you have reason to expect it to fail, is to set stream_output and stream_error in your submit file. These stream the job's stdout and stderr back to the access point while the job runs. You will probably want to limit this to only a couple of jobs, as it can stress the network and disk of the access point: https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#stream_error
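
A minimal sketch of a submit file with streaming turned on, assuming a hypothetical executable and file names, and queuing only two jobs to keep the load on the access point small:

```
# Hypothetical job; replace with your own executable and file names
executable    = my_job.sh
output        = job.$(Cluster).$(Process).out
error         = job.$(Cluster).$(Process).err
log           = job.$(Cluster).log

# Stream stdout and stderr back to the access point while the job runs,
# so whatever was printed before the job was killed is not lost
stream_output = True
stream_error  = True

# Keep this to a couple of jobs; streaming can stress the access point
queue 2
```

If a job then dies on signal 9, the .out and .err files should contain whatever it wrote up to that point.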

Jason

On Fri, Feb 28, 2025 at 8:49 AM gagan tiwari <gagan.tiwari@xxxxxxxxxxxxxxxxxx> wrote:
Hi Guys,
At times, jobs running on exec nodes crash with a Signal 9 error. But this is a generic message and we don't know exactly what went wrong with the jobs.
Is there any setting in condor which can be tweaked to provide detailed information about what exactly happened?




Thanks,
Gagan
