Hi Steffen,

no good idea here, but have you checked for anything suspicious on the scheduler? The `Shadow exception!` sounds suspicious to me: could it be not a job fault but a problem on the schedd with the job shadows?

Is there maybe something in the schedd logs? (Or maybe disk space ran out, or the available file handles got used up?)
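
The two resource checks mentioned above (disk space and file handles) could be sketched roughly like this on the submit host. This is only a minimal sketch, assuming a Linux machine: the function names are made up for illustration, and `/proc/sys/fs/file-nr` is the usual Linux location for the system-wide handle counters.

```shell
#!/bin/sh
# Hypothetical helpers, not part of HTCondor.

# Print the usage percentage (without the % sign) of the filesystem
# that holds the directory given as $1.
fs_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print "allocated max" file handles, read from a file-nr style file.
# Defaults to /proc/sys/fs/file-nr (Linux only); the columns there are
# allocated, unused, and maximum.
file_handle_usage() {
    awk '{ print $1, $3 }' "${1:-/proc/sys/fs/file-nr}"
}

# Possible use on the schedd host (needs HTCondor in PATH):
#   fs_usage_percent "$(condor_config_val LOG)"
#   fs_usage_percent "$(condor_config_val SPOOL)"
#   file_handle_usage
```

A usage value close to 100, or an allocated count close to the maximum, would point at the host rather than at the jobs themselves. Per-user limits (`ulimit -n`) are a separate matter and are not covered here.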
Cheers,
Thomas

On 07/01/2021 09.09, Steffen Grunewald wrote:
Good morning,

in my pool, there are a couple of jobs going into Hold state, with the HoldReason(s) shown in the subject. The last log lines look like this:

...
012 (347486.000.000) 01/07 09:00:58 Job was held.
    Error from slot1_5@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
    Code 9 Subcode 0
...
012 (347487.000.000) 01/07 09:00:58 Job was held.
    Error from slot1_3@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
    Code 9 Subcode 0
...
007 (347485.000.000) 01/07 09:00:58 Shadow exception!
    Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
    0 - Run Bytes Sent By Job
    2469644 - Run Bytes Received By Job
...
012 (347485.000.000) 01/07 09:00:58 Job was held.
    Error from slot1_4@xxxxxxxxxxxxxxxxxxx: unable to establish standard output stream
    Code 9 Subcode 0
...

I ran "condor_q -l $jobid | egrep '^(UserLog|Out|Err)'" and checked the existence of the files on all pool nodes (including the head nodes): nothing suspicious.

How can I debug this further? Do I have a gaping black hole in the pool (that only affects this particular user), or is there something in the submit file (which I haven't found yet) that's different from everything else? condor_release doesn't reset the error state...

Any suggestion is appreciated.
condor_version is 8.8.3 on the HN, and 8.8.x (x >= 3) anywhere else (due to scattered reinstalls, a full update currently isn't possible)

Thanks,
S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
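
One way to approach the "black hole" question from the quoted message is to count the hold events per execute host in the job's user log. The following is only a sketch under assumptions: it expects each event header (`012 (...) ... Job was held.`) on its own line, directly followed by the `Error from slotN@host: ...` line, and the function name `holds_per_host` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count "Job was held" events (event code 012)
# per execute host in an HTCondor user log file given as $1.
holds_per_host() {
    awk '
        # An event header starting with 012 marks a hold event.
        /^012 \(/ { held = 1; next }
        # The line right after the header names the slot and host.
        held && /Error from / {
            host = $3
            sub(/^[^@]*@/, "", host)   # drop the slot name before the @
            sub(/:$/, "", host)        # drop the trailing colon
            count[host]++
        }
        # Any other line ends the window opened by the header.
        { held = 0 }
        END { for (h in count) print count[h], h }
    ' "$1" | sort -rn
}

# Possible use, with the path taken from the UserLog job attribute:
#   holds_per_host /path/to/user.log
```

If nearly all holds come from one or two hosts, those machines are the first candidates to inspect. An even spread across the pool would point more towards the submit side or the submit file.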