Re: [HTCondor-users] job cannot reconnect to starter running MPI
- Date: Thu, 08 Jun 2017 14:17:51 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] job cannot reconnect to starter running MPI
> I was monitoring the logs in the execution nodes as suggested earlier
> and I got some errors just after HTCondor set the job from Running ->
> Idle.
My question is: why is the job going idle? What was in the
starter log before the starter caught the SIGQUIT (that is, why was it being
killed)? The ShadowLog you quoted below runs from 14:37:13 to 14:40:04; what do
the StartLog and StarterLog.* say for that period of time on those nodes?
And why is the StartLog you showed full of deactivate_claim_forcibly()?
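
If it helps to pull out just that window, here is a minimal Python sketch for doing so. It assumes each log entry starts with a timestamp of the form MM/DD/YY HH:MM:SS; if your logging configuration adds or changes the prefix, the pattern will need adjusting. The function name lines_in_window is made up for illustration and is not part of HTCondor.

```python
import re
from datetime import datetime

# Assumption: log entries begin with "MM/DD/YY HH:MM:SS".
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")
FMT = "%m/%d/%y %H:%M:%S"


def lines_in_window(lines, start, end):
    """Return the log lines whose timestamp falls within [start, end].

    Lines without a leading timestamp (continuation lines) are kept
    only when the most recent timestamped line was inside the window.
    """
    lo = datetime.strptime(start, FMT)
    hi = datetime.strptime(end, FMT)
    keep = False
    selected = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), FMT)
            keep = lo <= stamp <= hi
        if keep:
            selected.append(line)
    return selected
```

Read each StartLog or StarterLog.* file into a list of lines and call lines_in_window(lines, "<date> 14:37:13", "<date> 14:40:04"), with <date> replaced by the day of the run in MM/DD/YY form. Run it on every node involved so the entries can be compared against the ShadowLog.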
- ToddM