Re: [HTCondor-users] Jobs Becoming Idle SharedPortClient Error
- Date: Wed, 30 Aug 2017 11:51:49 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
> The above content of the job.log confuses me, clearly the job had run
> for 20 seconds, why had the job.log not been updated to include the
> message that the job was executing on host xxxx?
Because the job never started. The job goes into the run state
when the schedd forks its shadow, not when the starter actually starts the
job. This usually doesn't matter much -- although it can if file transfer
in is slow enough -- but in this case it's confusing.
As for the shadow log, if I recall correctly, the job exits the
run state after file transfer out finishes -- it doesn't wait for the
starter (or startd) to finish cleaning up the job sandbox. Therefore,
HTCondor can try to start a job in a slot before it's been cleaned up
from the previous job. Rather than wait indefinitely, the shadow gives up
if the slot's not ready after twenty seconds, exiting with code 108, which
the manual defines as JOB_NOT_STARTED.
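
To illustrate the give-up-after-a-timeout behaviour described above, here is a minimal Python sketch. It is not HTCondor's implementation: the function, the slot_is_ready callback, the polling interval, and the SLOT_READY return value are all hypothetical. Only the twenty-second limit and exit code 108 come from this message.

```python
import time

JOB_NOT_STARTED = 108  # exit code named in this thread
SLOT_READY = 0         # hypothetical success value, for illustration only


def wait_for_slot(slot_is_ready, timeout=20.0, poll=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll slot_is_ready() until it returns True or `timeout` seconds pass.

    Returns SLOT_READY if the slot became ready in time, otherwise
    JOB_NOT_STARTED. `clock` and `sleep` are injectable so the loop can be
    exercised without actually waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if slot_is_ready():
            return SLOT_READY
        sleep(poll)
    return JOB_NOT_STARTED
```

If the previous job's sandbox takes longer than the timeout to clean up, as in the 21-second case discussed in this thread, a loop like this would return JOB_NOT_STARTED.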
The startd log is a little less useful -- the StarterLog for the
given slot may have more information. At any rate, accounting for what
appears to be an 8-second clock difference, the stories match up: it takes
the starter 21 seconds to clean up the job directory, it accepts the job
after the shadow restart, and then decides that the negotiator was wrong
about the job actually matching, and kicks it back off.
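
To line up entries from logs whose clocks disagree, one option is to shift one log's timestamps by the observed offset before comparing them. A minimal Python sketch, assuming each log line starts with a timestamp in the format of the log line quoted in this thread (08/29/17 17:51:38). The function name and the offset argument are hypothetical, and the sign of the offset depends on which clock is ahead.

```python
from datetime import datetime, timedelta

# Timestamp format as it appears in the log line quoted in this thread.
HTCONDOR_TS = "%m/%d/%y %H:%M:%S"
TS_LENGTH = len("08/29/17 17:51:38")


def shifted_timestamp(line, offset_seconds):
    """Parse the leading timestamp of a log line and shift it.

    Returns a datetime moved by offset_seconds, or None if the line does
    not start with a timestamp in the expected format.
    """
    try:
        stamp = datetime.strptime(line[:TS_LENGTH], HTCONDOR_TS)
    except ValueError:
        return None
    return stamp + timedelta(seconds=offset_seconds)
```

For example, shifted_timestamp(line, 8) moves an entry eight seconds later, so it can be compared directly against a log written on the machine whose clock runs ahead.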
> 08/29/17 17:51:38 (fd:4) (pid:4060) (D_ALWAYS) ERROR: SharedPortClient:
> Failed to open named pipe id '1904_30e0_4' as requested by STARTD
> <10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3>
> on <10.122.227.253:56884> for sending socket: 2 The system cannot find
> the file specified.
This probably just means that someone tried to contact a starter
after it had killed itself. You should be able to find the named pipe id
elsewhere in the HTCondor logs of the machine that produced this error;
it will show up after the string 'sock='. The third line quoted above is
an example.
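
A minimal Python sketch of that search, assuming the ids look like the ones in the quoted error (letters, digits, and underscores after 'sock='). The regular expression and both function names are hypothetical helpers, not part of HTCondor.

```python
import re

# Matches the id that follows 'sock=' in an address such as
# <10.122.227.253:9618?addrs=10.122.227.253-9618&noUDP&sock=3696_5e98_3>
SOCK_RE = re.compile(r"sock=([0-9A-Za-z_]+)")


def sock_ids(line):
    """Return every id that appears after 'sock=' in one log line."""
    return SOCK_RE.findall(line)


def lines_mentioning(lines, pipe_id):
    """Return the log lines in which pipe_id appears after 'sock='."""
    return [line for line in lines if pipe_id in sock_ids(line)]
```

Feeding the lines of each daemon log on the machine to lines_mentioning(lines, "1904_30e0_4") would show which addresses advertised the pipe that could not be opened.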
> The SharedPortClient error appeared to occur around the 20 second mark
> for when the job got evicted again.
The starter that's trying to clean up may finish, give up, or be
killed at this point. (The startd should try to finish cleaning up if the
starter doesn't exit cleanly.)
> Maybe this is somehow related to how the execute machine is being shared
> between multiple central managers?
That's more likely to cause weird timing errors. The other thing
that may be worth checking is whether the slot was matched by more than one
negotiator. (Check the match log of all your central managers.) I
don't know what that would look like in the logs, or to the job, but it's an
inevitable part of reporting to more than one CM.
- ToddM