> At the time when the shadow log indicates a read failure on the
> connection to the starter, what appears in the corresponding
> StarterLog?
>
> --Dan
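
One way to answer that is to pull the StarterLog entries from a few minutes either side of the shadow's read failure. Below is a minimal Python sketch, assuming the "M/D HH:MM:SS message" line layout seen in the logs quoted below. The function name `entries_near`, the five-minute window and the `year` argument are illustrative choices, not anything Condor provides.

```python
from datetime import datetime, timedelta


def entries_near(lines, when, window=timedelta(minutes=5), year=2008):
    """Return the log lines whose timestamp lies within `window` of `when`.

    Assumes each entry starts with "M/D HH:MM:SS ". The log carries no
    year, so one is supplied by the caller. Lines that do not start with
    a parsable timestamp (wrapped continuations, blank lines) are skipped.
    """
    hits = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(
                f"{year}/{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S"
            )
        except ValueError:
            continue
        if abs(stamp - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# Two entries copied from the StarterLog excerpt quoted below.
SAMPLE = [
    "5/21 07:11:34 IO: Failed to read packet header",
    "5/21 07:11:45 Sent job ClassAd update to startd.",
]

# The shadow reported its read failure at 5/21 23:11:22.
near_failure = entries_near(SAMPLE, datetime(2008, 5, 21, 23, 11, 22))
```

With only the quoted excerpt as input, `near_failure` comes back empty: the StarterLog lines posted so far are from around 07:11, not from the time of the shadow's failure. In practice you would pass the lines of the real StarterLog file instead of `SAMPLE`.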
>
> Alan Cass wrote:
>
> > Hi,
> >
> > Since upgrading to Condor 7.0.0 on our cluster of Student Lab Windows
> > PCs, I have not been able to get a job that takes a 'long' time to
> > complete. The jobs do run the computation (I can see the SIZE being
> > updated in condor_q). As a test I sent a node a 7MB file and got it
> > to 'touch' the file so it would be automatically sent back. This
> > works without a problem. However, if I tell the node to 'sleep' for
> > 7 hours before exiting, the job never finishes: communication with
> > the starter fails, the job requeues, and this behaviour cycles.
> >
> > I'm worried it might be a problem with the University port scanner.
> > Every so often I get entries like this in the nodes' Starter log
> > file (and similar ones in the Master log):
> >
> > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 4 bytes from <SCANNER_IP:PORT>.
> > 5/21 07:11:34 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <SCANNER_IP:PORT>.
> > 5/21 07:11:34 IO: Failed to read packet header
> > 5/21 07:11:34 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> > 5/21 07:11:37 IO: Incoming packet header unrecognized
> > 5/21 07:11:37 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> > 5/21 07:11:37 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:37 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:37 IO: EOF reading packet header
> > 5/21 07:11:37 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> > 5/21 07:11:40 Received HTTP GET connection from <SCANNER_IP:PORT> -- DENIED because ENABLE_WEB_SERVER=FALSE
> > 5/21 07:11:40 IO: Incoming packet header unrecognized
> > 5/21 07:11:40 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> > 5/21 07:11:40 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:40 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:40 IO: EOF reading packet header
> > 5/21 07:11:40 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> > 5/21 07:11:45 Entering JICShadow::updateShadow()
> > 5/21 07:11:45 TokenCache contents:
> > condor-reuse-slot1@.
> > 5/21 07:11:45 In VanillaProc::PublishUpdateAd()
> > 5/21 07:11:45 About to get usage data from ProcD for family with root 4036
> > 5/21 07:11:45 Result of "get_usage" operation from ProcD: SUCCESS
> > 5/21 07:11:45 Inside OsProc::PublishUpdateAd()
> > 5/21 07:11:45 Sent job ClassAd update to startd.
> > 5/21 07:11:45 Leaving JICShadow::updateShadow(): success
> > 5/21 07:11:49 condor_read(): Socket closed when trying to read 4 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:49 condor_read(): Socket closed when trying to read 5 bytes from <SCANNER_IP:PORT>
> > 5/21 07:11:49 IO: EOF reading packet header
> > 5/21 07:11:49 DaemonCore: Can't receive command request from SCANNER_IP (perhaps a timeout?)
> >
> >
> > and the shadow eventually bombs out with:
> >
> > 5/21 23:11:22 (14933.0) (3964): condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <EXEC_IP:PORT>.
> > 5/21 23:11:22 (14933.0) (3964): IO: Failed to read packet header
> > 5/21 23:11:22 (14933.0) (3964): Can no longer talk to condor_starter <EXEC_IP:PORT>
> > 5/21 23:11:22 (14933.0) (3964): Trying to reconnect to disconnected job
> > 5/21 23:11:22 (14933.0) (3964): LastJobLeaseRenewal: 1211370100 Wed May 21 21:11:40 2008
> > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration: 1200 seconds
> > 5/21 23:11:22 (14933.0) (3964): JobLeaseDuration remaining: EXPIRED!
> > 5/21 23:11:22 (14933.0) (3964): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (1200 seconds) expired
> > 5/21 23:11:22 (14933.0) (3964): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
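
The timestamps in this shadow log already show how long the shadow had gone without a lease renewal when it noticed the failure. Here is a small Python sketch of that arithmetic, using only the wall-clock times and the JobLeaseDuration value printed in the log above. The variable names are mine, and the year is taken from the LastJobLeaseRenewal line.

```python
from datetime import datetime

# Values copied from the shadow log above (local wall-clock times).
last_renewal = datetime(2008, 5, 21, 21, 11, 40)  # LastJobLeaseRenewal
failure_seen = datetime(2008, 5, 21, 23, 11, 22)  # read failure logged
lease_seconds = 1200                              # JobLeaseDuration

# Time since the last renewal, and how far past the lease that is.
elapsed = int((failure_seen - last_renewal).total_seconds())
overdue = elapsed - lease_seconds

print(f"elapsed since last renewal: {elapsed} s")  # 7182 s
print(f"past the lease by: {overdue} s")           # 5982 s
```

So by the time the shadow logged the read failure, nearly two hours had passed since the last renewal. That is far beyond the 1200-second lease, which is why the reconnect attempt was refused straight away. For reference, errno 10054 is the Winsock code WSAECONNRESET (connection reset by peer).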
> >
> >
> >
> > Is the scanner somehow stealing the starter port and not allowing the
> > shadow to get information back? What settings can I give the config to
> > get it to completely ignore anything coming from the port scanner? Or
> > could it be something else?
> >
> > Thanks,
> >
> > Alan
> >