
Re: [HTCondor-users] Connection to shadow may be lost



You can ignore these messages in this situation. When an OOM event kills the running job, the starter and shadow daemons both clean up and exit fairly quickly. Most of their work is independent of each other. Here, the shadow is exiting before the starter and closing the network connection between the two.
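
If you want to separate these harmless messages from ones worth investigating, something like the following could help. It is only a rough sketch, not part of HTCondor: the function name and the two labels are made up for illustration, and it assumes the starter log lines look exactly like the excerpt quoted below (timestamp, "(pid:N)", then the message). It marks a "Lost connection to shadow" line as benign when the same starter pid logged an OOM hold earlier in the log.

```python
import re

# Assumed line layout, taken from the quoted excerpt:
#   "11/07/23 09:38:04 (pid:3515857) message text"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$"
)

def classify_lost_connections(lines):
    """Map each starter pid that logged a lost shadow connection to a label.

    "benign-after-oom": the same pid logged an OOM hold earlier in the log.
    "needs-attention":  no OOM hold was seen for that pid before the loss.
    Both labels are hypothetical names chosen for this sketch.
    """
    oom_pids = set()
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip banners, blank lines, anything unparseable
        _timestamp, pid, message = match.groups()
        if message.startswith("Job was held due to OOM event"):
            oom_pids.add(pid)
        elif message.startswith("Lost connection to shadow"):
            result[pid] = (
                "benign-after-oom" if pid in oom_pids else "needs-attention"
            )
    return result
```

You would feed it the lines of a starter log, for example `classify_lost_connections(open(path))` with `path` pointing at wherever your starter logs live, and look only at the pids labelled "needs-attention".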

We should consider making the "connection closed" code in the starter smarter in this situation, so it doesn't print the worrying messages.

 - Jaime

On Dec 12, 2023, at 2:17 PM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Experts,

Recently, while troubleshooting some job failures, we noticed that we are receiving "Connection to shadow may be lost, will test by sending whoami request." very frequently.

In this case the job went into HELD status because of an OOM condition, which is expected, but we don't understand why, after the OOM, the starter complains about communication issues with the submit node. We checked the network stack and couldn't find anything obvious. The log says it will wait 40 minutes (2400 secs) for a reconnect, but the starter exited quickly without any reconnect attempt. We are seeing this "Lost connection to shadow" message many times across different worker nodes.

11/07/23 09:38:04 (pid:3515857) Job was held due to OOM event: Job has gone over memory limit of 17296 megabytes. Peak usage: 19163 megabytes.
11/07/23 09:38:04 (pid:3515857) Got SIGQUIT.  Performing fast shutdown.
11/07/23 09:38:04 (pid:3515857) ShutdownFast all jobs.
11/07/23 09:38:04 (pid:3515857) Got SIGTERM. Performing graceful shutdown.
11/07/23 09:38:04 (pid:3515857) ShutdownGraceful all jobs.
11/07/23 09:38:04 (pid:3515857) Connection to shadow may be lost, will test by sending whoami request.
11/07/23 09:38:04 (pid:3515857) condor_write(): Socket closed when trying to write 37 bytes to <xx.xx.xx.xx:32049>, fd is 22
11/07/23 09:38:04 (pid:3515857) Buf::write(): condor_write() failed
11/07/23 09:38:04 (pid:3515857) i/o error result is 0, errno is 0
11/07/23 09:38:04 (pid:3515857) Lost connection to shadow, waiting 2400 secs for reconnect
11/07/23 09:38:04 (pid:3515857) Process exited, pid=3515943, signal=9
11/07/23 09:38:04 (pid:3515857) Failed to open .job.ad, can't forward ToE tag.
11/07/23 09:38:04 (pid:3515857) RPC error: disconnected from shadow
11/07/23 09:38:04 (pid:3515857) All jobs have exited... starter exiting
11/07/23 09:38:04 (pid:3515857) **** condor_starter (condor_STARTER) pid 3515857 EXITING WITH STATUS 0

I found an old thread [1], but nothing conclusive came out of it.


Regards,
Vikrant