[HTCondor-users] Connection to shadow may be lost



Hello Experts,

Recently, while troubleshooting some job failures, we noticed that we are receiving the message "Connection to shadow may be lost, will test by sending whoami request." very frequently.

In this case the job went into HELD status because of an OOM condition. That part is expected, but we don't understand why the starter complains about communication issues with the submit node right after the OOM. We checked the network stack and couldn't find anything obvious. The log says it will wait 2400 secs (40 minutes) for a reconnect, yet the starter exited immediately without any reconnect attempt. We are seeing this "Lost connection to shadow" message many times across different worker nodes.

11/07/23 09:38:04 (pid:3515857) Job was held due to OOM event: Job has gone over memory limit of 17296 megabytes. Peak usage: 19163 megabytes.
11/07/23 09:38:04 (pid:3515857) Got SIGQUIT. Performing fast shutdown.
11/07/23 09:38:04 (pid:3515857) ShutdownFast all jobs.
11/07/23 09:38:04 (pid:3515857) Got SIGTERM. Performing graceful shutdown.
11/07/23 09:38:04 (pid:3515857) ShutdownGraceful all jobs.
11/07/23 09:38:04 (pid:3515857) Connection to shadow may be lost, will test by sending whoami request.
11/07/23 09:38:04 (pid:3515857) condor_write(): Socket closed when trying to write 37 bytes to <xx.xx.xx.xx:32049>, fd is 22
11/07/23 09:38:04 (pid:3515857) Buf::write(): condor_write() failed
11/07/23 09:38:04 (pid:3515857) i/o error result is 0, errno is 0
11/07/23 09:38:04 (pid:3515857) Lost connection to shadow, waiting 2400 secs for reconnect
11/07/23 09:38:04 (pid:3515857) Process exited, pid=3515943, signal=9
11/07/23 09:38:04 (pid:3515857) Failed to open .job.ad, can't forward ToE tag.
11/07/23 09:38:04 (pid:3515857) RPC error: disconnected from shadow
11/07/23 09:38:04 (pid:3515857) All jobs have exited... starter exiting
11/07/23 09:38:04 (pid:3515857) **** condor_starter (condor_STARTER) pid 3515857 EXITING WITH STATUS 0
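
To check whether these disconnects always follow an OOM hold, a rough sketch like the one below could tally them per starter pid. It assumes every log line has the "MM/DD/YY HH:MM:SS (pid:N) message" layout seen in the excerpt above; the function name summarize_starter_log and the report fields are made up for illustration and are not part of HTCondor.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpt above:
#   MM/DD/YY HH:MM:SS (pid:N) message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$"
)


def summarize_starter_log(lines):
    """Group messages by starter pid and report shadow disconnects.

    Hypothetical helper: returns one dict per pid that logged
    "Lost connection to shadow", noting whether the same pid also
    logged an OOM hold and what reconnect wait it announced.
    """
    by_pid = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            timestamp, pid, msg = match.groups()
            by_pid[int(pid)].append((timestamp, msg))

    report = []
    for pid, events in by_pid.items():
        lost = [m for _, m in events
                if m.startswith("Lost connection to shadow")]
        if not lost:
            continue
        oom_hold = any("held due to OOM" in m for _, m in events)
        wait = re.search(r"waiting (\d+) secs", lost[0])
        report.append({
            "pid": pid,
            "oom_hold": oom_hold,
            "reconnect_wait_secs": int(wait.group(1)) if wait else None,
        })
    return report
```

Feeding it the lines of a starter log from each worker node (for example via open() on whatever path the log lives at on that node) would show how many of the disconnects coincide with an OOM hold and how many do not.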

We found an old thread [1], but nothing conclusive came out of it.

[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2016-February/msg00099.shtml

Regards,
Vikrant